Bug 2081895 - Use the managed resource (and not the manifest) for resource health checks
Summary: Use the managed resource (and not the manifest) for resource health checks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-04 23:49 UTC by W. Trevor King
Modified: 2022-08-10 11:10 UTC
CC List: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:10:18 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 >cvo.log (675.72 KB, text/plain)
2022-05-12 23:30 UTC, W. Trevor King


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 771 0 None open Bug 2081895: lib/resourcebuilder/apiext: Restore check for Established=True CRDs 2022-05-04 23:50:21 UTC
Github openshift cluster-version-operator pull 780 0 None open Bug 2081895: lib/resourcebuilder: Drop Get from check*Health functions 2022-05-13 03:00:23 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:10:42 UTC

Description W. Trevor King 2022-05-04 23:49:55 UTC
I'd accidentally broken this in [1], when I moved the health checks out of the Do methods and began passing typedObject into them.  typedObject is the release manifest, which lacks the status attributes we want the health checks to cover.

Affected resource types: CustomResourceDefinition, DaemonSet, Deployment, Job

[1]: https://github.com/openshift/cluster-version-operator/commit/e8ffccb87821241adfb5c2da9b5053a4a53705a3

Comment 2 liujia 2022-05-10 08:36:56 UTC
Hi Trevor

I'm not quite sure yet what the problem in this bug is. Any suggestions for how to verify it?

Comment 4 W. Trevor King 2022-05-12 23:29:06 UTC
We don't currently have any DaemonSet manifests [1]:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release-nightly@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f107

I'm not sure how we'd block a CRD from being Established=True.  [2] means we can't get the Deployment controller to complain in a way we'll notice unless it is trying to transition between ReplicaSets.  But we can trigger Deployment ReplicaSet updates by updating between two OCP releases, so let's try that.
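
For reference, the Established condition that pr#771 restores the check for can be read straight off a CRD (any CRD name works; clusterversions.config.openshift.io is just an example):

  $ oc get crd clusterversions.config.openshift.io \
      -o jsonpath='{.status.conditions[?(@.type=="Established")].status}{"\n"}'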

With a cluster-bot cluster, 'launch 4.10.14'.  Sanity check:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.10.14

Ask for the update, with --force to break through bug 2083370's lack of precondition retries:

  $ oc adm upgrade --force --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release-nightly@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f107

Wait for it to get going:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-05-11-054135: 9 of 796 done (1% complete)
  ...

Cordon the control-plane, to hopefully break a future Deployment rollout:

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes); do oc adm cordon "${NODE}"; done
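
(When the verification is done, the same loop with 'oc adm uncordon' restores scheduling, so the cluster isn't left cordoned:)

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' nodes); do oc adm uncordon "${NODE}"; done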

Find any stuck deployments (the ones that stick for you will depend on exactly when you cordoned the nodes):

  $ oc get deployments --all-namespaces | grep 0/
  openshift-etcd-operator                            etcd-operator                               0/1     1            0           34m
  openshift-kube-apiserver-operator                  kube-apiserver-operator                     0/1     1            0           33m

Check out one of those:

  $ oc -n openshift-etcd-operator get pods
  NAME                            READY   STATUS    RESTARTS   AGE
  etcd-operator-85984bc56-rdwr8   0/1     Pending   0          5m8s
  $ oc -n openshift-etcd-operator get -o json pods | jq -r '.items[].status.conditions[]'
  {
    "lastProbeTime": null,
    "lastTransitionTime": "2022-05-12T23:00:42Z",
    "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) were unschedulable.",
    "reason": "Unschedulable",
    "status": "False",
    "type": "PodScheduled"
  }

Perfect, the cordons are working.  Should be another ~5m before we hit the 10m cap and the Deployment controller starts complaining:

  $ oc -n openshift-etcd-operator get -o json deployment etcd-operator | jq -r '.spec.progressDeadlineSeconds'
  600
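
Rather than polling by hand, one option (a convenience, not part of the original steps) is to let 'oc wait' block until the Deployment controller flips Progressing to False:

  $ oc -n openshift-etcd-operator wait deployment/etcd-operator \
      --for=condition=Progressing=False --timeout=15m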

CVO at this point is still happily moving past the etcd-operator Deployment manifest ("the Deployment controller is happy with it"):

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'deployment.*etcd-operator'
  I0512 23:00:39.696737       1 sync_worker.go:928] Running sync for deployment "openshift-etcd-operator/etcd-operator" (76 of 796)
  I0512 23:00:39.995169       1 sync_worker.go:943] Done syncing for deployment "openshift-etcd-operator/etcd-operator" (76 of 796)
  I0512 23:07:50.279472       1 sync_worker.go:928] Running sync for deployment "openshift-etcd-operator/etcd-operator" (76 of 796)
  I0512 23:07:50.479208       1 sync_worker.go:943] Done syncing for deployment "openshift-etcd-operator/etcd-operator" (76 of 796)

Instead, it's getting stuck on ClusterOperator versions not getting bumped (because the operators aren't running):

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'Result of work'
  I0512 23:07:15.398323       1 task_graph.go:546] Result of work: [Cluster operator etcd is updating versions Cluster operator kube-apiserver is updating versions]

Ok, 10m up, now the Deployment controller is setting ProgressDeadlineExceeded:

  $ oc -n openshift-etcd-operator get -o json deployment etcd-operator | jq -r '.status.conditions[]'
  {
    "lastTransitionTime": "2022-05-12T23:00:40Z",
    "lastUpdateTime": "2022-05-12T23:00:40Z",
    "message": "Deployment does not have minimum availability.",
    "reason": "MinimumReplicasUnavailable",
    "status": "False",
    "type": "Available"
  }
  {
    "lastTransitionTime": "2022-05-12T23:10:44Z",
    "lastUpdateTime": "2022-05-12T23:10:44Z",
    "message": "ReplicaSet \"etcd-operator-85984bc56\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }

Waiting for the CVO to give up on its current sync round and come in again...

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --follow | grep 'Result of work\|deployment.*etcd-operator'
  ...time passes, while we are --follow ing...
  I0512 23:14:44.354658       1 task_graph.go:546] Result of work: [Cluster operator etcd is updating versions Cluster operator kube-apiserver is updating versions]
  I0512 23:15:51.225574       1 sync_worker.go:928] Running sync for deployment "openshift-etcd-operator/etcd-operator" (76 of 796)
  E0512 23:15:51.425918       1 task.go:112] error running apply for deployment "openshift-etcd-operator/etcd-operator" (76 of 796): deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.)
  E0512 23:16:05.735335       1 task.go:112] error running apply for deployment "openshift-etcd-operator/etcd-operator" (76 of 796): deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.)
  E0512 23:16:20.009607       1 task.go:112] error running apply for deployment "openshift-etcd-operator/etcd-operator" (76 of 796): deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.)
  ...

Something like [3] would get us failing faster here, if I could get that landed.  Also by this point, KubePodNotReady and KubeDeploymentReplicasMismatch are firing, so we weren't all that exposed by the non-functional Deployment health checks in the CVO.  Back to the --follow logs:

  ...
  E0512 23:22:44.448250       1 task.go:112] error running apply for deployment "openshift-etcd-operator/etcd-operator" (76 of 796): deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.)
  I0512 23:22:45.250598       1 task_graph.go:546] Result of work: [deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.) deployment openshift-kube-apiserver-operator/kube-apiserver-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "kube-apiserver-operator-5d8bd9b99c" has timed out progressing.)]
  I0512 23:22:45.250629       1 sync_worker.go:1100] Update error 76 of 796: WorkloadNotAvailable deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.) (*errors.errorString: deployment openshift-etcd-operator/etcd-operator is not available and not progressing; updated replicas=1 of 1, available replicas=0 of 1)
  * deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.)
  ^C
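
(As an aside, one way to confirm that the KubePodNotReady / KubeDeploymentReplicasMismatch alerts mentioned above are firing, assuming you are logged in with a token that is allowed through the alertmanager-main route in openshift-monitoring, is:)

  $ HOST="$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')"
  $ curl -sk -H "Authorization: Bearer $(oc whoami -t)" "https://${HOST}/api/v2/alerts" | jq -r '.[].labels.alertname' | sort | uniq -c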

Checking on user-facing reporting:

  $ oc adm upgrade
  info: An upgrade is in progress. Unable to apply 4.11.0-0.nightly-2022-05-11-054135: an unknown error has occurred: MultipleErrors
  ...

Not great.

  $  oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-05-12T22:30:10Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-05-12T22:59:19Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-05-11-054135" image="quay.io/openshift-release-dev/ocp-release-nightly@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f107"
  2022-05-12T22:50:55Z Available=True : Done applying 4.10.14
  2022-05-12T23:25:00Z Failing=False : 
  2022-05-12T23:00:01Z Progressing=True : Working towards 4.11.0-0.nightly-2022-05-11-054135: 103 of 796 done (12% complete)
  2022-05-12T23:00:15Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
  2022-05-12T23:03:50Z Upgradeable=False PoolUpdating: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

But the Deployment complaints in:

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep 'Result of work'
  I0512 23:07:15.398323       1 task_graph.go:546] Result of work: [Cluster operator etcd is updating versions Cluster operator kube-apiserver is updating versions]
  I0512 23:14:44.354658       1 task_graph.go:546] Result of work: [Cluster operator etcd is updating versions Cluster operator kube-apiserver is updating versions]
  I0512 23:22:45.250598       1 task_graph.go:546] Result of work: [deployment openshift-etcd-operator/etcd-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "etcd-operator-85984bc56" has timed out progressing.) deployment openshift-kube-apiserver-operator/kube-apiserver-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "kube-apiserver-operator-5d8bd9b99c" has timed out progressing.)]

are probably sufficient to verify this bug, and we can revisit failure reporting in follow-up work.

[1]: https://mirror.openshift.com/pub/openshift-v4/amd64/clients/ocp-dev-preview/4.11.0-0.nightly-2022-05-11-054135/
[2]: https://github.com/kubernetes/kubernetes/issues/106054 
[3]: https://github.com/openshift/cluster-version-operator/pull/573

Comment 5 W. Trevor King 2022-05-12 23:30:27 UTC
Created attachment 1879218 [details]
oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 >cvo.log

I'm attaching my CVO logs, in case that's helpful for digging into the reporting issues.

Comment 6 W. Trevor King 2022-05-12 23:32:44 UTC
(In reply to W. Trevor King from comment #4)
> We don't currently have any DaemonSet manifests [1]:
> 
>   $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release-nightly@sha256:5967359c2bfee0512030418af0f69faa3fa74a81a89ad64a734420e020e7f107

Oops, I got distracted, and forgot to mention:

  $ grep -rho 'kind: D.*' manifests  | sort | uniq -c
     53 kind: Deployment
      3 kind: DNS
      1 kind: DNSRecord
      9 kind: DockerImage

(with no DaemonSet hits).

Comment 7 liujia 2022-05-13 01:39:29 UTC
Going through comment 4, it's still not clear to me what the exact problem here is, because the deployment failure log appears before the fix too.

Following the same reproduce/verify steps, I also ran some tests yesterday on 4.11.0-0.nightly-2022-05-08-193101 (without the fix) and 4.11.0-0.nightly-2022-05-11-054135 (with the fix). From the CVO logs, I see no difference in the complaints about Deployment rollout failure.

```
# cat new.log |grep "error running apply for deployment"|tail -n2
E0512 09:33:11.424095       1 task.go:112] error running apply for deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (128 of 796): deployment openshift-kube-scheduler-operator/openshift-kube-scheduler-operator is Progressing=False: ProgressDeadlineExceeded: ReplicaSet "openshift-kube-scheduler-operator-596f6d6588" has timed out progressing.
E0512 09:33:13.764261       1 task.go:112] error running apply for deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (119 of 796): deployment openshift-kube-controller-manager-operator/kube-controller-manager-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "kube-controller-manager-operator-6d7744b5f6" has timed out progressing.)
```
```
# cat old.log |grep "error running apply for deployment"|tail -n2
E0512 09:32:08.247825       1 task.go:112] error running apply for deployment "openshift-kube-apiserver-operator/kube-apiserver-operator" (107 of 795): deployment openshift-kube-apiserver-operator/kube-apiserver-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "kube-apiserver-operator-7b8788dc57" has timed out progressing.)
E0512 09:32:29.710050       1 task.go:112] error running apply for deployment "openshift-kube-apiserver-operator/kube-apiserver-operator" (107 of 795): deployment openshift-kube-apiserver-operator/kube-apiserver-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "kube-apiserver-operator-7b8788dc57" has timed out progressing.)
```
Not sure if I missed anything; pasting both logs for Trevor's further checking.

Comment 12 liujia 2022-05-23 10:00:42 UTC
1) pr#771
For the Deployment resource, per comment 7, it works well with the fix, but due to the issue in pr#780 we could not confirm "CVO not complaining about Deployment rollout failure" before the fix in #771.

2) pr#780 
Because it lands after #771, we still could not check "CVO not complaining about Deployment rollout failure" with #780 on its own. So this is just a regression test, checking "CVO complaining about Deployment rollout failure" with both PRs applied.

# ./oc -n openshift-cluster-version logs cluster-version-operator-cd46fcfcb-r7fxc|grep "error running apply for deployment"|tail -n1
E0523 09:57:12.294259       1 task.go:117] error running apply for deployment "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator" (155 of 802): deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-cloud-controller-manager-operator-76c598d447" has timed out progressing.)

# ./oc adm upgrade
info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-05-20-213928: 155 of 802 done (19% complete), waiting on cloud-controller-manager

I think we can verify the bug now. Trevor, do you think we need more testing for it?

Comment 13 W. Trevor King 2022-05-24 04:53:05 UTC
I'm fine with comment 12 as verification for this bug.

Comment 15 errata-xmlrpc 2022-08-10 11:10:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

