Description of problem:
During a 1000-cluster SNO ZTP deployment test, about 1% of the clusters get stuck on the subscription policy because of the following error in the operator Subscriptions:

  - message: 'error using catalog community-operators (in namespace openshift-marketplace): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

Version-Release number of selected component (if applicable):

How reproducible:
100% in scale tests

Steps to Reproduce:
1. SNO deployment with the DU profile at scale (50 or 100 clusters per hour)

Actual results:
The subscription policy is non-compliant.

Expected results:
The subscription policy should become compliant.

Additional info:
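When triaging a scale run, it can help to separate clusters hitting this DNS failure from clusters with other resolution failures. A minimal sketch in shell, assuming the condition's `type`, `reason`, and `message` have already been pulled out of each Subscription (for example from `oc get subscription -o yaml`); `is_catalog_dns_failure` is a hypothetical helper, not part of `oc` or OLM.

```shell
#!/bin/sh
# Hypothetical triage helper (assumption, not an OLM API): returns 0 if a
# Subscription condition looks like the catalog DNS failure in this report.
# Arguments: condition type, condition reason, condition message.
is_catalog_dns_failure() {
  cond_type=$1
  cond_reason=$2
  cond_message=$3
  # Only ResolutionFailed conditions with reason ErrorPreventedResolution qualify.
  [ "$cond_type" = "ResolutionFailed" ] || return 1
  [ "$cond_reason" = "ErrorPreventedResolution" ] || return 1
  # Match the resolver error text seen above: a lookup of a catalog Service
  # in openshift-marketplace that failed with "server misbehaving".
  case $cond_message in
    *"openshift-marketplace.svc"*"server misbehaving"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Matching on message text is brittle across OLM versions, so this is only a triage aid for sorting affected clusters, not a health check.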
1. Create a cluster that contains the fix PR.

mac:~ jianzhang$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-05-061544|grep olm
W0505 15:42:57.237937   17035 helpers.go:151] Defaulting of registry auth file to "${HOME}/.docker/config.json" is deprecated. The default will be switched to podman config locations in the future version.
  operator-lifecycle-manager    https://github.com/openshift/operator-framework-olm    5d74cef25c663ff581abdb87fa1a94fe7a222144
  operator-registry             https://github.com/openshift/operator-framework-olm    5d74cef25c663ff581abdb87fa1a94fe7a222144
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-05-061544   True        False         2m46s   Cluster version is 4.11.0-0.nightly-2022-05-05-061544

2. Create a CatalogSource without the `updateStrategy` field.

mac:~ jianzhang$ cat cs-bug.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/olmqe/learn-operator-index:v2
  displayName: Bug Operators
  publisher: OLM QE
mac:~ jianzhang$ oc create -f cs-bug.yaml
catalogsource.operators.coreos.com/bug-operator created

3. Delete its ServiceAccount (SA).

mac:~ jianzhang$ oc get sa
NAME                   SECRETS   AGE
bug-operator           2         3m39s
builder                2         31m
certified-operators    2         36m
community-operators    2         36m
default                2         40m
deployer               2         31m
marketplace-operator   2         40m
redhat-marketplace     2         36m
redhat-operators       2         36m
mac:~ jianzhang$ oc delete sa bug-operator
serviceaccount "bug-operator" deleted

4. The SA is recreated as expected. LGTM, verifying it.

mac:~ jianzhang$ oc get sa
NAME                   SECRETS   AGE
bug-operator           2         7s
builder                2         31m
certified-operators    2         37m
community-operators    2         37m
default                2         41m
deployer               2         31m
marketplace-operator   2         40m
redhat-marketplace     2         37m
redhat-operators       2         37m
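The manual check in steps 3 and 4 (delete the SA, then confirm it comes back) can be scripted as a polling loop. A minimal sketch, assuming a POSIX shell: `wait_for` is a hypothetical helper that retries a command once per second until it succeeds or a timeout expires, and the commented usage assumes the `bug-operator` CatalogSource from step 2 and cluster access via `oc`.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not an oc subcommand): retry a command
# once per second until it exits 0 or LIMIT seconds have elapsed.
# Usage: wait_for LIMIT_SECONDS COMMAND [ARGS...]
wait_for() {
  limit=$1
  shift
  elapsed=0
  while [ "$elapsed" -lt "$limit" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  # One final attempt after the last sleep; its exit status is returned.
  "$@" >/dev/null 2>&1
}

# Example (requires cluster access): delete the CatalogSource's SA and
# expect OLM to recreate it within 60 seconds.
# oc -n openshift-marketplace delete sa bug-operator
# wait_for 60 oc -n openshift-marketplace get sa bug-operator \
#   && echo "ServiceAccount recreated" \
#   || echo "ServiceAccount not recreated"
```

Polling `oc get sa` works because it exits non-zero while the SA is absent (NotFound); the 60-second limit is an arbitrary choice for illustration.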
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069