Bug 2074612 - OLM failed to recreate the SA for a CatalogSource without a poll interval
Summary: OLM failed to recreate the SA for a CatalogSource without a poll interval
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Alexander Greene
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks: 2080609
Reported: 2022-04-12 16:17 UTC by jun
Modified: 2022-08-19 01:57 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The CheckRegistryServer function used by grpc CatalogSources did not confirm that the serviceAccount associated with the CatalogSource exists. Consequence: An unhealthy CatalogSource with no serviceAccount could exist. Fix: Update the grpc CheckRegistryServer function to check whether the serviceAccount exists, so that a missing serviceAccount is recreated. Result: OLM recreates serviceAccounts owned by grpc CatalogSources if they do not exist.
Clone Of:
Environment:
Last Closed: 2022-08-10 11:06:30 UTC
Target Upstream Version:
Embargoed:
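
The Doc Text above can be illustrated with a small sketch. This is not OLM's code: the `cluster` type, `healthy`, and `reconcile` are hypothetical names, and the sketch assumes the old check looked only at the registry pod and service. It shows why a health check that ignores the serviceAccount never triggers recreation, while a check that includes it does.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for the API server.
// It only records which named resources currently exist.
type cluster struct {
	serviceAccounts map[string]bool
	services        map[string]bool
	pods            map[string]bool
}

// newCluster returns a cluster where the catalog's resources all exist.
func newCluster(name string) *cluster {
	return &cluster{
		serviceAccounts: map[string]bool{name: true},
		services:        map[string]bool{name: true},
		pods:            map[string]bool{name: true},
	}
}

// healthy plays the role of the registry health check.
// With checkSA=false it models the assumed old behaviour, where the
// serviceAccount is ignored; with checkSA=true it models the fix.
func healthy(c *cluster, name string, checkSA bool) bool {
	if !c.pods[name] || !c.services[name] {
		return false
	}
	return !checkSA || c.serviceAccounts[name]
}

// reconcile recreates the catalog's resources only when the health
// check fails, and reports whether it recreated anything.
func reconcile(c *cluster, name string, checkSA bool) bool {
	if healthy(c, name, checkSA) {
		return false
	}
	c.serviceAccounts[name] = true
	c.services[name] = true
	c.pods[name] = true
	return true
}

func main() {
	c := newCluster("bug-operator")
	delete(c.serviceAccounts, "bug-operator")
	fmt.Println("old check recreates:", reconcile(c, "bug-operator", false))
	fmt.Println("new check recreates:", reconcile(c, "bug-operator", true))
}
```

In this model the first call leaves the serviceAccount missing because the check still reports healthy, which matches the Consequence described in the Doc Text.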


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift operator-framework-olm pull 294 | 0 | None | open | Bug 2074612: Fix GRPC CheckRegistryServer function (#2756) | 2022-04-29 14:20:00 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:06:50 UTC

Description jun 2022-04-12 16:17:35 UTC
Description of problem:
During a 1000-cluster SNO ZTP deployment test, about 1% of the clusters get stuck on the subscription policy because operator subscriptions report errors like this:


- message: 'error using catalog community-operators (in namespace openshift-marketplace):
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc
      on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

Version-Release number of selected component (if applicable):


How reproducible:
100% in scale test

Steps to Reproduce:
1. SNO deployment with DU profile at scale (50 clusters or 100 clusters per hour)

Actual results:
Subscription policy is non-compliant

Expected results:
Subscription policy should become compliant

Additional info:

Comment 31 Jian Zhang 2022-05-05 09:24:05 UTC
1. Create a cluster that includes the fix PR.
mac:~ jianzhang$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-05-061544|grep olm
W0505 15:42:57.237937   17035 helpers.go:151] Defaulting of registry auth file to "${HOME}/.docker/config.json" is deprecated. The default will be switched to podman config locations in the future version.
  operator-lifecycle-manager                     https://github.com/openshift/operator-framework-olm                         5d74cef25c663ff581abdb87fa1a94fe7a222144
  operator-registry                              https://github.com/openshift/operator-framework-olm                         5d74cef25c663ff581abdb87fa1a94fe7a222144

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-05-061544   True        False         2m46s   Cluster version is 4.11.0-0.nightly-2022-05-05-061544

2. Create a CatalogSource without an `updateStrategy`.
mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/olmqe/learn-operator-index:v2
  displayName: Bug Operators
  publisher: OLM QE
mac:~ jianzhang$ oc create -f cs-bug.yaml 
catalogsource.operators.coreos.com/bug-operator created

3. Delete its SA.
mac:~ jianzhang$ oc get sa
NAME                   SECRETS   AGE
bug-operator           2         3m39s
builder                2         31m
certified-operators    2         36m
community-operators    2         36m
default                2         40m
deployer               2         31m
marketplace-operator   2         40m
redhat-marketplace     2         36m
redhat-operators       2         36m
mac:~ jianzhang$ oc delete sa bug-operator
serviceaccount "bug-operator" deleted

4. The SA is recreated as expected. LGTM, marking as verified.
mac:~ jianzhang$ oc get sa
NAME                   SECRETS   AGE
bug-operator           2         7s
builder                2         31m
certified-operators    2         37m
community-operators    2         37m
default                2         41m
deployer               2         31m
marketplace-operator   2         40m
redhat-marketplace     2         37m
redhat-operators       2         37m
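
The manual check in steps 3 and 4 could also be scripted. The following is a minimal sketch, not part of the original verification: `saExists` and `waitFor` are hypothetical helpers, the namespace is passed explicitly rather than taken from the current project, and the attempt count and delay are arbitrary. It relies only on `oc get` and `oc delete` exiting non-zero on failure.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// saExists shells out to `oc get sa`; oc exits non-zero when the
// serviceAccount cannot be found.
func saExists(namespace, name string) bool {
	return exec.Command("oc", "get", "sa", name, "-n", namespace).Run() == nil
}

// waitFor calls check up to attempts times, sleeping delay after each
// failed try, and reports whether check ever returned true.
func waitFor(check func() bool, attempts int, delay time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if check() {
			return true
		}
		time.Sleep(delay)
	}
	return false
}

func main() {
	ns, name := "openshift-marketplace", "bug-operator"

	// Step 3: delete the serviceAccount owned by the CatalogSource.
	if err := exec.Command("oc", "delete", "sa", name, "-n", ns).Run(); err != nil {
		fmt.Println("delete failed:", err)
		return
	}

	// Step 4: poll until OLM has recreated it, or give up.
	ok := waitFor(func() bool { return saExists(ns, name) }, 30, 2*time.Second)
	fmt.Println("service account recreated:", ok)
}
```

Running it requires a logged-in `oc` session against a cluster where the `bug-operator` CatalogSource from step 2 exists.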

Comment 33 errata-xmlrpc 2022-08-10 11:06:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

