+++ This bug was initially created as a clone of Bug #2108320 +++

See https://github.com/openshift/os/pull/898

A recent PR in the MCO, openshift/machine-config-operator#3243, tipped things over the edge, and we now see failures a lot more often. For example, see https://bugzilla.redhat.com/show_bug.cgi?id=2104978

--- Additional comment from skumari on 2022-07-19 13:14:55 UTC ---

*** Bug 2108488 has been marked as a duplicate of this bug. ***
Verified on RHCOS 411.86.202207210724-0. This, however, has not made it into a nightly OCP build yet.

[core@cosa-devsh ~]$ cat test.sh
#!/bin/bash
set -euo pipefail
# https://github.com/coreos/rpm-ostree/pull/3523/commits/0556152adb14a8e1cdf6c5d6f234aacbe8dd4e3f
for x in $(seq 100); do rpm-ostree status >/dev/null; done
echo ok
[core@cosa-devsh ~]$ ./test.sh
ok
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● 1994ffeef78d96e6af89e03552214df06465d75e3b4f8a4eb37aa6582814c00e
              Version: 411.86.202207210724-0 (2022-07-21T07:27:48Z)
[core@cosa-devsh ~]$ systemctl status rpm-ostreed
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor >
  Drop-In: /usr/lib/systemd/system/rpm-ostreed.service.d
           └─startlimit.conf
   Active: active (running) since Thu 2022-07-21 13:45:16 UTC; 52s ago
     Docs: man:rpm-ostree(1)
 Main PID: 2059 (rpm-ostree)
   Status: "clients=0; idle exit in 52 seconds"
    Tasks: 12 (limit: 5559)
   Memory: 8.1M
   CGroup: /system.slice/rpm-ostreed.service
           └─2059 /usr/bin/rpm-ostree start-daemon

Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: client(id:cli dbus:1.259 unit:sess>
Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: In idle state; will auto-exit in 6>
Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: Allowing active client :1.261 (uid>
Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: client(id:cli dbus:1.261 unit:sess>
Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: client(id:cli dbus:1.261 unit:sess>
Jul 21 13:45:42 cosa-devsh rpm-ostree[2059]: In idle state; will auto-exit in 6>
Jul 21 13:45:53 cosa-devsh rpm-ostree[2059]: Allowing active client :1.263 (uid>
Jul 21 13:45:53 cosa-devsh rpm-ostree[2059]: client(id:cli dbus:1.263 unit:sess>
Jul 21 13:45:53 cosa-devsh rpm-ostree[2059]: client(id:cli dbus:1.263 unit:sess>
Jul 21 13:45:53 cosa-devsh rpm-ostree[2059]: In idle state; will auto-exit in 6>

[core@cosa-devsh ~]$ systemctl cat rpm-ostreed
# /usr/lib/systemd/system/rpm-ostreed.service
[Unit]
Description=rpm-ostree System Management Daemon
Documentation=man:rpm-ostree(1)
ConditionPathExists=/ostree
RequiresMountsFor=/boot

[Service]
Type=dbus
BusName=org.projectatomic.rpmostree1
# To use the read-only sysroot bits
MountFlags=slave
# We have no business accessing /var/roothome or /var/home. In general
# the ostree design clearly avoids touching those, but since systemd offers
# us easy tools to toggle on protection, let's use them. In the future
# it'd be nice to do something like using DynamicUser=yes for the main service,
# and have a system rpm-ostreed-transaction.service that runs privileged
# but as a subprocess.
ProtectHome=true
# Explicitly list paths here which we should never access. The initial
# entry here ensures that the skopeo process we fork won't interact with
# application containers.
InaccessiblePaths=/var/lib/containers
NotifyAccess=main
ExecStart=/usr/bin/rpm-ostree start-daemon
ExecReload=/usr/bin/rpm-ostree reload

# /usr/lib/systemd/system/rpm-ostreed.service.d/startlimit.conf
[Unit]
# Work around for lack of https://github.com/coreos/rpm-ostree/pull/3523/commit>
# on older RHEL
StartLimitBurst=1000
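For anyone re-verifying on other builds, the raised limit can also be read back directly from systemd. A quick sketch (using the standard `systemctl show -p` property query; the expected value of 1000 comes from the startlimit.conf drop-in shown above):

#!/bin/bash
set -euo pipefail
# Expect StartLimitBurst=1000 from the startlimit.conf drop-in.
systemctl show rpm-ostreed.service -p StartLimitBurst
# Re-run the reproducer loop; before the workaround, the rapid
# idle-exit/dbus-reactivation cycle could trip systemd's default start
# limit and leave the unit failed.
for x in $(seq 100); do rpm-ostree status >/dev/null; done
# is-failed exits 0 only if the unit is actually in a failed state.
systemctl is-failed rpm-ostreed.service || echo "unit healthy: ok"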
I don't think we technically need to update the boot images for this - just machine-os-content. The firstboot path may be a bit less reliable, but we landed retry logic in the MCO.
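For illustration only, the retry approach looks roughly like this in shell form (a hypothetical sketch; the actual retries are Go code in openshift/machine-config-operator, and the attempt count and backoff values here are made up):

#!/bin/bash
set -euo pipefail
# Hypothetical retry-with-backoff wrapper sketching the MCO's approach;
# not the actual MCO code.
for attempt in 1 2 3 4 5; do
    if rpm-ostree status >/dev/null; then
        echo ok
        exit 0
    fi
    echo "rpm-ostreed not ready (attempt ${attempt}), retrying" >&2
    sleep $((attempt * 5))
done
echo "giving up after 5 attempts" >&2
exit 1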
From the summary in https://bugzilla.redhat.com/show_bug.cgi?id=2104978:

"So it looks like rpm-ostreed didn't start yet (but eventually was successful)"

Since it was eventually successful, I agree with Colin that updating machine-os-content should be enough, since the issue self-resolves.
Verified on 4.11.0-rc.5

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-rc.5   True        False         107s    Cluster version is 4.11.0-rc.5

$ oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ci-ln-mlj7g4t-72292-7jqbd-master-0          Ready    master   22m   v1.24.0+9546431
ci-ln-mlj7g4t-72292-7jqbd-master-1          Ready    master   22m   v1.24.0+9546431
ci-ln-mlj7g4t-72292-7jqbd-master-2          Ready    master   22m   v1.24.0+9546431
ci-ln-mlj7g4t-72292-7jqbd-worker-a-plz2t    Ready    worker   12m   v1.24.0+9546431
ci-ln-mlj7g4t-72292-7jqbd-worker-b-4zdxv    Ready    worker   12m   v1.24.0+9546431

$ oc debug node/ci-ln-mlj7g4t-72292-7jqbd-worker-a-plz2t
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/ci-ln-mlj7g4t-72292-7jqbd-worker-a-plz2t-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c978380274ed551b3d6a8ca53ab2fc1408bfad00b8c235cc7dbe523dbc251d8
              CustomOrigin: Managed by machine-config-operator
                   Version: 411.86.202207210724-0 (2022-07-21T07:27:48Z)
sh-4.4# cat test.sh
#!/bin/bash
set -euo pipefail
# https://github.com/coreos/rpm-ostree/pull/3523/commits/0556152adb14a8e1cdf6c5d6f234aacbe8dd4e3f
for x in $(seq 100); do rpm-ostree status >/dev/null; done
echo ok
sh-4.4# chmod +x test.sh
sh-4.4# ./test.sh
ok
sh-4.4# systemctl cat rpm-ostreed.service
# /usr/lib/systemd/system/rpm-ostreed.service
[Unit]
Description=rpm-ostree System Management Daemon
Documentation=man:rpm-ostree(1)
ConditionPathExists=/ostree
RequiresMountsFor=/boot

[Service]
Type=dbus
BusName=org.projectatomic.rpmostree1
# To use the read-only sysroot bits
MountFlags=slave
# We have no business accessing /var/roothome or /var/home. In general
# the ostree design clearly avoids touching those, but since systemd offers
# us easy tools to toggle on protection, let's use them. In the future
# it'd be nice to do something like using DynamicUser=yes for the main service,
# and have a system rpm-ostreed-transaction.service that runs privileged
# but as a subprocess.
ProtectHome=true
# Explicitly list paths here which we should never access. The initial
# entry here ensures that the skopeo process we fork won't interact with
# application containers.
InaccessiblePaths=/var/lib/containers
NotifyAccess=main
ExecStart=/usr/bin/rpm-ostree start-daemon
ExecReload=/usr/bin/rpm-ostree reload

# /usr/lib/systemd/system/rpm-ostreed.service.d/startlimit.conf
[Unit]
# Work around for lack of https://github.com/coreos/rpm-ostree/pull/3523/commits/0556152adb14a8e1cdf6c5d6f234aacbe8dd4e3f
# on older RHEL
StartLimitBurst=1000
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
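For re-verification on larger clusters, the same loop can be driven non-interactively across every worker (a sketch, assuming cluster-admin access; it reuses the test loop from the transcript above):

#!/bin/bash
set -euo pipefail
# Run the reproducer on each worker node via a debug pod.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    echo "== ${node} =="
    oc debug "${node}" -- chroot /host /bin/bash -c \
        'for x in $(seq 100); do rpm-ostree status >/dev/null; done; echo ok'
done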
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069