Description of problem:

When an MCD (machine-config-daemon) pod is restarted, the rpm-ostreed service on the node hits its systemd start limit ("start-limit-hit"), the service fails to start, and the MachineConfigPool for this node becomes degraded.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.11.0-0.nightly-2022-07-16-020951

Platform (AWS, VSphere, Metal, etc.):

AWS (the node names in the logs below are us-east-2 EC2 internal addresses)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):

Yes

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Delete a machine-config-daemon pod (worker pool); see the command sketch at the end of this report

Actual results:

The worker pool is degraded, and we can see this log in the new machine-config-daemon pod:

I0718 15:55:37.137472 1755 start.go:112] Version: v4.11.0-202207132217.p0.ge722bb7.assembly.stream-dirty (e722bb791349c5f74548633d5c3407c414ece9bf)
I0718 15:55:37.254275 1755 start.go:125] Calling chroot("/rootfs")
I0718 15:55:37.281700 1755 update.go:1976] Running: systemctl start rpm-ostreed
I0718 15:55:38.115837 1755 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0718 15:55:38.186795 1755 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0718 15:55:38.223195 1755 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08ca66d84e70c73b09125eedcdf8183e37454839b3cbf8b7489829b29455a5e8 (411.86.202207150124-0)
I0718 15:55:38.305993 1755 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0718 15:55:38.307038 1755 start.go:189] overriding kubernetes api to https://api-int.sregidor4-rc2.qe.devcluster.openshift.com:6443
I0718 15:55:38.307648 1755 metrics.go:106] Registering Prometheus metrics
I0718 15:55:38.307740 1755 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0718 15:55:38.308999 1755 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
I0718 15:55:38.309141 1755 update.go:1991] Starting to manage node: ip-10-0-52-69.us-east-2.compute.internal
I0718 15:55:38.315672 1755 rpm-ostree.go:324] Running captured: rpm-ostree status
I0718 15:55:38.355488 1755 daemon.go:1220] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08ca66d84e70c73b09125eedcdf8183e37454839b3cbf8b7489829b29455a5e8
              CustomOrigin: Managed by machine-config-operator
                   Version: 411.86.202207150124-0 (2022-07-15T01:27:45Z)
I0718 15:55:38.356987 1755 coreos.go:95] CoreOS aleph version: mtime=2022-06-30 15:10:12.034 +0000 UTC build=411.86.202206301504-0 imgid=rhcos-411.86.202206301504-0-qemu.x86_64.qcow2
I0718 15:55:38.357224 1755 coreos.go:112] Ignition provisioning: time=2022-07-18T07:43:16Z
I0718 15:55:38.357243 1755 rpm-ostree.go:324] Running captured: journalctl --list-boots
I0718 15:55:38.362993 1755 daemon.go:1229] journalctl --list-boots:
-15 176434f3d44c439d8328e6681aa6f21a Mon 2022-07-18 07:41:46 UTC—Mon 2022-07-18 07:44:20 UTC
-14 64a8d5f5857d4eb183af8189ed65472b Mon 2022-07-18 07:44:33 UTC—Mon 2022-07-18 11:48:28 UTC
-13 83359db7e8cb46debbc629e0bb4c5ccc Mon 2022-07-18 11:48:40 UTC—Mon 2022-07-18 11:51:01 UTC
-12 ffe41b05f1f04ad39e66e459da025822 Mon 2022-07-18 11:51:12 UTC—Mon 2022-07-18 11:53:21 UTC
-11 168cf7fe963a41fb98f8b76c5c24ff13 Mon 2022-07-18 11:53:34 UTC—Mon 2022-07-18 12:13:58 UTC
-10 c53138eb943c4659b0e90e33ece52990 Mon 2022-07-18 12:14:09 UTC—Mon 2022-07-18 12:16:18 UTC
 -9 fa1ffb82bc3540b5b4a37598584e1279 Mon 2022-07-18 12:16:30 UTC—Mon 2022-07-18 12:41:22 UTC
 -8 88e513c8e8d64478b7bf0773d9f7b7b5 Mon 2022-07-18 12:41:35 UTC—Mon 2022-07-18 12:43:57 UTC
 -7 a5a700ac34044d03a08a9a0d9b709ea9 Mon 2022-07-18 12:44:09 UTC—Mon 2022-07-18 12:46:16 UTC
 -6 37c91c065bae4f75add01ad93e9cd190 Mon 2022-07-18 12:46:29 UTC—Mon 2022-07-18 13:02:15 UTC
 -5 baa108f90c8e4b4ca38cefb97d582a54 Mon 2022-07-18 13:02:27 UTC—Mon 2022-07-18 13:05:37 UTC
 -4 71655a9131c247bbbcd1dd4eb55cf6ee Mon 2022-07-18 13:05:50 UTC—Mon 2022-07-18 13:18:15 UTC
 -3 d4a9af45bef64fa6b35f98e591ba31dd Mon 2022-07-18 13:18:27 UTC—Mon 2022-07-18 13:27:36 UTC
 -2 ee3d4c69c8ab4a41a6ed1c50f3984152 Mon 2022-07-18 13:27:48 UTC—Mon 2022-07-18 15:44:03 UTC
 -1 62bfa5b964da4e47856cb92fe892b9ad Mon 2022-07-18 15:44:14 UTC—Mon 2022-07-18 15:55:11 UTC
  0 74ccfd95a7ae4aafb2b57283563d2ef8 Mon 2022-07-18 15:55:23 UTC—Mon 2022-07-18 15:55:38 UTC
I0718 15:55:38.363013 1755 rpm-ostree.go:324] Running captured: systemctl list-units --state=failed --no-legend
I0718 15:55:38.367886 1755 daemon.go:1244] systemd service state: OK
I0718 15:55:38.367904 1755 daemon.go:909] Starting MachineConfigDaemon
I0718 15:55:38.367987 1755 daemon.go:916] Enabling Kubelet Healthz Monitor
I0718 15:55:39.323661 1755 daemon.go:451] Node ip-10-0-52-69.us-east-2.compute.internal is not labeled node-role.kubernetes.io/master
I0718 15:55:39.392578 1755 daemon.go:1172] Current config: rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:39.392599 1755 daemon.go:1173] Desired config: rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:39.392604 1755 daemon.go:1175] state: Working
I0718 15:55:39.395877 1755 update.go:1976] Running: rpm-ostree cleanup -r
Deployments unchanged.
I0718 15:55:39.766381 1755 update.go:1991] Disk currentConfig rendered-worker-cfc6f4620f05181a7b1662ad9984feb7 overrides node's currentConfig annotation rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:39.768597 1755 daemon.go:1467] Validating against pending config rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:39.768621 1755 rpm-ostree.go:324] Running captured: rpm-ostree kargs
E0718 15:55:39.846555 1755 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-cfc6f4620f05181a7b1662ad9984feb7: error running rpm-ostree kargs: exit status 1
Job for rpm-ostreed.service failed. See "systemctl status rpm-ostreed.service" and "journalctl -xe" for details.
error: exit status: 1
I0718 15:55:41.858639 1755 daemon.go:500] Transitioned from state: Working -> Degraded
I0718 15:55:41.864888 1755 daemon.go:1172] Current config: rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:41.864903 1755 daemon.go:1173] Desired config: rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:41.864906 1755 daemon.go:1175] state: Degraded
I0718 15:55:41.867789 1755 update.go:1976] Running: rpm-ostree cleanup -r
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit-hit) since Mon 2022-07-18 15:55:39 UTC; 2s ago
     Docs: man:rpm-ostree(1)
 Main PID: 2650
   Status: "clients=0; idle exit in 61 seconds"
    Tasks: 9 (limit: 100204)
   Memory: 16.8M
      CPU: 104ms
   CGroup: /system.slice/rpm-ostreed.service
           └─2650 /usr/bin/rpm-ostree start-daemon

Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: Process [pid: 3290 uid: 0 unit: crio-73b3e09d07ad8aec2161ff8ea5b56757a265ded0a493b6d0f926a6d1223820b1.scope] disconnected from transaction progress
Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: client(id:machine-config-operator dbus:1.74 unit:crio-73b3e09d07ad8aec2161ff8ea5b56757a265ded0a493b6d0f926a6d1223820b1.scope uid:0) vanished; remaining=0
Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: In idle state; will auto-exit in 61 seconds
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Start request repeated too quickly.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Failed with result 'start-limit-hit'.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: Failed to start rpm-ostree System Management Daemon.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Consumed 104ms CPU time
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Start request repeated too quickly.
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Failed with result 'start-limit-hit'.
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: Failed to start rpm-ostree System Management Daemon.
E0718 15:55:41.892942 1755 writer.go:200] Marking Degraded due to: Failed to remove rollback: error running rpm-ostree cleanup -r: Job for rpm-ostreed.service failed. See "systemctl status rpm-ostreed.service" and "journalctl -xe" for details.
error: exit status: 1
: exit status 1

Expected results:

MCDs should not fail, and MachineConfigPools should not become degraded, when a machine-config-daemon pod is deleted.

Additional info:

Eventually, after a few minutes, the 'start-limit-hit' condition stops happening and the MCP stops being degraded.

Every configuration change that triggers a node reboot will make the MCP become degraded, since the reboot restarts the MCD pod and hence triggers the 'start-limit-hit'.
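For context: systemd reports 'start-limit-hit' when a unit receives more start requests than its StartLimitBurst allows within StartLimitIntervalSec, and the condition clears on its own once the interval elapses, which matches the recovery after a few minutes observed above. Below is a minimal sketch for inspecting and manually clearing the condition on the node; it assumes shell access via the usual `oc debug` / `chroot /host` pattern and uses only standard systemctl commands, not an MCO-specific fix:

  # Open a root shell on the affected node (node name taken from the logs above)
  oc debug node/ip-10-0-52-69.us-east-2.compute.internal
  chroot /host

  # Inspect the unit's start-rate-limit settings and its failed state
  systemctl show rpm-ostreed.service -p StartLimitBurst -p StartLimitIntervalUSec
  systemctl status rpm-ostreed.service

  # reset-failed also resets the start-rate counter, allowing an immediate restart
  systemctl reset-failed rpm-ostreed.service
  systemctl start rpm-ostreed.service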
It is probable that a cluster can run into this during normal operation, which will lead to a degraded node.

Regression: this issue started appearing with https://github.com/openshift/machine-config-operator/pull/3243
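For convenience, a reproduction sketch using standard oc commands; the pod label (k8s-app=machine-config-daemon), namespace, and container name are the MCO defaults and may need adjusting, and the pod names are placeholders:

  # Find the MCD pods and note one scheduled on a worker node
  oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide

  # Delete that MCD pod; the DaemonSet recreates it
  oc -n openshift-machine-config-operator delete pod <machine-config-daemon-pod>

  # Watch the worker MachineConfigPool report Degraded, then follow the new pod's log
  oc get mcp worker -w
  oc -n openshift-machine-config-operator logs -f <new-machine-config-daemon-pod> -c machine-config-daemon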
*** This bug has been marked as a duplicate of bug 2108320 ***