Description of problem:

When an MCD (machine-config-daemon) pod is restarted, the rpm-ostreed service on the node hits its systemd start limit ("start-limit-hit"), the service fails to start, and the MachineConfigPool for this node becomes degraded.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.11.0-0.nightly-2022-07-16-020951

Platform (AWS, VSphere, Metal, etc.):

AWS (the node names in the logs below are us-east-2 EC2 internal addresses)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):

Yes

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Delete a machine-config-daemon pod (worker pool); see the command sketch at the end of this report

Actual results:

The worker pool is degraded, and we can see this log in the new machine-config-daemon pod:

I0718 15:55:37.137472 1755 start.go:112] Version: v4.11.0-202207132217.p0.ge722bb7.assembly.stream-dirty (e722bb791349c5f74548633d5c3407c414ece9bf)
I0718 15:55:37.254275 1755 start.go:125] Calling chroot("/rootfs")
I0718 15:55:37.281700 1755 update.go:1976] Running: systemctl start rpm-ostreed
I0718 15:55:38.115837 1755 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0718 15:55:38.186795 1755 rpm-ostree.go:324] Running captured: rpm-ostree status --json
I0718 15:55:38.223195 1755 daemon.go:236] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08ca66d84e70c73b09125eedcdf8183e37454839b3cbf8b7489829b29455a5e8 (411.86.202207150124-0)
I0718 15:55:38.305993 1755 start.go:101] Copied self to /run/bin/machine-config-daemon on host
I0718 15:55:38.307038 1755 start.go:189] overriding kubernetes api to https://api-int.sregidor4-rc2.qe.devcluster.openshift.com:6443
I0718 15:55:38.307648 1755 metrics.go:106] Registering Prometheus metrics
I0718 15:55:38.307740 1755 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0718 15:55:38.308999 1755 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
I0718 15:55:38.309141 1755 update.go:1991] Starting to manage node: ip-10-0-52-69.us-east-2.compute.internal
I0718 15:55:38.315672 1755 rpm-ostree.go:324] Running captured: rpm-ostree status
I0718 15:55:38.355488 1755 daemon.go:1220] State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:08ca66d84e70c73b09125eedcdf8183e37454839b3cbf8b7489829b29455a5e8
              CustomOrigin: Managed by machine-config-operator
                   Version: 411.86.202207150124-0 (2022-07-15T01:27:45Z)
I0718 15:55:38.356987 1755 coreos.go:95] CoreOS aleph version: mtime=2022-06-30 15:10:12.034 +0000 UTC build=411.86.202206301504-0 imgid=rhcos-411.86.202206301504-0-qemu.x86_64.qcow2
I0718 15:55:38.357224 1755 coreos.go:112] Ignition provisioning: time=2022-07-18T07:43:16Z
I0718 15:55:38.357243 1755 rpm-ostree.go:324] Running captured: journalctl --list-boots
I0718 15:55:38.362993 1755 daemon.go:1229] journalctl --list-boots:
-15 176434f3d44c439d8328e6681aa6f21a Mon 2022-07-18 07:41:46 UTC—Mon 2022-07-18 07:44:20 UTC
-14 64a8d5f5857d4eb183af8189ed65472b Mon 2022-07-18 07:44:33 UTC—Mon 2022-07-18 11:48:28 UTC
-13 83359db7e8cb46debbc629e0bb4c5ccc Mon 2022-07-18 11:48:40 UTC—Mon 2022-07-18 11:51:01 UTC
-12 ffe41b05f1f04ad39e66e459da025822 Mon 2022-07-18 11:51:12 UTC—Mon 2022-07-18 11:53:21 UTC
-11 168cf7fe963a41fb98f8b76c5c24ff13 Mon 2022-07-18 11:53:34 UTC—Mon 2022-07-18 12:13:58 UTC
-10 c53138eb943c4659b0e90e33ece52990 Mon 2022-07-18 12:14:09 UTC—Mon 2022-07-18 12:16:18 UTC
 -9 fa1ffb82bc3540b5b4a37598584e1279 Mon 2022-07-18 12:16:30 UTC—Mon 2022-07-18 12:41:22 UTC
 -8 88e513c8e8d64478b7bf0773d9f7b7b5 Mon 2022-07-18 12:41:35 UTC—Mon 2022-07-18 12:43:57 UTC
 -7 a5a700ac34044d03a08a9a0d9b709ea9 Mon 2022-07-18 12:44:09 UTC—Mon 2022-07-18 12:46:16 UTC
 -6 37c91c065bae4f75add01ad93e9cd190 Mon 2022-07-18 12:46:29 UTC—Mon 2022-07-18 13:02:15 UTC
 -5 baa108f90c8e4b4ca38cefb97d582a54 Mon 2022-07-18 13:02:27 UTC—Mon 2022-07-18 13:05:37 UTC
 -4 71655a9131c247bbbcd1dd4eb55cf6ee Mon 2022-07-18 13:05:50 UTC—Mon 2022-07-18 13:18:15 UTC
 -3 d4a9af45bef64fa6b35f98e591ba31dd Mon 2022-07-18 13:18:27 UTC—Mon 2022-07-18 13:27:36 UTC
 -2 ee3d4c69c8ab4a41a6ed1c50f3984152 Mon 2022-07-18 13:27:48 UTC—Mon 2022-07-18 15:44:03 UTC
 -1 62bfa5b964da4e47856cb92fe892b9ad Mon 2022-07-18 15:44:14 UTC—Mon 2022-07-18 15:55:11 UTC
  0 74ccfd95a7ae4aafb2b57283563d2ef8 Mon 2022-07-18 15:55:23 UTC—Mon 2022-07-18 15:55:38 UTC
I0718 15:55:38.363013 1755 rpm-ostree.go:324] Running captured: systemctl list-units --state=failed --no-legend
I0718 15:55:38.367886 1755 daemon.go:1244] systemd service state: OK
I0718 15:55:38.367904 1755 daemon.go:909] Starting MachineConfigDaemon
I0718 15:55:38.367987 1755 daemon.go:916] Enabling Kubelet Healthz Monitor
I0718 15:55:39.323661 1755 daemon.go:451] Node ip-10-0-52-69.us-east-2.compute.internal is not labeled node-role.kubernetes.io/master
I0718 15:55:39.392578 1755 daemon.go:1172] Current config: rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:39.392599 1755 daemon.go:1173] Desired config: rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:39.392604 1755 daemon.go:1175] state: Working
I0718 15:55:39.395877 1755 update.go:1976] Running: rpm-ostree cleanup -r
Deployments unchanged.
I0718 15:55:39.766381 1755 update.go:1991] Disk currentConfig rendered-worker-cfc6f4620f05181a7b1662ad9984feb7 overrides node's currentConfig annotation rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:39.768597 1755 daemon.go:1467] Validating against pending config rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:39.768621 1755 rpm-ostree.go:324] Running captured: rpm-ostree kargs
E0718 15:55:39.846555 1755 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-cfc6f4620f05181a7b1662ad9984feb7: error running rpm-ostree kargs: exit status 1
Job for rpm-ostreed.service failed. See "systemctl status rpm-ostreed.service" and "journalctl -xe" for details.
error: exit status: 1
I0718 15:55:41.858639 1755 daemon.go:500] Transitioned from state: Working -> Degraded
I0718 15:55:41.864888 1755 daemon.go:1172] Current config: rendered-worker-19a475cef3bfceddf28fa8b585251d4b
I0718 15:55:41.864903 1755 daemon.go:1173] Desired config: rendered-worker-cfc6f4620f05181a7b1662ad9984feb7
I0718 15:55:41.864906 1755 daemon.go:1175] state: Degraded
I0718 15:55:41.867789 1755 update.go:1976] Running: rpm-ostree cleanup -r
● rpm-ostreed.service - rpm-ostree System Management Daemon
   Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit-hit) since Mon 2022-07-18 15:55:39 UTC; 2s ago
     Docs: man:rpm-ostree(1)
 Main PID: 2650
   Status: "clients=0; idle exit in 61 seconds"
    Tasks: 9 (limit: 100204)
   Memory: 16.8M
      CPU: 104ms
   CGroup: /system.slice/rpm-ostreed.service
           └─2650 /usr/bin/rpm-ostree start-daemon

Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: Process [pid: 3290 uid: 0 unit: crio-73b3e09d07ad8aec2161ff8ea5b56757a265ded0a493b6d0f926a6d1223820b1.scope] disconnected from transaction progress
Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: client(id:machine-config-operator dbus:1.74 unit:crio-73b3e09d07ad8aec2161ff8ea5b56757a265ded0a493b6d0f926a6d1223820b1.scope uid:0) vanished; remaining=0
Jul 18 15:55:39 ip-10-0-52-69 rpm-ostree[2650]: In idle state; will auto-exit in 61 seconds
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Start request repeated too quickly.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Failed with result 'start-limit-hit'.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: Failed to start rpm-ostree System Management Daemon.
Jul 18 15:55:39 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Consumed 104ms CPU time
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Start request repeated too quickly.
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: rpm-ostreed.service: Failed with result 'start-limit-hit'.
Jul 18 15:55:41 ip-10-0-52-69 systemd[1]: Failed to start rpm-ostree System Management Daemon.
E0718 15:55:41.892942 1755 writer.go:200] Marking Degraded due to: Failed to remove rollback: error running rpm-ostree cleanup -r: Job for rpm-ostreed.service failed. See "systemctl status rpm-ostreed.service" and "journalctl -xe" for details.
error: exit status: 1
: exit status 1

Expected results:

MCDs should not fail, and MachineConfigPools should not become degraded, when a machine-config-daemon pod is deleted.

Additional info:

Eventually, after a few minutes, the 'start-limit-hit' condition stops happening and the MCP stops being degraded.

Every configuration change that triggers a node reboot will make the MCP become degraded, since the reboot restarts the MCD pod and hence triggers the 'start-limit-hit'.
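For context: systemd reports 'start-limit-hit' when a unit receives more start requests than its StartLimitBurst allows within StartLimitIntervalSec, and the condition clears on its own once the interval elapses, which matches the recovery after a few minutes observed above. Below is a minimal sketch for inspecting and manually clearing the condition on the node; it assumes shell access via the usual `oc debug` / `chroot /host` pattern and uses only standard systemctl commands, not an MCO-specific fix:

  # Open a root shell on the affected node (node name taken from the logs above)
  oc debug node/ip-10-0-52-69.us-east-2.compute.internal
  chroot /host

  # Inspect the unit's start-rate-limit settings and its failed state
  systemctl show rpm-ostreed.service -p StartLimitBurst -p StartLimitIntervalUSec
  systemctl status rpm-ostreed.service

  # reset-failed also resets the start-rate counter, allowing an immediate restart
  systemctl reset-failed rpm-ostreed.service
  systemctl start rpm-ostreed.service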
It is probable that a cluster can run into this during normal operation, which will lead to a degraded node.

Regression: this issue started appearing with https://github.com/openshift/machine-config-operator/pull/3243
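For convenience, a reproduction sketch using standard oc commands; the pod label (k8s-app=machine-config-daemon), namespace, and container name are the MCO defaults and may need adjusting, and the pod names are placeholders:

  # Find the MCD pods and note one scheduled on a worker node
  oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide

  # Delete that MCD pod; the DaemonSet recreates it
  oc -n openshift-machine-config-operator delete pod <machine-config-daemon-pod>

  # Watch the worker MachineConfigPool report Degraded, then follow the new pod's log
  oc get mcp worker -w
  oc -n openshift-machine-config-operator logs -f <new-machine-config-daemon-pod> -c machine-config-daemon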
*** This bug has been marked as a duplicate of bug 2108320 ***