Looks like we would need to pick https://github.com/kubernetes/kubernetes/pull/107470 into 4.9 and k8s 1.22. Filip, can you check whether the above is sufficient, or whether we need something more.
Sorry for the late response. I could not reproduce this. I took cronjobs similar to the ones posted and let them run in an upgraded environment. I created a bunch of them before the upgrade and also added a bunch after the upgrade. I tried these upgrades:
- 4.8.0-0.nightly-2022-02-17-071009 -> 4.9.0-0.nightly-2022-02-16-161609
- 4.8.0-0.nightly-2022-02-23-210700 -> 4.9.18 and latest 4.9

But the number of jobs always corresponds to the number of pods and to the successfulJobsHistoryLimit. I also saw a lot of "error syncing CronJobController default/example-3, requeuing", but these errors don't differ significantly even after applying the patch https://github.com/kubernetes/kubernetes/pull/107470 as suggested. The cronjob errors are not indicative of the problem (they are just harmless collisions in the cronjob controller). The problem seems to be with the garbage collector, which does not delete the pods; the cronjob controller is not at fault (it does not delete the pods itself).

2022-02-04T20:02:49.127071771Z I0204 20:02:49.126999 1 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 3896): added: [/v1, Resource=configmaps /v1,
2022-02-04T20:02:49.127396949Z I0204 20:02:49.127377 1 shared_informer.go:240] Waiting for caches to sync for garbage collector
2022-02-04T20:03:19.131897387Z I0204 20:03:19.128377 1 shared_informer.go:266] stop requested
2022-02-04T20:03:19.131897387Z E0204 20:03:19.128421 1 shared_informer.go:243] unable to sync caches for garbage collector
2022-02-04T20:03:19.131897387Z E0204 20:03:19.128440 1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 3896)

So we can see the GC cannot sync and keeps retrying in an endless loop. This is usually caused by some APIServices being down or misbehaving. Can you please increase the logLevel to "Trace" in the cluster KubeControllerManager and post the logs for more details?
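For reference, a minimal sketch of how to raise the verbosity, assuming the standard OpenShift KubeControllerManager operator resource named "cluster" (the pod name below is a placeholder):

# raise the kube-controller-manager operator log level to Trace
oc patch kubecontrollermanager cluster --type merge -p '{"spec":{"logLevel":"Trace"}}'
# after the new revision rolls out, collect the logs
oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager <kube-controller-manager-pod> -c kube-controller-manager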
For us it also happens on two 4.9 clusters (one OCP and one OKD). I noticed it does not seem to happen for ALL cronjobs. Where it is reproducibly seen is for the cronjobs of:
- OpenShift logging, e.g. elasticsearch-im-app, elasticsearch-im-audit and elasticsearch-im-infra
- collect-profiles in namespace openshift-operator-lifecycle-manager
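A quick way to gauge the pile-up in those namespaces (a sketch; the namespace names are the usual defaults and may differ on other clusters):

oc get cronjobs,jobs,pods -n openshift-logging
oc get cronjobs,jobs,pods -n openshift-operator-lifecycle-manager
# count the leftover jobs for collect-profiles
oc get jobs -n openshift-operator-lifecycle-manager --no-headers | wc -l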
@kai-uwe.rommel the logs and events should also help to identify potential problems in the cronjob controller. I would also suggest running must-gather so we can see all the working and non-working jobs.
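For completeness, the usual commands (nothing cluster-specific assumed beyond the namespaces named above):

oc adm must-gather
# events from one of the affected namespaces are also useful
oc get events -n openshift-logging --sort-by=.lastTimestamp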
Will do when I return to work next week. I also need to rule out other possible causes. On the mentioned OKD cluster, a leftover CRD from GitLab was the cause - solved.
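For anyone hitting the same thing, a rough sketch of the checks that surfaced the leftover CRD (the gitlab grep is specific to our case):

# look for CRDs left behind by uninstalled operators
oc get crd | grep -i gitlab
# check whether any aggregated API is unavailable
oc get apiservices | grep -v True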
Filip, check bug 2057378, which I'm currently marking as a duplicate; it might contain more info useful for debugging.
*** Bug 2057378 has been marked as a duplicate of this bug. ***
On the other (OCP) cluster, basically the same leftover GitLab Runners CRD caused the problem. Also solved. My interpretation is that after an upgrade to a newer k8s release there may be older resources like this GitLab Runners CRD that are incompatible and cause problems with the finalizers of objects. This leads to hanging deletions which hold up everything. As in our two cases, users/admins may be unaware of the very existence of such a resource because the previously installed operator (or whatever it was) has already been uninstalled but left something behind that was not deleted properly.
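A hedged sketch of how such hanging deletions can be spotted; <resource> and <name> are placeholders for whatever object refuses to go away:

# a set deletionTimestamp with finalizers still present means the deletion is stuck
oc get <resource> <name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
# leftover CRDs themselves can also hang in a terminating state
oc get crd -o custom-columns=NAME:.metadata.name,DELETED:.metadata.deletionTimestamp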
Thanks for figuring it out. Indeed, we can see that it is caused by an orphaned CRD in the following log message:

> graph_builder.go:279] garbage controller monitor not yet synced: apps.gitlab.com/v1beta2, Resource=runners

This is also the case in the duplicate bug https://bugzilla.redhat.com/show_bug.cgi?id=2057378. Closing the bug; please reopen if any other assistance is required.
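For reference, a rough sketch of locating and removing the orphaned CRD from the log message above (the CRD name runners.apps.gitlab.com is inferred from the apps.gitlab.com/v1beta2 group; verify it is genuinely unused before deleting, and the pod name is a placeholder):

oc logs -n openshift-kube-controller-manager <kube-controller-manager-pod> -c kube-controller-manager | grep "monitor not yet synced"
oc get crd | grep apps.gitlab.com
oc delete crd runners.apps.gitlab.com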