Looks like we would need to pick https://github.com/kubernetes/kubernetes/pull/107470 into 4.9 and k8s 1.22. Filip, can you check whether the above is sufficient, or whether we need something more.
Sorry for the late response. I could not reproduce this. I took cronjobs similar to the ones posted and let them run in an upgraded environment. I created a bunch of them before the upgrade and also added a bunch after the upgrade. I tried these upgrades:
- 4.8.0-0.nightly-2022-02-17-071009 -> 4.9.0-0.nightly-2022-02-16-161609
- 4.8.0-0.nightly-2022-02-23-210700 -> 4.9.18 and latest 4.9

But the number of jobs always corresponds to the number of pods and to the successfulJobsHistoryLimit. I also saw a lot of "error syncing CronJobController default/example-3, requeuing", but these errors don't differ significantly even after applying the patch https://github.com/kubernetes/kubernetes/pull/107470 as suggested. The cronjob errors are not indicative of the problem (they are just harmless collisions in the cronjob controller). The problem seems to be with the garbage collector, which does not delete the pods; the cronjob controller is not at fault (it does not delete the pods itself).

2022-02-04T20:02:49.127071771Z I0204 20:02:49.126999 1 garbagecollector.go:213] syncing garbage collector with updated resources from discovery (attempt 3896): added: [/v1, Resource=configmaps /v1,
2022-02-04T20:02:49.127396949Z I0204 20:02:49.127377 1 shared_informer.go:240] Waiting for caches to sync for garbage collector
2022-02-04T20:03:19.131897387Z I0204 20:03:19.128377 1 shared_informer.go:266] stop requested
2022-02-04T20:03:19.131897387Z E0204 20:03:19.128421 1 shared_informer.go:243] unable to sync caches for garbage collector
2022-02-04T20:03:19.131897387Z E0204 20:03:19.128440 1 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 3896)

So we can see the GC cannot sync and keeps retrying in an endless loop. This is usually caused by some APIServices being down or misbehaving. Can you please increase the logLevel to "Trace" in the cluster KubeControllerManager and post the logs for more details?
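For reference, a minimal sketch of how to raise the verbosity, assuming the standard OpenShift KubeControllerManager operator resource named "cluster" (the pod name below is a placeholder):

# raise the kube-controller-manager operator log level to Trace
oc patch kubecontrollermanager cluster --type merge -p '{"spec":{"logLevel":"Trace"}}'
# after the new revision rolls out, collect the logs
oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager <kube-controller-manager-pod> -c kube-controller-manager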
For us it also happens on two 4.9 clusters (one OCP and one OKD). I noticed it does not seem to happen for ALL cronjobs. Where it is reproducibly seen is for the cronjobs of:
- OpenShift logging, e.g. elasticsearch-im-app, elasticsearch-im-audit and elasticsearch-im-infra
- collect-profiles in namespace openshift-operator-lifecycle-manager
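A quick way to gauge the pile-up in those namespaces (a sketch; the namespace names are the usual defaults and may differ on other clusters):

oc get cronjobs,jobs,pods -n openshift-logging
oc get cronjobs,jobs,pods -n openshift-operator-lifecycle-manager
# count the leftover jobs for collect-profiles
oc get jobs -n openshift-operator-lifecycle-manager --no-headers | wc -l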
@kai-uwe.rommel the logs and events should also help to identify potential problems in the cronjob controller. I would also suggest running must-gather so we can see all the working and non-working jobs.
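For completeness, the usual commands (nothing cluster-specific assumed beyond the namespaces named above):

oc adm must-gather
# events from one of the affected namespaces are also useful
oc get events -n openshift-logging --sort-by=.lastTimestamp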
Will do when I return to work next week. I also need to rule out other possible causes. On the mentioned OKD cluster, a leftover CRD from GitLab was the cause - solved.
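For anyone hitting the same thing, a rough sketch of the checks that surfaced the leftover CRD (the gitlab grep is specific to our case):

# look for CRDs left behind by uninstalled operators
oc get crd | grep -i gitlab
# check whether any aggregated API is unavailable
oc get apiservices | grep -v True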
Filip, check bug 2057378, which I'm currently marking as a duplicate; it might contain more info useful for debugging.
*** Bug 2057378 has been marked as a duplicate of this bug. ***
On the other (OCP) cluster, basically the same leftover GitLab Runners CRD caused the problem. Also solved. My interpretation is that after an upgrade to a newer k8s release there may be older resources like this GitLab Runners CRD that are incompatible and cause problems with the finalizers of objects. This leads to hanging deletions which hold up everything. As in our two cases, users/admins may be unaware of the very existence of such a resource because the previously installed operator (or whatever it was) has already been uninstalled but left something behind that was not deleted properly.
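A hedged sketch of how such hanging deletions can be spotted; <resource> and <name> are placeholders for whatever object refuses to go away:

# a set deletionTimestamp with finalizers still present means the deletion is stuck
oc get <resource> <name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
# leftover CRDs themselves can also hang in a terminating state
oc get crd -o custom-columns=NAME:.metadata.name,DELETED:.metadata.deletionTimestamp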
Thanks for figuring it out. Indeed, we can see that it is caused by an orphaned CRD in the following log message:

> graph_builder.go:279] garbage controller monitor not yet synced: apps.gitlab.com/v1beta2, Resource=runners

This is also the case in the duplicate bug https://bugzilla.redhat.com/show_bug.cgi?id=2057378. Closing the bug; please reopen if any other assistance is required.
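For reference, a rough sketch of locating and removing the orphaned CRD from the log message above (the CRD name runners.apps.gitlab.com is inferred from the apps.gitlab.com/v1beta2 group; verify it is genuinely unused before deleting, and the pod name is a placeholder):

oc logs -n openshift-kube-controller-manager <kube-controller-manager-pod> -c kube-controller-manager | grep "monitor not yet synced"
oc get crd | grep apps.gitlab.com
oc delete crd runners.apps.gitlab.com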