Description of problem:
As mentioned in the conventions doc (https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability), Alertmanager should have a replica count of 2 with hard affinities set until we bring the descheduler into our product.

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Follow-up of bug 1949262. We need to be able to set minReadySeconds on statefulsets before the replica count can be decreased to 2 (see https://github.com/kubernetes/kubernetes/pull/100842).
Tested with the PR. The expected number of Alertmanager pods changed to 2, but the pods cannot be started.

# oc -n openshift-monitoring get pdb alertmanager-main
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alertmanager-main   N/A             1                 0                     41m

# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-11-25T02:52:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "95240"
  uid: 94eea939-798d-48a0-8f24-aa89aaa525c2
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      alertmanager: main
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-11-25T03:34:55Z"
    message: ""
    observedGeneration: 1
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 1
  disruptionsAllowed: 0
  expectedPods: 2
  observedGeneration: 1

# while true; do date; oc -n openshift-monitoring get pod | grep alertmanager; sleep 10s; echo -e "\n"; done
Wed Nov 24 22:29:54 EST 2021
alertmanager-main-0   0/6   Terminating         0   3s
alertmanager-main-1   0/6   ContainerCreating   0   3s

Wed Nov 24 22:30:10 EST 2021
alertmanager-main-0   0/6   Terminating         0   1s
alertmanager-main-1   0/6   Terminating         0   1s

Wed Nov 24 22:30:25 EST 2021
alertmanager-main-0   0/6   Terminating         0   1s
alertmanager-main-1   0/6   Terminating         0   1s

Wed Nov 24 22:30:41 EST 2021
alertmanager-main-1   6/6   Terminating         0   6s

Wed Nov 24 22:30:56 EST 2021
alertmanager-main-1   6/6   Terminating         0   6s

Wed Nov 24 22:31:11 EST 2021
alertmanager-main-0   6/6   Terminating         0   4s
alertmanager-main-1   6/6   Terminating         0   4s

Wed Nov 24 22:31:27 EST 2021
alertmanager-main-0   0/6   Terminating         0   4s
alertmanager-main-1   0/6   Terminating         0   4s

Wed Nov 24 22:31:42 EST 2021
alertmanager-main-0   0/6   Terminating         0   3s
alertmanager-main-1   0/6   Terminating         0   3s

Wed Nov 24 22:31:57 EST 2021
alertmanager-main-0   0/6   ContainerCreating   0   0s
alertmanager-main-1   0/6   Pending             0   0s

Wed Nov 24 22:32:13 EST 2021
alertmanager-main-0   6/6   Terminating         0   6s

# oc -n openshift-monitoring get event | grep alertmanager-main
...
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
13s   Normal    SuccessfulCreate   statefulset/alertmanager-main   create Pod alertmanager-main-0 in StatefulSet alertmanager-main successful
13s   Warning   FailedCreate       statefulset/alertmanager-main   create Pod alertmanager-main-1 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
# oc -n openshift-monitoring logs prometheus-operator-84c85586d6-bpf2r
...
level=info ts=2021-11-25T02:53:02.3094545Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.316758637Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.426700671Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.432181021Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.50330463Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
level=info ts=2021-11-25T02:53:02.509158493Z caller=operator.go:741 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2021-11-25T02:53:02.553180316Z caller=operator.go:814 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"
...
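The log above shows the operator alternating between "sync alertmanager" and a delete-and-recreate of the StatefulSet, because the API server rejected the in-place update. As a rough illustration of that decision (not the prometheus-operator code), the sketch below compares two spec dictionaries and reports which changed top-level fields fall outside the four fields named in the Forbidden message. The function names and the sample specs are hypothetical, and the log does not say which field actually differed in this cluster.

```python
from typing import Any, Dict, List

# Top-level StatefulSet spec fields that, per the Forbidden message quoted in
# the log, may be changed by an update. Any other changed field is rejected.
MUTABLE_SPEC_FIELDS = {"replicas", "template", "updateStrategy", "minReadySeconds"}


def forbidden_changes(old_spec: Dict[str, Any], new_spec: Dict[str, Any]) -> List[str]:
    """Return changed top-level spec fields that cannot be updated in place."""
    changed = {
        key
        for key in old_spec.keys() | new_spec.keys()
        if old_spec.get(key) != new_spec.get(key)
    }
    return sorted(changed - MUTABLE_SPEC_FIELDS)


def sync_action(old_spec: Dict[str, Any], new_spec: Dict[str, Any]) -> str:
    """Hypothetical helper: decide between no-op, in-place update, and recreate."""
    if old_spec == new_spec:
        return "noop"
    return "recreate" if forbidden_changes(old_spec, new_spec) else "update"


# Hypothetical specs, for illustration only.
old = {"replicas": 3, "selector": {"matchLabels": {"alertmanager": "main"}}}
only_replicas = {"replicas": 2, "selector": {"matchLabels": {"alertmanager": "main"}}}
selector_too = {"replicas": 2, "selector": {"matchLabels": {"app.kubernetes.io/instance": "main"}}}

print(sync_action(old, only_replicas))  # replica change alone is updatable
print(sync_action(old, selector_too))   # a selector change forces a recreate
```

If the desired spec keeps differing from the live one in a forbidden field on every sync, a loop like the one in the log is the expected outcome under this model: each sync deletes and recreates the StatefulSet, and the pods never settle.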
The foregroundDeletion finalizer should be removed from the alertmanager-main statefulset.

# oc -n openshift-monitoring get sts alertmanager-main -oyaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    prometheus-operator-input-hash: "14523878381744334873"
  creationTimestamp: "2021-11-25T03:49:58Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-11-25T03:49:58Z"
  finalizers:
  - foregroundDeletion
Pull request submitted
Checked with 4.10.0-0.nightly-2021-12-12-184227; the fix is in it. The Alertmanager statefulset has 2 replicas and hard affinity set.

# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6   Running   0   4h13m
alertmanager-main-1   6/6   Running   0   4h12m

# oc -n openshift-monitoring get sts alertmanager-main -oyaml
...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: alert-router
                app.kubernetes.io/instance: main
                app.kubernetes.io/name: alertmanager
                app.kubernetes.io/part-of: openshift-monitoring
            namespaces:
            - openshift-monitoring
            topologyKey: kubernetes.io/hostname

# oc -n openshift-monitoring get pdb alertmanager-main
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alertmanager-main   N/A             1                 1                     10h

# oc -n openshift-monitoring get pdb alertmanager-main -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2021-12-12T23:30:56Z"
  generation: 1
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.22.2
  name: alertmanager-main
  namespace: openshift-monitoring
  resourceVersion: "149472"
  uid: 74e9b3dd-a3c8-45fb-8b5a-6b627a0a3acd
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: alert-router
      app.kubernetes.io/instance: main
      app.kubernetes.io/name: alertmanager
      app.kubernetes.io/part-of: openshift-monitoring
status:
  conditions:
  - lastTransitionTime: "2021-12-13T06:01:03Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 1
  expectedPods: 2
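The two PDB status blocks in this report (currentHealthy: 0 during the failed test, currentHealthy: 2 after the fix) follow from simple arithmetic on expectedPods and maxUnavailable. The sketch below is a simplified model of that calculation, assuming an integer maxUnavailable (percentages are not handled); the function name is hypothetical and this is not the disruption controller's code.

```python
from typing import Dict


def pdb_status(expected_pods: int, current_healthy: int, max_unavailable: int) -> Dict[str, int]:
    """Simplified model of the PDB status fields for an integer maxUnavailable."""
    # At least this many pods must stay healthy for the budget to hold.
    desired_healthy = max(expected_pods - max_unavailable, 0)
    # Only the healthy pods above that floor may be voluntarily disrupted.
    disruptions_allowed = max(current_healthy - desired_healthy, 0)
    return {
        "expectedPods": expected_pods,
        "currentHealthy": current_healthy,
        "desiredHealthy": desired_healthy,
        "disruptionsAllowed": disruptions_allowed,
    }


# Values taken from the two status blocks in this report.
failed_run = pdb_status(expected_pods=2, current_healthy=0, max_unavailable=1)
fixed_run = pdb_status(expected_pods=2, current_healthy=2, max_unavailable=1)

print(failed_run["disruptionsAllowed"])  # 0, reported as InsufficientPods
print(fixed_run["disruptionsAllowed"])   # 1, reported as SufficientPods
```

With 2 replicas and maxUnavailable: 1, a single disruption is allowed only while both pods are healthy, which is why the verification checks for ALLOWED DISRUPTIONS of 1.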
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056