Bug 2016435

Summary: Duplicate AlertmanagerClusterFailedToSendAlerts alerts
Product: OpenShift Container Platform Reporter: Simon Pasquier <spasquie>
Component: Monitoring Assignee: Prashant Balachandran <pnair>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: medium    
Version: 4.9 CC: amuller, anpicker, aos-bugs, erooth, kgordeev, nchoudhu, pnair, wking
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2062091 (view as bug list) Environment:
Last Closed: 2022-03-10 16:21:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2062091    

Description Simon Pasquier 2021-10-21 14:05:05 UTC
CMO ships two AlertmanagerClusterFailedToSendAlerts alerts [1]. The alerts come from upstream [2] and differ only by the emptiness of the {{integration}} label, with an empty label for "non-critical integrations" and a non-empty label for "critical integrations".

When we reworked alert severity in 4.9, it was agreed that AlertmanagerClusterFailedToSendAlerts shouldn't be critical, which is why both alerts ended up with the "warning" severity. To avoid confusion, should CMO ship only one alert (the expression with the "integration=~`.*`" selector)?


[1] https://github.com/openshift/cluster-monitoring-operator/blob/79cdf6865f159afbd89b7be60a8bbcf5f24cb938/assets/alertmanager/prometheus-rule.yaml#L61-L94
[2] https://github.com/prometheus/alertmanager/blob/1b8afe7cb5aafe59442e35979ec57401145ea26b/doc/alertmanager-mixin/alerts.libsonnet#L61-L97
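
For illustration, here is a minimal sketch of why the "integration=~`.*`" selector covers every integration. PromQL regex matchers are fully anchored, so Python's re.fullmatch is used as an approximation (Prometheus itself uses RE2 syntax). The helper name and the sample label values are hypothetical and not taken from CMO.

```python
import re


def selector_matches(pattern: str, label_value: str) -> bool:
    """Approximate a PromQL `=~` matcher: the regex must match the whole value."""
    return re.fullmatch(pattern, label_value) is not None


# Hypothetical integration label values; "" stands for a series with no such label.
sample_values = ["", "email", "pagerduty", "webhook"]

# Keep only the values that the `.*` selector would select.
matched = [value for value in sample_values if selector_matches(".*", value)]
print(matched)
```

Every sample value is selected, including the empty one, which suggests a single rule with this selector is enough to cover all integrations.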

Comment 3 Junqi Zhao 2021-11-18 02:29:40 UTC
Checked with 4.10.0-0.nightly-2021-11-17-100252; there is only one AlertmanagerClusterFailedToSendAlerts alert now.
# oc -n openshift-monitoring get prometheusrules alertmanager-main-rules -oyaml
...
    - alert: AlertmanagerClusterFailedToSendAlerts
      annotations:
        description: The minimum notification failure rate to {{ $labels.integration
          }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
          humanizePercentage }}.
        summary: All Alertmanager instances in a cluster failed to send notifications
          to a critical integration.
      expr: |
        min by (namespace,service, integration) (
          rate(alertmanager_notifications_failed_total{job="alertmanager-main",namespace="openshift-monitoring", integration=~`.*`}[5m])
        /
          rate(alertmanager_notifications_total{job="alertmanager-main",namespace="openshift-monitoring", integration=~`.*`}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
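
The manual check above could also be scripted. The following is only a sketch: it assumes the rule object has been exported as JSON (for example with `-o json` instead of `-oyaml`) and that alerts live under spec.groups[].rules[], as in a PrometheusRule object. The function names and the trimmed sample object are hypothetical stand-ins, not real cluster output.

```python
import json
from collections import Counter


def count_alerts(prometheus_rule: dict) -> Counter:
    """Count how many rules define each alert name in a PrometheusRule object."""
    counts = Counter()
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")  # recording rules have "record" instead
            if name:
                counts[name] += 1
    return counts


def duplicated_alerts(prometheus_rule: dict) -> list:
    """Return the sorted alert names that are defined more than once."""
    return sorted(n for n, c in count_alerts(prometheus_rule).items() if c > 1)


# Hypothetical, heavily trimmed object mimicking the state before the fix.
sample = json.loads("""
{
  "spec": {
    "groups": [
      {
        "name": "example-group",
        "rules": [
          {"alert": "AlertmanagerClusterFailedToSendAlerts",
           "labels": {"severity": "warning"}},
          {"alert": "AlertmanagerClusterFailedToSendAlerts",
           "labels": {"severity": "warning"}}
        ]
      }
    ]
  }
}
""")

print(duplicated_alerts(sample))
```

Against a build that contains the fix, the list would be expected to be empty for this alert. Note that Prometheus itself allows several rules to share an alert name, so this check is specific to the duplication reported here.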

Comment 8 errata-xmlrpc 2022-03-10 16:21:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056