Bug 2016435 - Duplicate AlertmanagerClusterFailedToSendAlerts alerts
Summary: Duplicate AlertmanagerClusterFailedToSendAlerts alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2062091
Reported: 2021-10-21 14:05 UTC by Simon Pasquier
Modified: 2022-03-23 09:56 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2062091
Environment:
Last Closed: 2022-03-10 16:21:33 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1481 | 0 | None | Merged | Bug 2016435: Removing one of the AlertmanagerClusterFailedToSendAlerts alerts | 2022-03-09 07:36:17 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:21:54 UTC

Description Simon Pasquier 2021-10-21 14:05:05 UTC
CMO ships two AlertmanagerClusterFailedToSendAlerts alerts [1]. The alerts come from upstream [2] and differ only in whether the {{integration}} label is empty: the label is empty for "non-critical integrations" and non-empty for "critical integrations".

When we reworked alert severity in 4.9, it was agreed that AlertmanagerClusterFailedToSendAlerts shouldn't be critical, which is why both alerts ended up with the "warning" severity. To avoid confusion, should CMO ship only one alert (the expression with the "integration=~`.*`" selector)?


[1] https://github.com/openshift/cluster-monitoring-operator/blob/79cdf6865f159afbd89b7be60a8bbcf5f24cb938/assets/alertmanager/prometheus-rule.yaml#L61-L94
[2] https://github.com/prometheus/alertmanager/blob/1b8afe7cb5aafe59442e35979ec57401145ea26b/doc/alertmanager-mixin/alerts.libsonnet#L61-L97
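
PromQL regex matchers are fully anchored, so a selector such as integration=~`.*` matches every series whatever the value of the integration label, which is why a single rule with that selector can cover all integrations. A rough illustration in Python follows. The `matches` helper is hypothetical, the integration names are sample values rather than data from this bug, and Python's `re` only approximates the RE2 syntax Prometheus uses.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Mimic a PromQL =~ matcher: the regex must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


# Sample label values (assumed for illustration), including an empty label.
samples = ["email", "pagerduty", "webhook", ""]

# Every sample value is selected by the `.*` pattern.
covered = [value for value in samples if matches(".*", value)]
print(covered == samples)
```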

Comment 3 Junqi Zhao 2021-11-18 02:29:40 UTC
Checked with 4.10.0-0.nightly-2021-11-17-100252: there is only one AlertmanagerClusterFailedToSendAlerts alert now.
# oc -n openshift-monitoring get prometheusrules alertmanager-main-rules -oyaml
...
    - alert: AlertmanagerClusterFailedToSendAlerts
      annotations:
        description: The minimum notification failure rate to {{ $labels.integration
          }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
          humanizePercentage }}.
        summary: All Alertmanager instances in a cluster failed to send notifications
          to a critical integration.
      expr: |
        min by (namespace,service, integration) (
          rate(alertmanager_notifications_failed_total{job="alertmanager-main",namespace="openshift-monitoring", integration=~`.*`}[5m])
        /
          rate(alertmanager_notifications_total{job="alertmanager-main",namespace="openshift-monitoring", integration=~`.*`}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
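
The manual check above could also be scripted. The following is a minimal sketch, assuming the rule object has been exported as JSON (for example with `-o json` instead of `-oyaml`) and follows the usual PrometheusRule layout of spec.groups[].rules[]. The `duplicate_alerts` helper and the sample document are hypothetical and only mimic the pre-fix state with two rules of the same name.

```python
import json
from collections import Counter


def duplicate_alerts(rule_doc: dict) -> dict:
    """Return {alert name: count} for alert names defined more than once."""
    counts = Counter(
        rule["alert"]
        for group in rule_doc.get("spec", {}).get("groups", [])
        for rule in group.get("rules", [])
        if "alert" in rule  # recording rules have no "alert" key
    )
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample, not real cluster output.
sample = json.loads("""
{
  "spec": {
    "groups": [
      {
        "name": "alertmanager.rules",
        "rules": [
          {"alert": "AlertmanagerClusterFailedToSendAlerts", "for": "5m"},
          {"alert": "AlertmanagerClusterFailedToSendAlerts", "for": "5m"},
          {"alert": "AlertmanagerConfigInconsistent", "for": "20m"}
        ]
      }
    ]
  }
}
""")

print(duplicate_alerts(sample))
```

Against real output, the sample could be replaced with `json.load(sys.stdin)` and the `oc` command piped into the script; an empty result would mean no alert name is defined twice.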

Comment 8 errata-xmlrpc 2022-03-10 16:21:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

