2070047 – Kuryr: Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Maysa Macedo
QA Contact:	Itay Matza
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2077384

Reported:	2022-03-30 11:20 UTC by Maysa Macedo
Modified:	2022-08-10 11:03 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 11:02:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 1359	None	open	Bug 2070047: Bump max value of hist quantile for kuryr_cni_request_duration	2022-03-30 11:21:24 UTC
Github	openshift kuryr-kubernetes pull 647	None	open	Bug 2070047: Increase cni_request_duration buckets	2022-04-01 09:29:19 UTC
Red Hat Product Errata	RHSA-2022:5069	None	None	None	2022-08-10 11:03:01 UTC

Description Maysa Macedo 2022-03-30 11:20:13 UTC

Description of problem:

With Kuryr, CNI requests can take a considerable time because each request has to wait for a VIF from Neutron.
We've seen KuryrCNISlow warning alerts being raised and reported by the following test:
"Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured". The test failure causes the Kuryr upgrade to fail.
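The linked pull requests mention the buckets and the histogram quantile of kuryr_cni_request_duration. As a rough illustration of why the bucket layout matters for slow requests, here is a minimal sketch of the linear interpolation that Prometheus documents for histogram_quantile. The bucket bounds and sample durations are invented for the example and are not the values used by Kuryr or by the alert rule.

```python
import math


def cumulative_buckets(samples, bounds):
    """Build Prometheus-style cumulative (upper_bound, count) pairs, ending with +Inf."""
    edges = list(bounds) + [math.inf]
    return [(le, sum(1 for s in samples if s <= le)) for le in edges]


def histogram_quantile(q, buckets):
    """Simplified re-implementation of the documented interpolation (not Prometheus code)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the requested rank.
    idx = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cum = buckets[idx]
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: the estimate is capped at
        # the largest finite bound, however slow the requests really were.
        return buckets[-2][0]
    lower, prev = buckets[idx - 1] if idx > 0 else (0.0, 0)
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


# Hypothetical durations in seconds: half fast, half waiting on a VIF.
samples = [2.0] * 50 + [40.0] * 50
narrow = cumulative_buckets(samples, [1, 5, 10])           # hypothetical layout
wide = cumulative_buckets(samples, [1, 5, 10, 60, 120])    # hypothetical layout

p90_narrow = histogram_quantile(0.9, narrow)
p90_wide = histogram_quantile(0.9, wide)
print(p90_narrow, p90_wide)
```

With the narrow layout the 0.9 quantile lands in the +Inf bucket, so the estimate is pinned at the largest finite bound. With the wider layout it is interpolated inside a finite bucket, so it can be compared meaningfully against a higher threshold.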


Version-Release number of selected component (if applicable):


How reproducible:

Upgrade from OCP 4.9 to OCP 4.10.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Itay Matza 2022-04-28 13:53:27 UTC

Verified with the following steps:

- Installed OCP 4.10.0-0.nightly-2022-04-27-212741 on top of RHOS-16.1-RHEL-8-20220329.n.1 with Kuryr.

- Make sure the cluster is up and the Watchdog and AlertmanagerReceiversNotConfigured alerts exist:
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"Watchdog"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"APIRemovedInNextEUSReleaseInUse"
"APIRemovedInNextEUSReleaseInUse"
"AlertmanagerReceiversNotConfigured"
```

- Upgraded successfully to 4.11.0-0.nightly-2022-04-26-181148 using the upgrade command:
```
$ oc adm upgrade --to-image="registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-26-181148" --allow-explicit-upgrade --force=true 
```

- Make sure the cluster is up.

- Check the alerts: the Watchdog and AlertmanagerReceiversNotConfigured alerts exist, but KuryrCNISlow does not.
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"AlertmanagerReceiversNotConfigured"
"Watchdog"
```

- Keep checking the alerts and make sure the KuryrCNISlow is not raised.

- Destroy the cluster and recreate it with OCP 4.11.0-0.nightly-2022-04-26-181148.

- Keep checking the alerts and make sure the KuryrCNISlow is not raised.
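
The alert checks in the steps above list every alert name by hand. A minimal sketch of automating the same check is shown below, assuming the JSON shape returned by the Prometheus `/api/v1/alerts` endpoint (a `data.alerts` list whose entries carry `labels` and `state`). The sample payload and the function name are hypothetical, and unlike the jq filter above this sketch only looks at alerts in the firing state.

```python
# Alerts the test tolerates, taken from the test name in this bug.
ALLOWED = {"Watchdog", "AlertmanagerReceiversNotConfigured"}


def unexpected_firing(payload, allowed=ALLOWED):
    """Return sorted names of firing alerts that are not in the allowed set."""
    names = set()
    for alert in payload.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if name and alert.get("state") == "firing" and name not in allowed:
            names.add(name)
    return sorted(names)


# Hypothetical response body, shaped like the /api/v1/alerts output.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "Watchdog"}, "state": "firing"},
            {"labels": {"alertname": "AlertmanagerReceiversNotConfigured"}, "state": "firing"},
            {"labels": {"alertname": "KuryrCNISlow"}, "state": "firing"},
            {"labels": {"alertname": "NodeClockNotSynchronising"}, "state": "pending"},
        ]
    },
}

offenders = unexpected_firing(sample)
print(offenders)
```

In a real check the payload would come from the same authenticated request shown in the curl commands; an empty result corresponds to the test's expectation.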

Comment 6 Prasad Chaudhari 2022-06-23 08:00:05 UTC

A similar issue is seen in version 4.8.45.


Description of problem:

The following test is failing consistently on the latest 4.8.45 build:
"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"


Version-Release number of selected component (if applicable):

[root@rdr-zscurst-348a-bastion-0 ~]# oc version
Client Version: 4.8.44
Server Version: 4.8.45
Kubernetes Version: v1.21.11+6b3cbdd


How reproducible:
Deploy the newly released 4.8.45 on the Power platform and run the e2e tests.


Actual results:
The test is failing.

Flaky invariants:

[sig-arch] Monitor cluster while tests execute

Failing tests:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Expected results:
The test should pass without any errors.

Comment 7 errata-xmlrpc 2022-08-10 11:02:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
