Description of problem:

If the console cluster object has an empty status (status.consoleURL unset), the CMO sets up Alertmanager with the argument `--web.external-url=https:/monitoring`, which is invalid and puts it into a crashloop, logging:

level=error ts=2022-03-02T16:53:47.773Z caller=main.go:369 msg="failed to determine external URL" err="\"monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Disable the CVO and the console clusteroperator.
2. Set the console status to empty:
   oc proxy &
   curl -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" -H "User-Agent: kubectl/v1.23.4 (linux/amd64) kubernetes/e6c093d" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/consoles/cluster/status?fieldManager=kubectl-edit' --data '{"status":null}'
3. Delete the CMO pod, because it doesn't react to changes in clusteroperators; see https://bugzilla.redhat.com/show_bug.cgi?id=2060083

Actual results:
Alertmanager is given the invalid CLI argument `--web.external-url=monitoring` and crashloops, logging:

```
level=error ts=2022-03-02T16:53:47.773Z caller=main.go:369 msg="failed to determine external URL" err="\"monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
```

Expected results:
A working Alertmanager is produced.

Additional info:
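For illustration, here is a minimal Go sketch (hypothetical names and construction, not the CMO's actual code) of how deriving the flag from an unset status.consoleURL yields the invalid value seen above:

```go
package main

import "fmt"

// buildExternalURLArg is a hypothetical reconstruction of the failure
// mode, not the CMO's actual code: the operator derives the flag from
// status.consoleURL without checking that the field is set.
func buildExternalURLArg(consoleURL string) string {
	return "--web.external-url=" + consoleURL + "/monitoring"
}

func main() {
	// Normal case: status.consoleURL was populated by the console operator.
	fmt.Println(buildExternalURLArg("https://console-openshift-console.apps.example.com"))
	// Bug case: status.consoleURL is unset, so the flag value has no scheme
	// or host and Alertmanager rejects it at startup.
	fmt.Println(buildExternalURLArg("")) // --web.external-url=/monitoring
}
```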
The fix is in 4.11.0-0.nightly-2022-03-04-063157; tested with it, still the same issue.

Before setting status.consoleURL to null for console/cluster:

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
    - --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

Scale down the CVO and console-operator:

# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-console-operator scale deploy console-operator --replicas=0

# oc -n openshift-cluster-version get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-version-operator   0/0     0            0           3h56m

# oc -n openshift-console-operator get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
console-operator   0/0     0            0           3h43m

# oc proxy

Open another terminal:

# curl -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" -H "User-Agent: kubectl/v1.23.4 (linux/amd64) kubernetes/e6c093d" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/consoles/cluster/status?fieldManager=kubectl-edit' --data '{"status":null}'

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
(no result)

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6   Running            0             6m44s
alertmanager-main-1                            5/6   CrashLoopBackOff   4 (64s ago)   2m52s
prometheus-k8s-0                               6/6   Running            0             8m9s
prometheus-k8s-1                               6/6   Running            0             10m
cluster-monitoring-operator-66cb5487b9-8l7sl   2/2   Running            0             99m

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
    - --web.external-url=/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring describe pod alertmanager-main-1
...
  alertmanager:
    Container ID:  cri-o://e502d6f2cde8ff678bc26bd074f662cf364441192bef214dd28e7fb1fdd61596
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config/alertmanager.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:
        level=info ts=2022-03-08T04:11:10.810Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)"
        level=info ts=2022-03-08T04:11:10.810Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)"
        level=info ts=2022-03-08T04:11:10.829Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
        level=error ts=2022-03-08T04:11:10.857Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
        level=info ts=2022-03-08T04:11:10.857Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=28.253187ms

Delete the CMO pod:

# oc -n openshift-monitoring delete pod cluster-monitoring-operator-66cb5487b9-8l7sl
pod "cluster-monitoring-operator-66cb5487b9-8l7sl" deleted

Still the same issue:

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6   Running   0             39m
alertmanager-main-1                            5/6   Error     3 (27s ago)   48s
cluster-monitoring-operator-66cb5487b9-8qwpb   2/2   Running   0             16m
prometheus-k8s-0                               6/6   Running   0             20m
prometheus-k8s-1                               6/6   Running   0             21m

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
    - --web.external-url=/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring get po alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get po alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=/monitoring

# oc -n openshift-monitoring get po prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring get po prometheus-k8s-1 -oyaml | grep "web.external-url"
--web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/ # oc -n openshift-monitoring describe pod alertmanager-main-1 ... Containers: alertmanager: Container ID: cri-o://8c733eba4cbc223a0598bafa6d95356de050da0151f3608f233402efc5c8342c Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58 Ports: 9094/TCP, 9094/UDP Host Ports: 0/TCP, 0/UDP Args: --config.file=/etc/alertmanager/config/alertmanager.yaml --storage.path=/alertmanager --data.retention=120h --cluster.listen-address=[$(POD_IP)]:9094 --web.listen-address=127.0.0.1:9093 --web.external-url=/monitoring --web.route-prefix=/ --cluster.peer=alertmanager-main-0.alertmanager-operated:9094 --cluster.peer=alertmanager-main-1.alertmanager-operated:9094 --cluster.reconnect-timeout=5m State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: level=info ts=2022-03-08T04:45:04.955Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)" level=info ts=2022-03-08T04:45:04.955Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)" level=info ts=2022-03-08T04:45:04.968Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s level=error ts=2022-03-08T04:45:05.007Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported" level=info ts=2022-03-08T04:45:05.007Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=38.734031ms # oc -n openshift-monitoring rsh -c alertmanager alertmanager-main-0 sh-4.4$ /bin/alertmanager --help ... --web.external-url=WEB.EXTERNAL-URL The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically.
Junqi Zhao, I cannot reproduce this; for me it works in that version.

The cluster has the same version as the one you mentioned:

$ k get clusterversion version -ojson | jq .status.desired
{
  "image": "registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-03-04-063157",
  "version": "4.11.0-0.nightly-2022-03-04-063157"
}

The console doesn't have the URL in the status:

$ k get console cluster -ojson | jq .status
null

The statefulset doesn't have `web.external-url` in the alertmanager args:

$ k get statefulset -n openshift-monitoring alertmanager-main -ojson | jq '.spec.template.spec.containers[0].args'
[
  "--config.file=/etc/alertmanager/config/alertmanager.yaml",
  "--storage.path=/alertmanager",
  "--data.retention=120h",
  "--cluster.listen-address=[$(POD_IP)]:9094",
  "--web.listen-address=127.0.0.1:9093",
  "--web.route-prefix=/",
  "--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
  "--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
  "--cluster.reconnect-timeout=5m"
]

Alertmanager is running:

$ k get pod | rg alertmanager
alertmanager-main-0   6/6   Running   0   9m8s
alertmanager-main-1   6/6   Running   0   9m39s

The alertmanager pod doesn't have the `web.external-url` arg either:

$ k get pod alertmanager-main-0 -ojson | jq '.spec.containers[0].args'
[
  "--config.file=/etc/alertmanager/config/alertmanager.yaml",
  "--storage.path=/alertmanager",
  "--data.retention=120h",
  "--cluster.listen-address=[$(POD_IP)]:9094",
  "--web.listen-address=127.0.0.1:9093",
  "--web.route-prefix=/",
  "--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
  "--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
  "--cluster.reconnect-timeout=5m"
]

Are you sure your cluster is at 4.11.0-0.nightly-2022-03-04-063157 and not some other version that doesn't have the patch?

> I kindly think the scenario in Comment 0 is invalid based on the --web.external-url help

It can happen during cluster creation and results in a crashlooping alertmanager pod; that is what the bug is about.
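For reference, a sketch of the guard this fixed behavior implies: when status.consoleURL is not a usable absolute URL, omit --web.external-url entirely and let Alertmanager derive the URL itself, as its --help text says it will. The function name and shape are assumptions, not the CMO's actual code:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// externalURLArgs returns the --web.external-url flag only when the
// console URL is a usable absolute URL; otherwise it returns nothing,
// matching the verified behavior above where the statefulset simply
// has no such argument. (Hypothetical sketch, not the CMO's code.)
func externalURLArgs(consoleURL string) []string {
	u, err := url.Parse(consoleURL)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return nil // omit the flag; Alertmanager derives the URL itself
	}
	u.Path = path.Join(u.Path, "monitoring")
	return []string{"--web.external-url=" + u.String()}
}

func main() {
	// Prints the flag pointing at <console URL>/monitoring.
	fmt.Println(externalURLArgs("https://console-openshift-console.apps.example.com"))
	// Prints an empty list: the flag is omitted when the status is unset.
	fmt.Println(externalURLArgs(""))
}
```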
Tested with 4.11.0-0.nightly-2022-03-09-235248; no issue now.

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
(no result)

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6   Running   0   76s
alertmanager-main-1                            6/6   Running   0   109s
cluster-monitoring-operator-5699fc45d8-q7m9r   2/2   Running   0   118s
prometheus-k8s-0                               6/6   Running   0   86s
prometheus-k8s-1                               6/6   Running   0   104s

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-q7m9r
pod "cluster-monitoring-operator-5699fc45d8-q7m9r" deleted

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6   Running   0   41s
alertmanager-main-1                            6/6   Running   0   73s
cluster-monitoring-operator-5699fc45d8-2dvbg   2/2   Running   0   83s
prometheus-k8s-0                               6/6   Running   0   52s
prometheus-k8s-1                               6/6   Running   0   68s

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
(no result)

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
(no result)

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
(no result)

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

Restore the cluster:

# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-console-operator scale deploy console-operator --replicas=1
# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-2dvbg
pod "cluster-monitoring-operator-5699fc45d8-4786r" deleted

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6   Running   0   52s
alertmanager-main-1                            6/6   Running   0   85s
cluster-monitoring-operator-5699fc45d8-cv4wv   2/2   Running   0   94s
prometheus-k8s-0                               6/6   Running   0   63s
prometheus-k8s-1                               6/6   Running   0   80s

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069