Description of problem:
The PTP Operator defines Prometheus rules that raise out-of-sync alerts based on the max master offset.
1. When fast events are enabled, we should use the sync state metric "openshift_ptp_clock_state" to define the NodeOutOfSync alert.
2. When fast events are not enabled, we should use "openshift_ptp_offset_from_system" to define the NodeOutOfSync alert.

Version-Release number of selected component (if applicable):
4.9/4.10
Hi Aneesh, The problem with using `openshift_ptp_clock_state` is that on telco, if the sync offset goes beyond +-100, the synchronization should be marked as out of sync. One option is to change NodeOutOfSync to use the `clock_state` and create a new alert, NodeHighPtpOffsetSync. WDYT?
@aputtur
@aputtur We reviewed the issue described in the bug and tried to verify it according to the steps detailed above.

1. When fast events are enabled, we should use the sync state metric "openshift_ptp_clock_state" to define the NodeOutOfSync alert.
This means that when pulling the metrics using curl with fast events enabled, we should see only the openshift_ptp_clock_state metric.

2. When fast events are not enabled, we should use "openshift_ptp_offset_from_system" to define the NodeOutOfSync alert.
This means that when pulling the metrics using curl with fast events disabled, we should see only the openshift_ptp_offset_from_system metric.

What we currently see is the following.

When pulling the metrics with fast events enabled:

[marzianor@localhost ptp]$ oc -n openshift-ptp exec linuxptp-daemon-vvr4b -c cloud-event-proxy -- curl 127.0.0.1:9091/metrics | grep "openshift_ptp_clock"
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
openshift_ptp_clock_state{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
openshift_ptp_clock_state{iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1

[marzianor@localhost ptp]$ oc -n openshift-ptp exec linuxptp-daemon-vvr4b -c cloud-event-proxy -- curl 127.0.0.1:9091/metrics | grep "openshift_ptp_offset"
# HELP openshift_ptp_offset_ns
# TYPE openshift_ptp_offset_ns gauge
openshift_ptp_offset_ns{from="master",iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -19
openshift_ptp_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -12
openshift_ptp_offset_ns{from="phc",iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -12

When pulling the metrics with fast events disabled:

[obochan@obochan ptp]$ oc -n openshift-ptp exec linuxptp-daemon-g96xb -c linuxptp-daemon-container -- curl 127.0.0.1:9091/metrics | grep "openshift_ptp_clock"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2237 0 2237 0 0 2184k 0 --:--:-- --:--:-- --:--:-- 2184k
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1

[obochan@obochan ptp]$ oc -n openshift-ptp exec linuxptp-daemon-g96xb -c linuxptp-daemon-container -- curl 127.0.0.1:9091/metrics
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_delay_ns
# TYPE openshift_ptp_delay_ns gauge
openshift_ptp_delay_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 87
openshift_ptp_delay_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 405
# HELP openshift_ptp_frequency_adjustment_ns
# TYPE openshift_ptp_frequency_adjustment_ns gauge
openshift_ptp_frequency_adjustment_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -2372
openshift_ptp_frequency_adjustment_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -77826
# HELP openshift_ptp_interface_role 0 = PASSIVE, 1 = SLAVE, 2 = MASTER, 3 = FAULTY, 4 = UNKNOWN
# TYPE openshift_ptp_interface_role gauge
openshift_ptp_interface_role{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_max_offset_ns
# TYPE openshift_ptp_max_offset_ns gauge
openshift_ptp_max_offset_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -5
openshift_ptp_max_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
# HELP openshift_ptp_offset_ns
# TYPE openshift_ptp_offset_ns gauge
openshift_ptp_offset_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -5
openshift_ptp_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 44
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
100 2239 0 2239 0 0 2186k 0 --:--:-- --:--:-- --:--:-- 2186k

As you can see above, in both examples (enabled and disabled) the clock_state metric is present, but according to our understanding, when events are disabled we should see the offset metric instead. Please advise whether this is the way we want it to work, or whether the issue wasn't fixed accordingly.
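For comparing dumps like the ones above without relying on grep, a minimal Python sketch could group the samples by metric name. The function name `parse_metrics` and the regular expressions are illustrative assumptions, not part of the operator or of any Prometheus library; the parser is deliberately simplified and does not handle escaped quotes or `}` inside label values.

```python
import re
from collections import defaultdict

# Matches one sample line of the Prometheus text exposition format:
#   name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Group samples by metric name; '# HELP' and '# TYPE' lines are skipped."""
    families = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line, e.g. curl progress-meter output
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        families[match.group("name")].append((labels, float(match.group("value"))))
    return dict(families)


# Excerpt of the "fast events disabled" output quoted in this comment.
SAMPLE_TEXT = """\
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# TYPE openshift_ptp_offset_ns gauge
openshift_ptp_offset_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -5
openshift_ptp_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
"""

families = parse_metrics(SAMPLE_TEXT)
for name, samples in sorted(families.items()):
    print(name, [(labels["iface"], value) for labels, value in samples])
```

Feeding it the full curl output would show at a glance that both `openshift_ptp_clock_state` and `openshift_ptp_offset_ns` are exported in either mode, which is the observation this comment raises.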
OK, I understand. So according to what you say, the logs show the correct information. That means openshift_ptp_offset_from_system was changed to openshift_ptp_offset_ns. Can you confirm? Ofer.
OK. That bug is in VERIFIED state. I could mark it as a duplicate and reopen the first bug. Please advise how you want to deal with it.
Should we reopen this issue, https://bugzilla.redhat.com/show_bug.cgi?id=2019198, or do we want to handle the naming convention as agreed in comments 6 and 7? Please advise.
From what I could see, when I changed the PtpOperatorConfig and disabled the events, the metrics stopped. Is that the expected behavior? Are the metrics only enabled when the sidecar (events) is enabled? Please advise what the expected behavior is. Below is the output for the two options, disabled and enabled.

[obochan@obochan ptp]$ cat /tmp/event_disable
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_delay_ns
# TYPE openshift_ptp_delay_ns gauge
openshift_ptp_delay_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 83
openshift_ptp_delay_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 407
# HELP openshift_ptp_frequency_adjustment_ns
# TYPE openshift_ptp_frequency_adjustment_ns gauge
openshift_ptp_frequency_adjustment_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -2392
openshift_ptp_frequency_adjustment_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -77616
# HELP openshift_ptp_interface_role 0 = PASSIVE, 1 = SLAVE, 2 = MASTER, 3 = FAULTY, 4 = UNKNOWN
# TYPE openshift_ptp_interface_role gauge
openshift_ptp_interface_role{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_max_offset_ns
# TYPE openshift_ptp_max_offset_ns gauge
openshift_ptp_max_offset_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 3
openshift_ptp_max_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 3
# HELP openshift_ptp_offset_ns
# TYPE openshift_ptp_offset_ns gauge
openshift_ptp_offset_ns{from="master",iface="master",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 3
openshift_ptp_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 3
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 166
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

[obochan@obochan ptp]$ cat /tmp/event_enable
# HELP cne_api_events_published Metric to get number of events published by the rest api
# TYPE cne_api_events_published gauge
cne_api_events_published{address="/cluster/node/cnfde7.ptp.lab.eng.bos.redhat.com/ptp",status="success"} 44
# HELP cne_api_publishers Metric to get number of publishers
# TYPE cne_api_publishers gauge
cne_api_publishers{status="active"} 1
# HELP cne_events_ack Metric to get number of events produced
# TYPE cne_events_ack gauge
cne_events_ack{status="success",type="/cluster/node/cnfde7.ptp.lab.eng.bos.redhat.com/ptp"} 44
# HELP openshift_ptp_clock_state 0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
# TYPE openshift_ptp_clock_state gauge
openshift_ptp_clock_state{iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
openshift_ptp_clock_state{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 1
openshift_ptp_clock_state{iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_delay_ns
# TYPE openshift_ptp_delay_ns gauge
openshift_ptp_delay_ns{from="master",iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 82
openshift_ptp_delay_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 406
openshift_ptp_delay_ns{from="phc",iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 406
# HELP openshift_ptp_frequency_adjustment_ns
# TYPE openshift_ptp_frequency_adjustment_ns gauge
openshift_ptp_frequency_adjustment_ns{from="master",iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -2395
openshift_ptp_frequency_adjustment_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -77590
openshift_ptp_frequency_adjustment_ns{from="phc",iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} -77590
# HELP openshift_ptp_interface_role 0 = PASSIVE, 1 = SLAVE, 2 = MASTER, 3 = FAULTY, 4 = UNKNOWN
# TYPE openshift_ptp_interface_role gauge
openshift_ptp_interface_role{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 1
# HELP openshift_ptp_max_offset_ns
# TYPE openshift_ptp_max_offset_ns gauge
openshift_ptp_max_offset_ns{from="master",iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} 78
openshift_ptp_max_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 99
openshift_ptp_max_offset_ns{from="phc",iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 99
# HELP openshift_ptp_offset_ns
# TYPE openshift_ptp_offset_ns gauge
openshift_ptp_offset_ns{from="master",iface="ens5fx",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="ptp4l"} -8
openshift_ptp_offset_ns{from="phc",iface="CLOCK_REALTIME",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 16
openshift_ptp_offset_ns{from="phc",iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",process="phc2sys"} 16
# HELP openshift_ptp_threshold
# TYPE openshift_ptp_threshold gauge
openshift_ptp_threshold{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",threshold="HoldOverTimeout"} 60
openshift_ptp_threshold{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",threshold="MaxOffsetThreshold"} 100
openshift_ptp_threshold{iface="ens5f1",node="cnfde7.ptp.lab.eng.bos.redhat.com",threshold="MinOffsetThreshold"} -100
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 11
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
[obochan@obochan ptp]$
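To see which metric families differ between the two dumps, one option is to compare the names declared on their `# TYPE` lines. The sketch below is only an illustration: the helper name `metric_families` is an assumption, and the two strings are abbreviated to the `# TYPE` lines of `/tmp/event_disable` and `/tmp/event_enable` quoted above.

```python
def metric_families(text):
    """Return the metric names declared on '# TYPE <name> <type>' lines."""
    names = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


# '# TYPE' lines copied from the /tmp/event_disable dump.
EVENT_DISABLE = """\
# TYPE openshift_ptp_clock_state gauge
# TYPE openshift_ptp_delay_ns gauge
# TYPE openshift_ptp_frequency_adjustment_ns gauge
# TYPE openshift_ptp_interface_role gauge
# TYPE openshift_ptp_max_offset_ns gauge
# TYPE openshift_ptp_offset_ns gauge
# TYPE promhttp_metric_handler_requests_in_flight gauge
# TYPE promhttp_metric_handler_requests_total counter
"""

# '# TYPE' lines copied from the /tmp/event_enable dump.
EVENT_ENABLE = """\
# TYPE cne_api_events_published gauge
# TYPE cne_api_publishers gauge
# TYPE cne_events_ack gauge
# TYPE openshift_ptp_clock_state gauge
# TYPE openshift_ptp_delay_ns gauge
# TYPE openshift_ptp_frequency_adjustment_ns gauge
# TYPE openshift_ptp_interface_role gauge
# TYPE openshift_ptp_max_offset_ns gauge
# TYPE openshift_ptp_offset_ns gauge
# TYPE openshift_ptp_threshold gauge
# TYPE promhttp_metric_handler_requests_in_flight gauge
# TYPE promhttp_metric_handler_requests_total counter
"""

disabled = metric_families(EVENT_DISABLE)
enabled = metric_families(EVENT_ENABLE)

only_enabled = sorted(enabled - disabled)
only_disabled = sorted(disabled - enabled)
print("only with events enabled:", only_enabled)
print("only with events disabled:", only_disabled)
```

In practice the two strings would be replaced by the contents of the saved files. On the dumps in this comment, the only families specific to the enabled case are the `cne_*` metrics and `openshift_ptp_threshold`; all `openshift_ptp_*` offset and state metrics appear in both.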
When events are enabled, the NodeOutOfSync alert is generated via:

alert: NodeOutOfPtpSync
expr: openshift_ptp_clock_state != 1
for: 2m
labels:
  severity: warning
annotations:
  message: |
    {{ $labels.iface }} is not in sync

When events are disabled, the NodeOutOfSync alert is generated via:

alert: HighPtpSyncOffset
expr: openshift_ptp_offset_ns > 100 or openshift_ptp_offset_ns < -100
for: 2m
labels:
  severity: warning
annotations:
  message: |
    All nodes should have ptp sync offset lower then 100
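As a rough illustration of what the `openshift_ptp_clock_state != 1` expression selects, the Python sketch below applies the same comparison to a list of samples. It is not how Prometheus evaluates rules: it checks a single instant and ignores the `for: 2m` clause. The function name and the sample values are assumptions; in particular, the HOLDOVER value for `ens5f1` is hypothetical, since every state reported in this bug is 1 (LOCKED).

```python
# State names taken from the metric's HELP text:
#   0 = FREERUN, 1 = LOCKED, 2 = HOLDOVER
CLOCK_STATES = {0: "FREERUN", 1: "LOCKED", 2: "HOLDOVER"}
LOCKED = 1


def node_out_of_ptp_sync(samples):
    """Instant check mirroring 'openshift_ptp_clock_state != 1' (no 'for' duration)."""
    alerts = []
    for labels, value in samples:
        if value != LOCKED:
            alerts.append({
                "iface": labels["iface"],
                "state": CLOCK_STATES.get(int(value), "UNKNOWN"),
                # Same text the rule renders from '{{ $labels.iface }} is not in sync'.
                "message": f'{labels["iface"]} is not in sync',
            })
    return alerts


samples = [
    ({"iface": "CLOCK_REALTIME", "process": "phc2sys"}, 1),
    ({"iface": "ens5f1", "process": "phc2sys"}, 2),  # hypothetical HOLDOVER value
    ({"iface": "ens5fx", "process": "ptp4l"}, 1),
]

alerts = node_out_of_ptp_sync(samples)
for alert in alerts:
    print(alert["message"], f'({alert["state"]})')
```

With real data, where all three interfaces report 1, the function returns an empty list, so no NodeOutOfPtpSync alert would be pending.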
To validate the alert while events are disabled, we had to change the Prometheus thresholds of the NodeOutOfSync alert:

[obochan@obochan ptp]$ oc edit prometheusrules.monitoring.coreos.com -n openshift-ptp ptp-rules
prometheusrule.monitoring.coreos.com/ptp-rules edited
[obochan@obochan ptp]$ oc edit prometheusrules.monitoring.coreos.com -n openshift-ptp ptp-rules
prometheusrule.monitoring.coreos.com/ptp-rules edited
[obochan@obochan ptp]$ oc get prometheusrules.monitoring.coreos.com -n openshift-ptp ptp-rules -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2022-01-24T14:11:34Z"
  generation: 8
  labels:
    prometheus: k8s
    role: alert-rules
  name: ptp-rules
  namespace: openshift-ptp
  ownerReferences:
  - apiVersion: ptp.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: PtpOperatorConfig
    name: default
    uid: 4cda9c30-aa26-48af-8a92-52ac2baf826e
  resourceVersion: "744823"
  uid: b27b772b-6c5f-4d86-bbb9-010327aa571f
spec:
  groups:
  - name: ptp.rules
    rules:
    - alert: HighPtpSyncOffset
      annotations:
        message: |
          All nodes should have ptp sync offset lower then 100
      expr: |
        openshift_ptp_offset_ns > 2 or openshift_ptp_offset_ns < -2
      for: 2m
      labels:
        severity: warning

The alert then shows as active in Prometheus:

HighPtpSyncOffset (2 active)

alert: HighPtpSyncOffset
expr: openshift_ptp_offset_ns > 2 or openshift_ptp_offset_ns < -2
for: 2m
labels:
  severity: warning
annotations:
  message: |
    All nodes should have ptp sync offset lower then 100
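The effect of lowering the threshold can be sketched in Python by applying the rule's comparison to the offsets from the earlier `/tmp/event_disable` dump. This is an assumption-laden illustration: the function name is made up, the check covers one instant only (no `for: 2m`), and the offsets at the time the alert actually fired may have differed from the dump.

```python
def high_ptp_sync_offset(samples, threshold):
    """Instant check mirroring:
    openshift_ptp_offset_ns > threshold or openshift_ptp_offset_ns < -threshold
    """
    return [
        (labels, value)
        for labels, value in samples
        if value > threshold or value < -threshold
    ]


# openshift_ptp_offset_ns samples from the /tmp/event_disable dump.
samples = [
    ({"from": "master", "iface": "master", "process": "ptp4l"}, 3),
    ({"from": "phc", "iface": "CLOCK_REALTIME", "process": "phc2sys"}, 3),
]

default_hits = high_ptp_sync_offset(samples, 100)  # threshold shipped in the rule
lowered_hits = high_ptp_sync_offset(samples, 2)    # threshold after 'oc edit'

print("threshold 100:", len(default_hits), "series match")
print("threshold 2:", len(lowered_hits), "series match")
```

With offsets of a few nanoseconds, nothing matches at the default threshold of 100, while both series match at 2, which is consistent with the "2 active" count shown above.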
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056