Created attachment 1864022
linuxptp-daemon.log

Description of problem:

linuxptp-daemon-container reports SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) errors on DU node deployed via ZTP process

Version-Release number of selected component (if applicable):

4.10.0-rc.6
ptp-operator.4.10.0-202202222110

How reproducible:

100%

Steps to Reproduce:

1. Deploy DU node via ZTP process, ptp config set in: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.10/policygentemplates/group-du-sno-ranGen.yaml#L48-L57
2. Wait for the deployment and configuration to complete
3. Check linuxptp-daemon-container logs:

oc -n openshift-ptp logs linuxptp-daemon-cwtmw -c linuxptp-daemon-container -f

Actual results:

ptp4l[8376.097]: [ptp4l.0.config] port 1: FAULTY to LISTENING on INIT_COMPLETE
ptp4l[8376.137]: [ptp4l.0.config] port 1: new foreign master b47af1.fffe.7b20e2-1
ptp4l[8376.362]: [ptp4l.0.config] selected best master clock b47af1.fffe.7b20e2
ptp4l[8376.362]: [ptp4l.0.config] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[8376.367]: [ptp4l.0.config] master offset -25954588102 s2 freq -900000000 path delay 3119884
ptp4l[8376.371]: [ptp4l.0.config] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[8376.423]: [ptp4l.0.config] master offset -25905432639 s2 freq -900000000 path delay 3119884
ptp4l[8376.480]: [ptp4l.0.config] master offset -25847641674 s2 freq -900000000 path delay 1656547
ptp4l[8376.500]: [ptp4l.0.config] timed out while polling for tx timestamp
ptp4l[8376.500]: [ptp4l.0.config] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[8376.500]: [ptp4l.0.config] port 1: send delay request failed
ptp4l[8376.500]: [ptp4l.0.config] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] reconfiguring after port state change
phc2sys[8377.137]: [ptp4l.0.config] selecting ens2f2 for synchronization
phc2sys[8377.137]: [ptp4l.0.config] nothing to synchronize

Expected results:

No faults

Additional info:

nic info:

[root@sno core]# ethtool -i ens2f2
driver: ice
version: 4.18.0-305.34.2.rt7.107.el8_4.x
firmware-version: 2.10 0x8000433d 1.2789.0
expansion-rom-version:
bus-info: 0000:b2:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

lspci -s b2:00.0 -v
b2:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
	Subsystem: Intel Corporation Ethernet Network Adapter E810-XXV-4
	Physical Slot: 2
	Flags: bus master, fast devsel, latency 0, IRQ 42, NUMA node 1, IOMMU group 93
	Memory at de000000 (64-bit, prefetchable) [size=32M]
	Memory at e6000000 (64-bit, prefetchable) [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=512 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [150] Device Serial Number b4-96-91-ff-ff-a5-7b-04
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1b0] Access Control Services
	Capabilities: [1d0] Secondary PCI Express
	Capabilities: [200] Data Link Feature <?>
	Capabilities: [210] Physical Layer 16.0 GT/s <?>
	Capabilities: [250] Lane Margining at the Receiver <?>
	Kernel driver in use: ice
	Kernel modules: ice
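
For anyone trying to reproduce this, the log check in step 3 can be scripted. The following is a minimal sketch, assuming the ptp4l log lines keep the shape shown under "Actual results" above. The function name `faulty_transitions` and the `Transition` tuple are hypothetical helpers made up for illustration and are not part of linuxptp or the ptp-operator.

```python
import re
from collections import namedtuple

# Hypothetical record type for one port state change reported by ptp4l.
Transition = namedtuple("Transition", "ts port old new event")

# Matches lines such as:
#   ptp4l[8376.500]: [ptp4l.0.config] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
# The "[ptp4l.0.config]" prefix is treated as optional.
STATE_CHANGE = re.compile(
    r"ptp4l\[(?P<ts>\d+\.\d+)\]:\s+(?:\[[^\]]*\]\s+)?"
    r"port (?P<port>\d+): (?P<old>[A-Z_]+) to (?P<new>[A-Z_]+) on (?P<event>[A-Z_]+)"
)


def faulty_transitions(log_text):
    """Return every port state change in log_text that ends in FAULTY."""
    found = []
    for m in STATE_CHANGE.finditer(log_text):
        if m.group("new") == "FAULTY":
            found.append(
                Transition(
                    ts=float(m.group("ts")),
                    port=int(m.group("port")),
                    old=m.group("old"),
                    new=m.group("new"),
                    event=m.group("event"),
                )
            )
    return found
```

Feeding it the saved output of the `oc ... logs` command from step 3 should list one entry per fault; an empty list corresponds to the expected "No faults" result.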
Marius,

Have you provisioned the mitigation required for https://bugzilla.redhat.com/show_bug.cgi?id=1992173? See https://bugzilla.redhat.com/show_bug.cgi?id=1992173#c19.

/KenY
(In reply to Ken Young from comment #1)
> Marius,
> 
> Have you set the mitigation required for
> https://bugzilla.redhat.com/show_bug.cgi?id=1992173 provisioned? See
> https://bugzilla.redhat.com/show_bug.cgi?id=1992173#c19.
> 
> /KenY

I haven't set the mitigation mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1992173#c19. I tried changing the priority (chrt -f -p 65 $pid) of the existing ice-ptp and ptp4l processes but the linuxptp-daemon-container log shows the same error.

Nevertheless, I see the BZ mentions a more recent NIC firmware than what I have on my system, so I'll try updating the firmware and re-try.
After the firmware update and adjusting the priorities I can no longer see the faults in the ptp logs.

Ofer also noticed that the ptp config set on my machine was using `network_transport UDPv4` while it should be `network_transport L2`. This config comes from the ZTP source CRs:

https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/source-crs/PtpConfigSlave.yaml#L99
https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/source-crs/PtpConfigSlaveCvl.yaml#L100

@Ken, I updated this BZ to keep track of updating the ptp configs source CRs to use `network_transport L2` instead of `network_transport UDPv4` as I understand L2 is the supported mode currently.
(In reply to Marius Cornea from comment #3)
> After the firmware update and adjusting the priorities I can no longer see
> the faults in the ptp logs.
> 
> Ofer also noticed that the ptp config set on my machine was using
> `network_transport UDPv4` while it should be `network_transport L2`. This
> config comes from the ZTP source CRs:
> 
> https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/
> source-crs/PtpConfigSlave.yaml#L99
> https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/
> source-crs/PtpConfigSlaveCvl.yaml#L100
> 
> @Ken, I updated this BZ to keep track of updating the ptp configs source CRs
> to use `network_transport L2` instead of `network_transport UDPv4` as I
> understand L2 is the supported mode currently.

While the transport is set to UDPv4 in the config file options, it is overridden by the command line options. The command line options for ptp4l select the IEEE 802.3 transport:

https://github.com/openshift-kni/cnf-features-deploy/blob/d521e22a7c1a8dcd0a76f2c4659da8736defec49/ztp/source-crs/PtpConfigSlave.yaml#L13

ptp4lOpts: "-2 -s --summary_interval -4"

The "-2" selects the IEEE 802.3 transport, according to https://linux.die.net/man/8/ptp4l

It's therefore possible that the observed behavior is not related to the ptp4l configuration. Having said that, it's probably a good idea to remove duplicate and seemingly conflicting settings from ptp4lOpts and ptp4lConf to reduce confusion, but this is not a functional / performance issue.
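
To make the override described above concrete, here is a rough sketch of how the effective transport could be worked out from a ptp4lOpts string and a ptp4lConf body. It is only an illustration: the function names are hypothetical, the list of short options that take a value (`-f`, `-i`, `-p`, `-l`) and the rule that every `--long` option is followed by one value are assumptions, and the real parsing is done by ptp4l itself. Note that the trailing "-4" in the ptp4lOpts above is the value of `--summary_interval`, not the UDP IPv4 flag, which is why the sketch skips option values.

```python
import shlex

# Transport selection flags from the ptp4l man page.
FLAG_TO_TRANSPORT = {"-2": "L2", "-4": "UDPv4", "-6": "UDPv6"}
# Assumption: short options that consume the next token as their value.
SHORT_WITH_VALUE = {"-f", "-i", "-p", "-l"}


def conf_transport(ptp4l_conf, default="UDPv4"):
    """Return network_transport from the config text, or the default."""
    transport = default
    for line in ptp4l_conf.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "network_transport":
            transport = parts[1]
    return transport


def cli_transport(ptp4l_opts):
    """Return the transport picked on the command line, or None."""
    tokens = shlex.split(ptp4l_opts)
    transport = None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            # Assumption: "--name value" or "--name=value"; skip the value.
            i += 1 if "=" in tok else 2
            continue
        if tok in SHORT_WITH_VALUE:
            i += 2
            continue
        if tok in FLAG_TO_TRANSPORT:
            transport = FLAG_TO_TRANSPORT[tok]
        i += 1
    return transport


def effective_transport(ptp4l_opts, ptp4l_conf):
    """Command line wins over the config file, as described in this comment."""
    return cli_transport(ptp4l_opts) or conf_transport(ptp4l_conf)
```

With the values from the source CR, `effective_transport("-2 -s --summary_interval -4", "[global]\nnetwork_transport UDPv4\n")` resolves to L2 under these assumptions, which matches the reading that the config file setting was not in effect.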
Validated the issue with the PR that changed the configuration from UDP to L2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069