Bug 2060492 - Update PtpConfigSlave source-crs to use network_transport L2 instead of UDPv4
Summary: Update PtpConfigSlave source-crs to use network_transport L2 instead of UDPv4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joseph Richard
QA Contact: obochan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-03 15:42 UTC by Marius Cornea
Modified: 2022-08-10 10:52 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
network_transport should be L2 and not UDPv4 in all documented ptp configs (e.g. https://docs.openshift.com/container-platform/4.9/networking/using-ptp.html#configuring-linuxptp-services-as-boundary-clock_using-ptp). Note that this applies in all versions as we do not support UDPv4 ptp.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:52:11 UTC
Target Upstream Version:
Embargoed:


Attachments
linuxptp-daemon.log (658.75 KB, text/plain)
2022-03-03 15:42 UTC, Marius Cornea


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift-kni cnf-features-deploy pull 1042 | 0 | None | open | Bug 2060492: ztp: Update network_transport to L2 | 2022-04-07 14:50:18 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:52:41 UTC

Internal Links: 2084195

Description Marius Cornea 2022-03-03 15:42:46 UTC
Created attachment 1864022
linuxptp-daemon.log

Description of problem:

linuxptp-daemon-container reports "SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)" errors on a DU node deployed via the ZTP process

Version-Release number of selected component (if applicable):
4.10.0-rc.6
ptp-operator.4.10.0-202202222110

How reproducible:
100%

Steps to Reproduce:
1. Deploy a DU node via the ZTP process, with the ptp config set in:

http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-1-4.10/policygentemplates/group-du-sno-ranGen.yaml#L48-L57

2. Wait for the deployment and configuration to complete

3. Check linuxptp-daemon-container logs:

oc -n openshift-ptp logs linuxptp-daemon-cwtmw -c linuxptp-daemon-container  -f

Actual results:

ptp4l[8376.097]: [ptp4l.0.config] port 1: FAULTY to LISTENING on INIT_COMPLETE
ptp4l[8376.137]: [ptp4l.0.config] port 1: new foreign master b47af1.fffe.7b20e2-1
ptp4l[8376.362]: [ptp4l.0.config] selected best master clock b47af1.fffe.7b20e2
ptp4l[8376.362]: [ptp4l.0.config] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[8376.367]: [ptp4l.0.config] master offset -25954588102 s2 freq -900000000 path delay   3119884
ptp4l[8376.371]: [ptp4l.0.config] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[8376.423]: [ptp4l.0.config] master offset -25905432639 s2 freq -900000000 path delay   3119884
ptp4l[8376.480]: [ptp4l.0.config] master offset -25847641674 s2 freq -900000000 path delay   1656547
ptp4l[8376.500]: [ptp4l.0.config] timed out while polling for tx timestamp
ptp4l[8376.500]: [ptp4l.0.config] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[8376.500]: [ptp4l.0.config] port 1: send delay request failed
ptp4l[8376.500]: [ptp4l.0.config] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] port b49691.fffe.a57b06-1 changed state
phc2sys[8377.136]: [ptp4l.0.config] reconfiguring after port state change
phc2sys[8377.137]: [ptp4l.0.config] selecting ens2f2 for synchronization
phc2sys[8377.137]: [ptp4l.0.config] nothing to synchronize
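
To pick transitions like the one above out of a long linuxptp-daemon log, a small filter can help. The following is only a sketch: it assumes the line layout shown in this excerpt (`ptp4l[<timestamp>]: [<config>] port <n>: <OLD> to <NEW> on <EVENT>`), and the function name `faulty_transitions` is made up for illustration, not part of linuxptp or the ptp-operator.

```python
import re

# Matches ptp4l port state-change lines in the layout seen in this log excerpt.
# The layout is an assumption taken from the lines above, not a documented format.
STATE_CHANGE = re.compile(
    r"^ptp4l\[(?P<ts>\d+\.\d+)\]: \[(?P<cfg>[^\]]+)\] "
    r"port (?P<port>\d+): (?P<old>\w+) to (?P<new>\w+) on (?P<event>.+)$"
)


def faulty_transitions(lines):
    """Return (timestamp, port, previous_state, event) for each move into FAULTY."""
    found = []
    for line in lines:
        match = STATE_CHANGE.match(line.strip())
        if match and match.group("new") == "FAULTY":
            found.append(
                (
                    float(match.group("ts")),
                    int(match.group("port")),
                    match.group("old"),
                    match.group("event"),
                )
            )
    return found
```

Feeding it the output of the `oc logs` command from step 3 (for example, read from a saved file) lists only the moments a port entered FAULTY, together with the event that triggered the change.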


Expected results:

No faults

Additional info:

nic info:

[root@sno core]# ethtool -i ens2f2
driver: ice
version: 4.18.0-305.34.2.rt7.107.el8_4.x
firmware-version: 2.10 0x8000433d 1.2789.0
expansion-rom-version: 
bus-info: 0000:b2:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

 lspci -s b2:00.0 -v
b2:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for SFP (rev 02)
	Subsystem: Intel Corporation Ethernet Network Adapter E810-XXV-4
	Physical Slot: 2
	Flags: bus master, fast devsel, latency 0, IRQ 42, NUMA node 1, IOMMU group 93
	Memory at de000000 (64-bit, prefetchable) [size=32M]
	Memory at e6000000 (64-bit, prefetchable) [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=512 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [150] Device Serial Number b4-96-91-ff-ff-a5-7b-04
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1b0] Access Control Services
	Capabilities: [1d0] Secondary PCI Express
	Capabilities: [200] Data Link Feature <?>
	Capabilities: [210] Physical Layer 16.0 GT/s <?>
	Capabilities: [250] Lane Margining at the Receiver <?>
	Kernel driver in use: ice
	Kernel modules: ice

Comment 2 Marius Cornea 2022-03-08 13:00:22 UTC
(In reply to Ken Young from comment #1)
> Marius,
> 
> Have you set the mitigation required for
> https://bugzilla.redhat.com/show_bug.cgi?id=1992173 provisioned?  See
> https://bugzilla.redhat.com/show_bug.cgi?id=1992173#c19.
> 
> /KenY

I haven't set the mitigation mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1992173#c19. I tried changing the priority (chrt -f -p 65 $pid) of the existing ice-ptp and ptp4l processes, but the linuxptp-daemon-container log still shows the same error.

Nevertheless, I see the BZ mentions a more recent NIC firmware than what I have on my system, so I'll try updating the firmware and retry.

Comment 3 Marius Cornea 2022-03-09 13:22:58 UTC
After the firmware update and adjusting the priorities I can no longer see the faults in the ptp logs.

Ofer also noticed that the ptp config set on my machine was using `network_transport UDPv4` while it should be `network_transport L2`. This config comes from the ZTP source CRs:

https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/source-crs/PtpConfigSlave.yaml#L99
https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/source-crs/PtpConfigSlaveCvl.yaml#L100

@Ken, I updated this BZ to keep track of updating the ptp configs source CRs to use `network_transport L2` instead of `network_transport UDPv4` as I understand L2 is the supported mode currently.

Comment 6 Vitaly Grinberg 2022-03-23 18:06:50 UTC
(In reply to Marius Cornea from comment #3)
> After the firmware update and adjusting the priorities I can no longer see
> the faults in the ptp logs.
> 
> Ofer also noticed that the ptp config set on my machine was using
> `network_transport UDPv4` while it should be `network_transport L2`. This
> config comes from the ZTP source CRs:
> 
> https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/
> source-crs/PtpConfigSlave.yaml#L99
> https://github.com/openshift-kni/cnf-features-deploy/blob/release-4.10/ztp/
> source-crs/PtpConfigSlaveCvl.yaml#L100
> 
> @Ken, I updated this BZ to keep track of updating the ptp configs source CRs
> to use `network_transport L2` instead of `network_transport UDPv4` as I
> understand L2 is the supported mode currently.

While the transport is set to UDPv4 in the config file options, it is overridden by the command line options.
The command line options for ptp4l select the IEEE 802.3 transport:
https://github.com/openshift-kni/cnf-features-deploy/blob/d521e22a7c1a8dcd0a76f2c4659da8736defec49/ztp/source-crs/PtpConfigSlave.yaml#L13

ptp4lOpts: "-2 -s --summary_interval -4"
The "-2" is for selecting the IEEE 802.3 transport, according to https://linux.die.net/man/8/ptp4l
It's therefore possible that the observed behavior is not related to the ptp4l configuration.
Having said that, it's probably a good idea to remove duplicate and seemingly conflicting settings from ptp4lOpts and ptp4lConf to reduce confusion, but this is not a functional / performance issue.
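
A check for this kind of mismatch could look roughly like the sketch below. It assumes the transport flags from the ptp4l man page (`-2`, `-4`, `-6`) and the `network_transport` values `L2`, `UDPv4`, `UDPv6`. The function names and the short list of options that take an argument are illustrative assumptions, not part of the ptp-operator. Note that in `--summary_interval -4` the `-4` is the value of the long option, not the UDP IPv4 flag, so the parser skips it.

```python
import shlex

# Transport selection flags as listed in the ptp4l man page.
FLAG_TO_TRANSPORT = {"-2": "L2", "-4": "UDPv4", "-6": "UDPv6"}
# Short options that consume the next token (simplified subset, an assumption).
SHORT_WITH_ARG = {"-f", "-i", "-p", "-l"}


def transport_from_opts(opts):
    """Return the transport selected in a ptp4lOpts string, or None."""
    tokens = shlex.split(opts)
    selected = None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            # Long options carry a value: "--name=value" or "--name value".
            i += 1 if "=" in tok else 2
            continue
        if tok in SHORT_WITH_ARG:
            i += 2  # skip the option and its argument
            continue
        if tok in FLAG_TO_TRANSPORT:
            selected = FLAG_TO_TRANSPORT[tok]
        i += 1
    return selected


def transport_from_conf(conf):
    """Return the network_transport value from ptp4lConf text, or None."""
    for line in conf.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "network_transport":
            return parts[1]
    return None


def find_conflict(opts, conf):
    """Describe a mismatch between ptp4lOpts and ptp4lConf, or return None."""
    cli = transport_from_opts(opts)
    cfg = transport_from_conf(conf)
    if cli and cfg and cli != cfg:
        return f"ptp4lOpts selects {cli} but ptp4lConf sets network_transport {cfg}"
    return None
```

For the values discussed here, `find_conflict("-2 -s --summary_interval -4", "[global]\nnetwork_transport UDPv4\n")` reports that ptp4lOpts selects L2 while ptp4lConf sets UDPv4. Once the config is changed to `network_transport L2`, it returns `None`.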

Comment 8 obochan 2022-05-08 07:39:06 UTC
The issue is validated via the PR, which changed the configuration from UDPv4 to L2.

Comment 10 errata-xmlrpc 2022-08-10 10:52:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

