Description of problem: After link flap or NetworkManager restart node goes NotReady, is unreachable on the network and does not recover without reboot. Version-Release number of selected component (if applicable): 4.10.0-0.nightly-2022-01-30-073053 How reproducible: Always Steps to Reproduce: 1. oc debug node/rbrattai-o410i77-9zd98-master-0 -- chroot /host systemctl restart NetworkManager OR 2. oc debug node/rbrattai-o410i77-9zd98-master-0 -- bash -c 'ip l s ens5 down; sleep 5 ; ip l s ens5 up' Actual results: Node goes NotReady state and stays in that state until reboot. Node is unreachable on the network. several NetworkManager error messages Jan 30 16:53:43.742653 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7426] dhcp4 (br-ex): state changed unknown -> timeout Jan 30 16:53:43.742705 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7426] device (br-ex): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed') Jan 30 16:53:43.743121 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7431] manager: NetworkManager state is now CONNECTED_LOCAL Jan 30 16:53:43.743278 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7432] device (br-ex): releasing ovs interface br-ex Jan 30 16:53:43.743693 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7436] device (br-ex): released from master device br-ex Jan 30 16:53:43.744282 ip-10-0-154-177 NetworkManager[240416] <warn> [1643561623.7442] device (br-ex): Activation: failed for connection 'ovs-if-br-ex' Jan 30 16:53:43.744671 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7446] device (br-ex): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed') Jan 30 16:53:43.746528 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7465] dhcp4 (br-ex): canceled DHCP transaction Jan 30 16:53:43.746540 ip-10-0-154-177 NetworkManager[240416] <info> [1643561623.7465] dhcp4 (br-ex): state changed timeout -> done OVN pods might have issues. Jan 30 16:53:15.902150 ip-10-0-154-177 hyperkube[2095]: E0130 16:53:15.902134 2095 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ovnkube-node\" with CrashLoopBackOff: \"back-off 10s restarting failed container=ovnkube-node pod=ovnkube-node-zlhsx_openshift-ovn-kubernetes(5cb9804c-3f6b-4362-ab23-4281bbdeee41)\"" pod="openshift-ovn-kubernetes/ovnkube-node-zlhsx" podUID=5cb9804c-3f6b-4362-ab23-4281bbdeee41 Expected results: Network recovers, Node recovers, TCP connections recover. Additional info: Reproduced on AWS and IBM Cloud
Both clusters were IPI. systemctl NetworkManager restart reliably causes the issue. ip link down and up doesn't seem to trigger the issue. tcpdump was running on ens5 while NetworkManager restarted during the first occurrence on AWS, so ens5 was in promiscuous mode.
@ross can you set NM logging to 'trace' level and reproduce? nmcli log level trace
start of TRACE Jan 31 16:42:28.532437 rbrattai-o410i11-kvds9-master-1 NetworkManager[1759] <info> [1643647348.5324] manager: logging: level 'TRACE' Jan 31 16:44:01.248819 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <warn> [1643647441.2487] dhcp4 (br-ex): request timed out Jan 31 16:44:01.248847 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <info> [1643647441.2488] dhcp4 (br-ex): state changed unknown -> timeout Jan 31 16:44:01.248881 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2488] device[1af75820cc171c57] (br-ex): new DHCPv4 client state 3 Jan 31 16:44:01.248893 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2488] device[1af75820cc171c57] (br-ex): DHCPv4 failed (ip_state conf, was_active 0) Jan 31 16:44:01.248908 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2489] device[1af75820cc171c57] (br-ex): activation-stage: schedule activate_stage4_ip_config_timeout_4,v4 (id 445) Jan 31 16:44:01.248938 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2489] device[1af75820cc171c57] (br-ex): activation-stage: invoke activate_stage4_ip_config_timeout_4,v4 (id 445) Jan 31 16:44:01.248958 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <info> [1643647441.2489] device (br-ex): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed') Jan 31 16:44:01.248968 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2489] device[1af75820cc171c57] (br-ex): add_pending_action (2): 'in-state-change' Jan 31 16:44:01.248980 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2489] device[1af75820cc171c57] (br-ex): stable-id: clear Jan 31 16:44:01.249356 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2493] active-connection[0x564e83a835b0] set state deactivated (was activating) Jan 31 16:44:01.249483 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <info> [1643647441.2494] manager: NetworkManager state is now CONNECTED_LOCAL Jan 31 16:44:01.249600 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2495] active-connection[0x564e83a835b0] check-master-ready: already signalled (state deactivated, master 0x564e83949dc0 is in state activated) Jan 31 16:44:01.249614 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2496] device[1af75820cc171c57] (br-ex): remove_pending_action (1): 'activation-5' Jan 31 16:44:01.249705 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2496] device[13509969e3080d4b] (br-ex): slave br-ex state change 70 (ip-config) -> 120 (failed) Jan 31 16:44:01.249716 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2497] device[13509969e3080d4b] (br-ex): master: release one slave 0x564e83a5d780/br-ex (enslaved) (configure) Jan 31 16:44:01.249751 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <info> [1643647441.2497] device (br-ex): releasing ovs interface br-ex Jan 31 16:44:01.249764 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2497] ovsdb: call[095be9f13b8df817] new: del-interface interface=br-ex Jan 31 16:44:01.250162 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2501] ovsdb: call[095be9f13b8df817] send: call-id=5, {"id": 5, "method": "transact", "params": ["Open_vSwitch", {"op": "mutate", "table": "Open_vSwitch", "mutations": [["ne> Jan 31 16:44:01.250441 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2504] device[13509969e3080d4b] (br-ex): mtu: commit-mtu... Jan 31 16:44:01.250461 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2504] device[1af75820cc171c57] (br-ex): ifindex: set ifindex 0 Jan 31 16:44:01.251398 rbrattai-o410i11-kvds9-master-1 ovs-vswitchd[1239] ovs|01133|bridge|INFO|bridge br-ex: deleted interface br-ex on port 65534 Jan 31 16:44:01.251902 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <info> [1643647441.2505] device (br-ex): released from master device br-ex Jan 31 16:44:01.252004 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2519] active-connection[0x564e83949dc0] set state-flags is-master,is-slave,layer2-ready,lifetime-bound-to-profile-visibility (was is-master,is-slave,layer2-ready,master-has> Jan 31 16:44:01.252137 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2521] device[1af75820cc171c57] (br-ex): unmanaged: flags set to [!sleeping,!by-type,!platform-init,!user-explicit,!user-settings=0x0/0x79/managed], forget [is-slave=0x1000]> Jan 31 16:44:01.252153 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2521] device[13509969e3080d4b] (br-ex): last slave removed; remaining activated Jan 31 16:44:01.252172 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2521] policy: connection 'ovs-if-br-ex' failed to autoconnect; 3 tries left Jan 31 16:44:01.252185 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2521] settings-connection[ee1b3e5130e70cab,a87c5ca0-696e-48a9-80e9-069102fba7d6] autoconnect: retries set 3 Jan 31 16:44:01.252534 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2524] settings: update[a87c5ca0-696e-48a9-80e9-069102fba7d6] clear-secrets: update profile "ovs-if-br-ex" (not persisted) Jan 31 16:44:01.252574 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2525] settings: storage[a87c5ca0-696e-48a9-80e9-069102fba7d6,a659895e1f528596/keyfile] change event with connection "ovs-if-br-ex" (file "/etc/NetworkManager/system-connect> Jan 31 16:44:01.252607 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2525] settings: update[a87c5ca0-696e-48a9-80e9-069102fba7d6] updating connection "ovs-if-br-ex" (a659895e1f528596/keyfile) Jan 31 16:44:01.253091 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <trace> [1643647441.2530] settings-connection[ee1b3e5130e70cab,a87c5ca0-696e-48a9-80e9-069102fba7d6] update agent secrets: secrets cleared Jan 31 16:44:01.253810 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <debug> [1643647441.2537] manager: (ovs-if-br-ex) already created virtual interface name br-ex Jan 31 16:44:01.253977 rbrattai-o410i11-kvds9-master-1 NetworkManager[421414] <warn> [1643647441.2539] device (br-ex): Activation: failed for connection 'ovs-if-br-ex'
Mini-update: it's possible to recover the node without a reboot by restarting openvswitch `systemctl restart openvswitch`
It must be something in the flows. When I push an action=normal and then restart immediately, this works --- so my best guess so far is that dhcp is somehow being dropped. ~~~ [akaris@linux data]$ oc debug node/ip-10-0-148-33.ec2.internal chStarting pod/ip-10-0-148-33ec2internal-debug ... To use host binaries, run `chroot /host` root /host Pod IP: 10.0.148.33 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-4.4# ovs-ofctl add-flow br-ex action=normal sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=86.094s, table=0, n_packets=2632, n_bytes=4304089, idle_age=28, priority=0 actions=NORMAL cookie=0x0, duration=4.547s, table=0, n_packets=463, n_bytes=706732, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=65.510s, table=0, n_packets=0, n_bytes=0, idle_age=65, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=65.510s, table=0, n_packets=0, n_bytes=0, idle_age=65, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=65.510s, table=0, n_packets=161, n_bytes=29816, idle_age=12, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=65.510s, table=0, n_packets=157, n_bytes=289337, idle_age=12, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=65.509s, table=0, n_packets=248, n_bytes=52394, idle_age=4, priority=205,udp,in_port=1,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=65.509s, table=0, n_packets=0, n_bytes=0, idle_age=65, priority=200,udp,in_port=1,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=65.509s, table=0, n_packets=192, n_bytes=107864, idle_age=4, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:1 sh-4.4# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 8a55c503-4638-4390-b70f-83837e7776ac ovs-interface br-ex br-ex 0f3d0d7b-62b7-4586-8439-2dc678ffe71c ovs-bridge br-ex ovs-if-phys0 6c600c36-6997-40ef-8089-831dfcdea6df ethernet ens5 ovs-port-br-ex a9187acf-8445-4f96-a9c7-a5f5053b9d01 ovs-port br-ex ovs-port-phys0 75223cff-0568-4046-849e-df8c0421dc4a ovs-port ens5 sh-4.4# systemctl restart NetworkManager sh-4.4# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 8a55c503-4638-4390-b70f-83837e7776ac ovs-interface br-ex br-ex 0f3d0d7b-62b7-4586-8439-2dc678ffe71c ovs-bridge br-ex ovs-if-phys0 6c600c36-6997-40ef-8089-831dfcdea6df ethernet ens5 ovs-port-br-ex a9187acf-8445-4f96-a9c7-a5f5053b9d01 ovs-port br-ex ovs-port-phys0 75223cff-0568-4046-849e-df8c0421dc4a ovs-port ens5 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0xdeff105, duration=80.096s, table=0, n_packets=0, n_bytes=0, idle_age=80, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=80.096s, table=0, n_packets=0, n_bytes=0, idle_age=80, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=80.096s, table=0, n_packets=190, n_bytes=34147, idle_age=1, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=80.096s, table=0, n_packets=183, n_bytes=333606, idle_age=0, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=3.287s, table=0, n_packets=18, n_bytes=3635, idle_age=1, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=3.287s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=3.287s, table=0, n_packets=209, n_bytes=122217, idle_age=1, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0xb06186a3a85d964d, duration=3.287s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=110,tcp,in_port=4,tp_dst=31619 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x208c9ca3e7c6379a, duration=3.287s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=110,tcp,in_port=4,tp_dst=31926 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) sh-4.4# ovs-ofctl add-flow br-ex action=normal sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0xdeff105, duration=91.893s, table=0, n_packets=0, n_bytes=0, idle_age=91, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=91.893s, table=0, n_packets=0, n_bytes=0, idle_age=91, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=91.893s, table=0, n_packets=198, n_bytes=34857, idle_age=11, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=91.893s, table=0, n_packets=188, n_bytes=334109, idle_age=11, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=15.084s, table=0, n_packets=73, n_bytes=13799, idle_age=3, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=15.084s, table=0, n_packets=0, n_bytes=0, idle_age=15, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=15.084s, table=0, n_packets=251, n_bytes=139185, idle_age=3, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0xb06186a3a85d964d, duration=15.085s, table=0, n_packets=0, n_bytes=0, idle_age=15, priority=110,tcp,in_port=4,tp_dst=31619 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x208c9ca3e7c6379a, duration=15.085s, table=0, n_packets=0, n_bytes=0, idle_age=15, priority=110,tcp,in_port=4,tp_dst=31926 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) sh-4.4# ovs-ofctl dump-flows br-ex | less sh-4.4# sh-4.4# sh-4.4# (reverse-i-search)`ad': ovs-ofctl dump-flows br-ex | he^C sh-4.4# ovs-ofctl add-flow br-ex action=normal sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=154.017s, table=0, n_packets=2639, n_bytes=4304443, idle_age=30, priority=0 actions=NORMAL cookie=0x0, duration=2.214s, table=0, n_packets=851, n_bytes=1763257, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=133.433s, table=0, n_packets=0, n_bytes=0, idle_age=133, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=133.433s, table=0, n_packets=0, n_bytes=0, idle_age=133, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=133.433s, table=0, n_packets=251, n_bytes=47238, idle_age=8, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=133.433s, table=0, n_packets=235, n_bytes=353153, idle_age=8, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=56.624s, table=0, n_packets=1070, n_bytes=919977, idle_age=2, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=56.624s, table=0, n_packets=0, n_bytes=0, idle_age=56, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=56.624s, table=0, n_packets=1240, n_bytes=493875, idle_age=2, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=156.072s, table=0, n_packets=2639, n_bytes=4304443, idle_age=32, priority=0 actions=NORMAL cookie=0x0, duration=4.269s, table=0, n_packets=1295, n_bytes=2548151, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=135.488s, table=0, n_packets=0, n_bytes=0, idle_age=135, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=135.488s, table=0, n_packets=0, n_bytes=0, idle_age=135, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=135.488s, table=0, n_packets=251, n_bytes=47238, idle_age=10, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=135.488s, table=0, n_packets=235, n_bytes=353153, idle_age=10, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=58.679s, table=0, n_packets=1070, n_bytes=919977, idle_age=4, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=58.679s, table=0, n_packets=0, n_bytes=0, idle_age=58, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=58.679s, table=0, n_packets=1240, n_bytes=493875, idle_age=4, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=157.031s, table=0, n_packets=2639, n_bytes=4304443, idle_age=33, priority=0 actions=NORMAL cookie=0x0, duration=5.228s, table=0, n_packets=1399, n_bytes=2704718, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=136.447s, table=0, n_packets=0, n_bytes=0, idle_age=136, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=136.447s, table=0, n_packets=0, n_bytes=0, idle_age=136, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=136.447s, table=0, n_packets=251, n_bytes=47238, idle_age=11, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=136.447s, table=0, n_packets=235, n_bytes=353153, idle_age=11, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=59.638s, table=0, n_packets=1070, n_bytes=919977, idle_age=5, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=59.638s, table=0, n_packets=0, n_bytes=0, idle_age=59, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=59.638s, table=0, n_packets=1240, n_bytes=493875, idle_age=5, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0xdeff105, duration=138.192s, table=0, n_packets=0, n_bytes=0, idle_age=138, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=138.192s, table=0, n_packets=0, n_bytes=0, idle_age=138, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=138.192s, table=0, n_packets=251, n_bytes=47238, idle_age=13, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=138.192s, table=0, n_packets=235, n_bytes=353153, idle_age=13, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=61.383s, table=0, n_packets=1091, n_bytes=940135, idle_age=0, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=61.383s, table=0, n_packets=0, n_bytes=0, idle_age=61, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=61.383s, table=0, n_packets=1323, n_bytes=503229, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0xb06186a3a85d964d, duration=61.383s, table=0, n_packets=0, n_bytes=0, idle_age=61, priority=110,tcp,in_port=4,tp_dst=31619 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x208c9ca3e7c6379a, duration=61.383s, table=0, n_packets=0, n_bytes=0, idle_age=61, priority=110,tcp,in_port=4,tp_dst=31926 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) sh-4.4# sh-4.4# sh-4.4# sh-4.4# sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0xdeff105, duration=141.587s, table=0, n_packets=0, n_bytes=0, idle_age=141, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=141.587s, table=0, n_packets=0, n_bytes=0, idle_age=141, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=141.587s, table=0, n_packets=254, n_bytes=47510, idle_age=1, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=141.587s, table=0, n_packets=238, n_bytes=353808, idle_age=1, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=64.778s, table=0, n_packets=1177, n_bytes=1041676, idle_age=0, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=64.778s, table=0, n_packets=0, n_bytes=0, idle_age=64, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=64.778s, table=0, n_packets=1399, n_bytes=548432, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0xb06186a3a85d964d, duration=64.778s, table=0, n_packets=0, n_bytes=0, idle_age=64, priority=110,tcp,in_port=4,tp_dst=31619 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x208c9ca3e7c6379a, duration=64.778s, table=0, n_packets=0, n_bytes=0, idle_age=64, priority=110,tcp,in_port=4,tp_dst=31926 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) sh-4.4# ovs-ofctl add-flow br-ex action=normal sh-4.4# TOOLBOX_IMAGE=registry.fedoraproject.org/f33/fedora-toolbox sh-4.4# ^C sh-4.4# ^C sh-4.4# ^C sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=168.519s, table=0, n_packets=2639, n_bytes=4304443, idle_age=44, priority=0 actions=NORMAL cookie=0x0, duration=3.948s, table=0, n_packets=1171, n_bytes=1831499, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=147.935s, table=0, n_packets=0, n_bytes=0, idle_age=147, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=147.935s, table=0, n_packets=0, n_bytes=0, idle_age=147, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=147.935s, table=0, n_packets=254, n_bytes=47510, idle_age=7, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=147.935s, table=0, n_packets=238, n_bytes=353808, idle_age=7, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=71.126s, table=0, n_packets=1215, n_bytes=1075568, idle_age=4, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=71.126s, table=0, n_packets=0, n_bytes=0, idle_age=71, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=71.126s, table=0, n_packets=1435, n_bytes=556871, idle_age=4, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# systemctl restart NetworkManager sh-4.4# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 8a55c503-4638-4390-b70f-83837e7776ac ovs-interface br-ex br-ex 0f3d0d7b-62b7-4586-8439-2dc678ffe71c ovs-bridge br-ex ovs-if-phys0 6c600c36-6997-40ef-8089-831dfcdea6df ethernet ens5 ovs-port-br-ex a9187acf-8445-4f96-a9c7-a5f5053b9d01 ovs-port br-ex ovs-port-phys0 75223cff-0568-4046-849e-df8c0421dc4a ovs-port ens5 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=184.848s, table=0, n_packets=2639, n_bytes=4304443, idle_age=61, priority=0 actions=NORMAL cookie=0x0, duration=20.277s, table=0, n_packets=4812, n_bytes=9589412, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=164.264s, table=0, n_packets=0, n_bytes=0, idle_age=164, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=164.264s, table=0, n_packets=0, n_bytes=0, idle_age=164, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=164.264s, table=0, n_packets=254, n_bytes=47510, idle_age=24, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=164.264s, table=0, n_packets=238, n_bytes=353808, idle_age=24, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=87.455s, table=0, n_packets=1215, n_bytes=1075568, idle_age=20, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=87.455s, table=0, n_packets=0, n_bytes=0, idle_age=87, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=87.455s, table=0, n_packets=1435, n_bytes=556871, idle_age=20, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=187.150s, table=0, n_packets=2639, n_bytes=4304443, idle_age=63, priority=0 actions=NORMAL cookie=0x0, duration=22.579s, table=0, n_packets=4984, n_bytes=9714213, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=166.566s, table=0, n_packets=0, n_bytes=0, idle_age=166, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=166.566s, table=0, n_packets=0, n_bytes=0, idle_age=166, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=166.566s, table=0, n_packets=254, n_bytes=47510, idle_age=26, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=166.566s, table=0, n_packets=238, n_bytes=353808, idle_age=26, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=89.757s, table=0, n_packets=1215, n_bytes=1075568, idle_age=22, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=89.757s, table=0, n_packets=0, n_bytes=0, idle_age=89, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=89.757s, table=0, n_packets=1435, n_bytes=556871, idle_age=22, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=189.304s, table=0, n_packets=2639, n_bytes=4304443, idle_age=65, priority=0 actions=NORMAL cookie=0x0, duration=24.733s, table=0, n_packets=5249, n_bytes=9870828, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=168.720s, table=0, n_packets=0, n_bytes=0, idle_age=168, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=168.720s, table=0, n_packets=0, n_bytes=0, idle_age=168, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=168.720s, table=0, n_packets=254, n_bytes=47510, idle_age=28, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=168.720s, table=0, n_packets=238, n_bytes=353808, idle_age=28, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=91.911s, table=0, n_packets=1215, n_bytes=1075568, idle_age=24, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=91.911s, table=0, n_packets=0, n_bytes=0, idle_age=91, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=91.911s, table=0, n_packets=1435, n_bytes=556871, idle_age=24, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=190.462s, table=0, n_packets=2639, n_bytes=4304443, idle_age=66, priority=0 actions=NORMAL cookie=0x0, duration=25.891s, table=0, n_packets=5394, n_bytes=9945465, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=169.878s, table=0, n_packets=0, n_bytes=0, idle_age=169, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=169.878s, table=0, n_packets=0, n_bytes=0, idle_age=169, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=169.878s, table=0, n_packets=254, n_bytes=47510, idle_age=29, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=169.878s, table=0, n_packets=238, n_bytes=353808, idle_age=29, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=93.069s, table=0, n_packets=1215, n_bytes=1075568, idle_age=26, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=93.069s, table=0, n_packets=0, n_bytes=0, idle_age=93, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=93.069s, table=0, n_packets=1435, n_bytes=556871, idle_age=26, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0x0, duration=191.329s, table=0, n_packets=2639, n_bytes=4304443, idle_age=67, priority=0 actions=NORMAL cookie=0x0, duration=26.758s, table=0, n_packets=5508, n_bytes=10065032, idle_age=0, actions=NORMAL cookie=0xdeff105, duration=170.745s, table=0, n_packets=0, n_bytes=0, idle_age=170, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=170.745s, table=0, n_packets=0, n_bytes=0, idle_age=170, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=170.745s, table=0, n_packets=254, n_bytes=47510, idle_age=30, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=170.745s, table=0, n_packets=238, n_bytes=353808, idle_age=30, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=93.936s, table=0, n_packets=1215, n_bytes=1075568, idle_age=26, priority=205,udp,in_port=4,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=93.936s, table=0, n_packets=0, n_bytes=0, idle_age=93, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=93.936s, table=0, n_packets=1435, n_bytes=556871, idle_age=26, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 sh-4.4# systemctl restart NetworkManager sh-4.4# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 8a55c503-4638-4390-b70f-83837e7776ac ovs-interface br-ex br-ex 0f3d0d7b-62b7-4586-8439-2dc678ffe71c ovs-bridge br-ex ovs-if-phys0 6c600c36-6997-40ef-8089-831dfcdea6df ethernet ens5 ovs-port-br-ex a9187acf-8445-4f96-a9c7-a5f5053b9d01 ovs-port br-ex ovs-port-phys0 75223cff-0568-4046-849e-df8c0421dc4a ovs-port ens5 sh-4.4# ovs-ofctl dump-flows br-ex | head NXST_FLOW reply (xid=0x4): cookie=0xdeff105, duration=179.125s, table=0, n_packets=0, n_bytes=0, idle_age=179, priority=500,ip,in_port=3,nw_src=10.0.148.33,nw_dst=169.254.169.2 actions=ct(commit,table=4,zone=64001,nat(dst=10.0.148.33)) cookie=0xdeff105, duration=179.125s, table=0, n_packets=0, n_bytes=0, idle_age=179, priority=500,ip,in_port=LOCAL,nw_dst=169.254.169.1 actions=ct(table=5,zone=64002,nat) cookie=0xdeff105, duration=179.125s, table=0, n_packets=323, n_bytes=59269, idle_age=0, priority=500,ip,in_port=LOCAL,nw_dst=172.30.0.0/16 actions=ct(commit,table=2,zone=64001,nat(src=169.254.169.2)) cookie=0xdeff105, duration=179.125s, table=0, n_packets=310, n_bytes=551240, idle_age=0, priority=500,ip,in_port=3,nw_src=172.30.0.0/16,nw_dst=169.254.169.2 actions=ct(table=3,zone=64001,nat) cookie=0xdeff105, duration=3.560s, table=0, n_packets=71, n_bytes=76497, idle_age=0, priority=205,udp,in_port=6,dl_dst=0e:b5:e0:d1:73:1d,tp_dst=6081 actions=LOCAL cookie=0xb06186a3a85d964d, duration=3.560s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=110,tcp,in_port=6,tp_dst=31619 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x208c9ca3e7c6379a, duration=3.560s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=110,tcp,in_port=6,tp_dst=31926 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=3.560s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=200,udp,in_port=6,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=3.560s, table=0, n_packets=1494, n_bytes=569715, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:6 sh-4.4# systemctl restart NetworkManager --- > disconnected ~~~
After lots of banging my head, I think I found the offending flow: ~~~ sh-4.4# ovs-ofctl dump-flows br-ex | grep priority=100 cookie=0xdeff105, duration=588.277s, table=0, n_packets=27796, n_bytes=3028594, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:1 cookie=0xdeff105, duration=588.277s, table=0, n_packets=14225, n_bytes=6544765, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),output:1 cookie=0xdeff105, duration=588.277s, table=1, n_packets=34152, n_bytes=105626161, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=588.277s, table=1, n_packets=15418, n_bytes=41699230, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=588.277s, table=1, n_packets=0, n_bytes=0, idle_age=588, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=588.277s, table=1, n_packets=0, n_bytes=0, idle_age=588, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL sh-4.4# ovs-ofctl del-flows --strict br-ex 'priority=100,ip,in_port=LOCAL' sh-4.4# ovs-ofctl dump-flows br-ex | grep priority=100 cookie=0xdeff105, duration=622.159s, table=0, n_packets=29255, n_bytes=3172329, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:1 cookie=0xdeff105, duration=622.159s, table=1, n_packets=36001, n_bytes=111509524, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=622.159s, table=1, n_packets=16339, n_bytes=41878382, idle_age=1, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=622.159s, table=1, n_packets=0, n_bytes=0, idle_age=622, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=622.159s, table=1, n_packets=0, n_bytes=0, idle_age=622, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL sh-4.4# systemctl restart NetworkManager sh-4.4# ~~~
Likely here: ~~~ 1018 // table 0, packets coming from host Commit connections with ct_mark ctMarkHost 1019 // so that reverse direction goes back to the host. 1020 dftFlows = append(dftFlows, 1021 fmt.Sprintf("cookie=%s, priority=100, in_port=%s, ip, "+ 1022 "actions=ct(commit, zone=%d, exec(set_field:%s->ct_mark)), output:%s", 1023 defaultOpenFlowCookie, ofPortHost, config.Default.ConntrackZone, ctMarkHost, ofPortPhys)) 1024 1025 // table 0, packets coming from external. Send it through conntrack and 1026 // resubmit to table 1 to know the state and mark of the connection. 1027 dftFlows = append(dftFlows, NORMAL ᚠ 5e1ad49 pkg/node/gateway_shared_intf.go ~~~ I'll follow up tomorrow.
Example of what is happening: #################################################################### Let's look at the issue on a specific node in detail. Before NetworkManager restart - we can see that flows from LOCAL were programmed to have actions=output:2, that is port ens5: ~~~ sh-4.4# ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110' | grep ct cookie=0x7d956a9aff1e82d9, duration=806.004s, table=0, n_packets=0, n_bytes=0, idle_age=806, priority=110,tcp,in_port=2,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=806.004s, table=0, n_packets=559, n_bytes=72898, idle_age=6, priority=110,tcp,in_port=2,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=806.004s, table=0, n_packets=0, n_bytes=0, idle_age=806, priority=110,tcp,in_port=3,tp_src=30906 actions=output:2 cookie=0xa9f3b7acc211a828, duration=806.004s, table=0, n_packets=460, n_bytes=159705, idle_age=6, priority=110,tcp,in_port=3,tp_src=32648 actions=output:2 cookie=0xdeff105, duration=1001.391s, table=0, n_packets=48851, n_bytes=5208677, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:2 # this flow here cookie=0xdeff105, duration=1001.391s, table=0, n_packets=44786, n_bytes=9825273, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),output:2 # flow above cookie=0xdeff105, duration=1001.391s, table=1, n_packets=54915, n_bytes=176089659, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1001.391s, table=1, n_packets=130997, n_bytes=625065266, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1001.391s, table=1, n_packets=0, n_bytes=0, idle_age=1001, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1001.391s, table=1, n_packets=0, n_bytes=0, idle_age=1001, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL sh-4.4# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 2(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 3(patch-br-ex_ip-): addr:ae:71:7f:b8:a9:8b config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 sh-4.4# ip a ls dev br-ex 7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0e:5e:00:0a:14:47 brd ff:ff:ff:ff:ff:ff inet 10.0.155.59/20 brd 10.0.159.255 scope global dynamic noprefixroute br-ex valid_lft 2296sec preferred_lft 2296sec inet6 fe80::3350:966d:18ae:143e/64 scope link noprefixroute valid_lft forever preferred_lft forever ~~~ After NetworkManager restart - note how the flows still point out port number 2, which does not exist any more because NM unplugged and replugged all ports out of/into the bridge: ~~~ [root@ip-10-0-155-59 tmp]# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 3(patch-br-ex_ip-): addr:ae:71:7f:b8:a9:8b config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 4(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 [root@ip-10-0-155-59 tmp]# ip a ls dev br-ex 24: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 0e:5e:00:0a:14:47 brd ff:ff:ff:ff:ff:ff inet6 fe80::3350:966d:18ae:143e/64 scope link noprefixroute valid_lft forever preferred_lft forever [root@ip-10-0-155-59 tmp]# ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110' | grep ct cookie=0x7d956a9aff1e82d9, duration=1149.827s, table=0, n_packets=0, n_bytes=0, idle_age=1149, priority=110,tcp,in_port=2,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=1149.827s, table=0, n_packets=920, n_bytes=118853, idle_age=58, priority=110,tcp,in_port=2,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=1149.827s, table=0, n_packets=0, n_bytes=0, idle_age=1149, priority=110,tcp,in_port=3,tp_src=30906 actions=output:2 cookie=0xa9f3b7acc211a828, duration=1149.827s, table=0, n_packets=754, n_bytes=326659, idle_age=0, priority=110,tcp,in_port=3,tp_src=32648 actions=output:2 cookie=0xdeff105, duration=1345.214s, table=0, n_packets=64333, n_bytes=6926605, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:2 # this flow here cookie=0xdeff105, duration=1345.214s, table=0, n_packets=48981, n_bytes=13472381, idle_age=2, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),output:2 # flow above cookie=0xdeff105, duration=1345.214s, table=1, n_packets=72044, n_bytes=228122244, idle_age=57, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1345.214s, table=1, n_packets=135125, n_bytes=626407905, idle_age=57, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1345.214s, table=1, n_packets=0, n_bytes=0, idle_age=1345, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1345.214s, table=1, n_packets=0, n_bytes=0, idle_age=1345, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL ~~~ And NM cannot get the IP address, because the packets never make it to ens5 due to the now broken flows: ~~~ [root@ip-10-0-155-59 tmp]# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 2fa1deca-aed4-4b54-b96e-7114f3c135b3 ovs-interface br-ex # <--- yellow, all others are green br-ex 4ff63c37-b94d-40a0-a084-c8a4e28ce549 ovs-bridge br-ex ovs-if-phys0 d8643a0f-f9f6-497b-a2ae-81160e0b43ee ethernet ens5 ovs-port-br-ex cb60e9a7-9aa3-4a45-9397-d86b311cb326 ovs-port br-ex ovs-port-phys0 32c84cb1-7023-41ac-8f9e-d9780ad74057 ovs-port ens5 Wired connection 1 67ebaae8-b025-361b-aafc-dd1061c2a4ba ethernet -- ~~~ A tcpdump at this point in time does show DHCP requests on br-ex, but they never make it to ens5 --- logically: ~~~ sh-5.0# tcpdump -nne -i br-ex port 68 or port 67 dropped privs to tcpdump tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on br-ex, link-type EN10MB (Ethernet), capture size 262144 bytes 22:56:06.552361 0e:b5:e0:d1:73:1d > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 324: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 0e:b5:e0:d1:73:1d, length 282 22:56:10.640169 0e:b5:e0:d1:73:1d > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 324: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 0e:b5:e0:d1:73:1d, length 282 ^C 2 packets captured 2 packets received by filter 0 packets dropped by kernel sh-5.0# timeout 90 tcpdump -nne -i ens5 port 68 or port 67 dropped privs to tcpdump tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on ens5, link-type EN10MB (Ethernet), capture size 262144 bytes 0 packets captured 0 packets received by filter 0 packets dropped by kernel ~~~ Now, upon deletion of the flow in question, DHCP will work again: ~~~ [root@ip-10-0-155-59 tmp]# ]# ovs-ofctl del-flows --strict br-ex 'priority=100,ip,in_port=LOCAL' [root@ip-10-0-155-59 tmp]# ]# ovs-ofctl dump-flows br-ex | egrep 'priority=100|p cookie=0x7d956a9aff1e82d9, duration=1248.874s, table=0, n_packets=0, n_bytes=0, idle_age=1248, priority=110,tcp,in_port=2,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=1248.874s, table=0, n_packets=920, n_bytes=118853, idle_age=157, priority=110,tcp,in_port=2,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=1248.874s, table=0, n_packets=0, n_bytes=0, idle_age=1248, priority=110,tcp,in_port=3,tp_src=30906 actions=output:2 cookie=0xa9f3b7acc211a828, duration=1248.875s, table=0, n_packets=756, n_bytes=326791, idle_age=68, priority=110,tcp,in_port=3,tp_src=32648 actions=output:2 cookie=0xdeff105, duration=1444.262s, table=0, n_packets=68235, n_bytes=7254084, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:2 cookie=0xdeff105, duration=1444.262s, table=1, n_packets=72044, n_bytes=228122244, idle_age=156, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1444.262s, table=1, n_packets=135125, n_bytes=626407905, idle_age=156, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1444.262s, table=1, n_packets=0, n_bytes=0, idle_age=1444, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1444.262s, table=1, n_packets=0, n_bytes=0, idle_age=1444, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL [root@ip-10-0-155-59 tmp]# ~~~ Network manager can connect and get an IP address: ~~~ [root@ip-10-0-155-59 tmp]# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 2fa1deca-aed4-4b54-b96e-7114f3c135b3 ovs-interface br-ex # <---- green br-ex 4ff63c37-b94d-40a0-a084-c8a4e28ce549 ovs-bridge br-ex ovs-if-phys0 d8643a0f-f9f6-497b-a2ae-81160e0b43ee ethernet ens5 ovs-port-br-ex cb60e9a7-9aa3-4a45-9397-d86b311cb326 ovs-port br-ex ovs-port-phys0 32c84cb1-7023-41ac-8f9e-d9780ad74057 ovs-port ens5 Wired connection 1 67ebaae8-b025-361b-aafc-dd1061c2a4ba ethernet -- ~~~ After a couple of minutes, ovnkube will reconfigure the flows (look at the output ports): ~~~ [root@ip-10-0-155-59 tmp]# ovs-ofctl dump-flows br-ex | egrep 'priority=100|prior everse-i-search)`dump-flow': ovs-ofctl cookie=0x7d956a9aff1e82d9, duration=18.791s, table=0, n_packets=0, n_bytes=0, idle_age=18, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=18.791s, table=0, n_packets=0, n_bytes=0, idle_age=18, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=18.791s, table=0, n_packets=0, n_bytes=0, idle_age=1296, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xa9f3b7acc211a828, duration=18.791s, table=0, n_packets=756, n_bytes=326791, idle_age=115, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0xdeff105, duration=18.791s, table=0, n_packets=73693, n_bytes=7969151, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=18.791s, table=0, n_packets=840, n_bytes=466458, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=1491.452s, table=1, n_packets=77346, n_bytes=244035518, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1491.452s, table=1, n_packets=135824, n_bytes=629032462, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1491.452s, table=1, n_packets=0, n_bytes=0, idle_age=1491, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1491.452s, table=1, n_packets=0, n_bytes=0, idle_age=1491, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL ~~~ Root cause ################################################################### The problem here is that ens5 is disconnected from the OVS bridge, and that ens5's internal port ID inside OVS changes. However, OVS flows are programmed by port ID with the exception of the LOCAL match ((the bridge's internal port). So, the flows that were programmed earlier to steer packets coming from in_port=LOCAL to out_port=x will break after a NetworkManager restart, because ens5 is no longer at OVS internal id x but at a new ID y. In my opinion, the main problem is that NetworkManager upon a restart deletes all interface <-> OVS connections. NetworkManager is designed to simply pull out and replug everything. Instead, it should shutdown the linux ports, and keep the interfaces which are connected to the OVS bridges shutdown instead of ripping them out and plugging them in completely. Just as an example for the "power" that NM has: NetworkManager rips out the internal bridge interface when it restarts. That is something that administrators **cannot** do. For an admin, it's impossible to do `ip link del br-ex`, but NetworkManager achieves that feat and adds a new internal port to OVS bridge br-ex (the interface index increases to a new value (as shown by ip link ls dev br-ex), clearly showing that the OVS internal interface is removed from the kernel, then readded). But that's actually not the problem in this specific case, so let's focus on the essential. I'd argue that pulling all interfaces from an OVS bridge and plugging them back in on a running system is an invalid operation for OVS and should not be done. This IMO clearly is a NetworkManager bug or design problem. Removing and recreating ports that were plugged into OVS (not just shutting them down) has the potential to lead to issues. When an interface is removed and readded to OVS, the OVS interface ID will change as well, thus wreaking havoc to installed OVS flows which identify the in/out port via OF port id. Again, I personally believe that this is a NetworkManager design issue and that it must be addressed in NetworkManager eventually. NetworkManager runs totally against administrator intuition here and unnecessarily pulls out and replugs ports into OVS. An NM restart should *not* lead to such an action With that said, the issue can only harm us due to the fact that ovnkube-node programs OVS flows and connects to the rest of the cluster via br-ex. So, our controller uses the dataplane that it programs itself, leading to chicken/egg style issues when flows are broken. When the node sends out a DHCP request to get its IP address after a NetworkManager restart, the DHCP request is steered into the non-existing out_port=x and is dropped. The node never gets an IP address. Without an IP address, ovnkube-node cannot contact its control plane. And without that, it will not reprogram the OpenFlow flows to match the new port numbers.. However, with regards to the symptoms that we are seeing, and as to why we are only observing them as of OCP 4.10: packets outbound from br-ex (in_port=LOCAL) used to hit action=normal and would go out any port, regardless of its port_id. A quick comparison of OCP 4.9 to 4.10 for the flows in question shows: OpenShift 4.9 (logic for IPv4 shown only): ~~~ func commonFlows(ofPortPhys, bridgeMacAddress, ofPortPatch string) []string { var dftFlows []string maxPktLength := getMaxFrameLength() // table 0, we check to see if this dest mac is the shared mac, if so flood to both ports dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=10, table=0, in_port=%s, dl_dst=%s, actions=output:%s,output:LOCAL", defaultOpenFlowCookie, ofPortPhys, bridgeMacAddress, ofPortPatch)) if config.IPv4Mode { // table 0, packets coming from pods headed externally. Commit connections // so that reverse direction goes back to the pods. dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, in_port=%s, ip, "+ "actions=ct(commit, zone=%d), output:%s", defaultOpenFlowCookie, ofPortPatch, config.Default.ConntrackZone, ofPortPhys)) // table 0, packets coming from external. Send it through conntrack and // resubmit to table 1 to know the state of the connection. dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=50, in_port=%s, ip, "+ "actions=ct(zone=%d, table=1)", defaultOpenFlowCookie, ofPortPhys, config.Default.ConntrackZone)) } ~~~ OpenShift 4.10 (logic for IPv4 shown only): ~~~ if config.IPv4Mode { // table 0, packets coming from pods headed externally. Commit connections with ct_mark ctMarkOVN // so that reverse direction goes back to the pods. dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, in_port=%s, ip, "+ "actions=ct(commit, zone=%d, exec(set_field:%s->ct_mark)), output:%s", defaultOpenFlowCookie, ofPortPatch, config.Default.ConntrackZone, ctMarkOVN, ofPortPhys)) // table 0, packets coming from host Commit connections with ct_mark ctMarkHost // so that reverse direction goes back to the host. dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, in_port=%s, ip, "+ "actions=ct(commit, zone=%d, exec(set_field:%s->ct_mark)), output:%s", defaultOpenFlowCookie, ofPortHost, config.Default.ConntrackZone, ctMarkHost, ofPortPhys)) // table 0, packets coming from external. Send it through conntrack and // resubmit to table 1 to know the state and mark of the connection. dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=50, in_port=%s, ip, "+ "actions=ct(zone=%d, table=1)", defaultOpenFlowCookie, ofPortPhys, config.Default.ConntrackZone)) } ~~~ The commit that introduced the problematic flows is: https://github.com/ovn-org/ovn-kubernetes/commit/126b132f863bcc86fb06e1557cd54a46c4ad9b02 ~~~ Shared gateway use ct_mark to classify host/ovn traffic Today, openflow rules on the shared gateway bridge send incoming traffic from uplink to conntrack on zone 64000 to decide if the traffic needs to be steered towards OVN or the host. established/related connections are sent to OVN via the patch port to OVN integration bridge while new connections are set to the host. This poses an issue for NICs that are capable of offloading conneciton tracking rules as only established connections are offloadable. This commit modifies the openflow rules on the physical bridge to allow the use of ct_mark on ct zone 64000 to classify traffic going to host and OVN - table 0: set up higher priority rule (205/200) to steer tunneled traffic based on udp encap port - table 0: commit traffic coming out of ovn (patch to br-int) to zone 64000 with ct_mark=0x1 - table 0: commit traffic coming from host (LOCAL) to zone 64000 with ct_mark=0x2 - table 1: send established/related connections on zone 64000 to either OVN or host based on ct_mark ~~~ That means before that commit, the host could still send a DHCP request and it hit the normal action and thus was flooded out all ports on OVS bridge br-ex. The host thus could get an IP address and eventually fix the broken flows. Since that commit, however, outbound host flows are marked with ct_mark=0x2 and are steered explicitly out of ofPortHost (our out_port=x) instead of having action=normal. When network manager restarts, the flows remains the same, but the OVS ID of ens5 is not port=y, and thus the flow does not match ens5 any more: ~~~ if config.IPv4Mode { (...) // table 1, established and related connections in zone 64000 with ct_mark ctMarkHost go to host dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+est, ct_mark=%s, "+ "actions=output:%s", defaultOpenFlowCookie, ctMarkHost, ofPortHost)) dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+rel, ct_mark=%s, "+ "actions=output:%s", defaultOpenFlowCookie, ctMarkHost, ofPortHost)) (...) ~~~ The return flows matching that mark ct_mark=0x2 are then steered with the following flows, to LOCAL: ~~~ (...) if config.IPv4Mode { // table 1, established and related connections in zone 64000 with ct_mark ctMarkOVN go to OVN dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+est, ct_mark=%s, "+ "actions=%s", defaultOpenFlowCookie, ctMarkOVN, actions)) dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+rel, ct_mark=%s, "+ "actions=%s", defaultOpenFlowCookie, ctMarkOVN, actions)) // table 1, established and related connections in zone 64000 with ct_mark ctMarkHost go to host dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+est, ct_mark=%s, "+ "actions=output:%s", defaultOpenFlowCookie, ctMarkHost, ofPortHost)) dftFlows = append(dftFlows, fmt.Sprintf("cookie=%s, priority=100, table=1, ip, ct_state=+trk+rel, ct_mark=%s, "+ "actions=output:%s", defaultOpenFlowCookie, ctMarkHost, ofPortHost)) (...) ~~~ Possible solutions: ########################################################################## Short term: We can hence fix this on the ovnkube side by creating a bypass flow specifically for DHCP traffic (a), or by modifying the action to normal and by having normal switch logic handle this for outbound packets from the host (b). The return traffic will then either hit the NORMAL flow (a), or it will match the conntrack return flow with actions=LOCAL as the out port (b). Long term: Fix NetworkManager.
Here's a fix for this specific issue: https://github.com/ovn-org/ovn-kubernetes/pull/2785 With that fix in OCP 4.10, NetworkManager can be safely restarted: ~~~ sh-4.4# ovs-ofctl dump-flows br-ex | grep LOCAL | grep priority=100 cookie=0xdeff105, duration=1029.107s, table=0, n_packets=177772, n_bytes=150655268, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=12824.653s, table=1, n_packets=354589, n_bytes=1148457012, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=12824.653s, table=1, n_packets=0, n_bytes=0, idle_age=12824, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL sh-4.4# systemctl restart NetworkManager sh-4.4# sh-4.4# nmcli conn NAME UUID TYPE DEVICE ovs-if-br-ex 2fa1deca-aed4-4b54-b96e-7114f3c135b3 ovs-interface br-ex br-ex 4ff63c37-b94d-40a0-a084-c8a4e28ce549 ovs-bridge br-ex ovs-if-phys0 d8643a0f-f9f6-497b-a2ae-81160e0b43ee ethernet ens5 ovs-port-br-ex cb60e9a7-9aa3-4a45-9397-d86b311cb326 ovs-port br-ex ovs-port-phys0 32c84cb1-7023-41ac-8f9e-d9780ad74057 ovs-port ens5 Wired connection 1 67ebaae8-b025-361b-aafc-dd1061c2a4ba ethernet -- ~~~ There's a connected issue though and that again has to do with the port changes. Upon NetworkManager restart, traffic to pods / to OVN might break until the new flows are actually installed on br-ex. That issue can be shown with: ~~~ ovs-ofctl show br-ex ovs-ofctl dump-flows br-ex | egrep 'in_port=2|output:2|in_port=4|output:4' systemctl restart NetworkManager sleep 5 ovs-ofctl show br-ex ovs-ofctl dump-flows br-ex | egrep 'in_port=2|output:2|in_port=4|output:4' for i in {0..10}; do sleep 5; date; ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110'; done ~~~ ~~~ sh-4.4# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 2(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 3(patch-br-ex_ip-): addr:5e:f1:2f:35:7d:ef config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 sh-4.4# ovs-ofctl dump-flows br-ex | egrep 'in_port=2|output:2|in_port=4|output:4' cookie=0xdeff105, duration=970.121s, table=0, n_packets=25139, n_bytes=23707979, idle_age=0, priority=205,udp,in_port=2,dl_dst=0e:5e:00:0a:14:47,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=970.121s, table=0, n_packets=0, n_bytes=0, idle_age=970, priority=200,udp,in_port=2,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=970.121s, table=0, n_packets=27462, n_bytes=46434632, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:2 cookie=0x7d956a9aff1e82d9, duration=970.121s, table=0, n_packets=7, n_bytes=632, idle_age=371, priority=110,tcp,in_port=2,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=970.121s, table=0, n_packets=596, n_bytes=75460, idle_age=23, priority=110,tcp,in_port=2,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=970.121s, table=0, n_packets=6, n_bytes=3041, idle_age=371, priority=110,tcp,in_port=3,tp_src=30906 actions=output:2 cookie=0xa9f3b7acc211a828, duration=970.121s, table=0, n_packets=480, n_bytes=272857, idle_age=23, priority=110,tcp,in_port=3,tp_src=32648 actions=output:2 cookie=0xdeff105, duration=970.121s, table=0, n_packets=44893, n_bytes=4918796, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:2 cookie=0xdeff105, duration=970.121s, table=0, n_packets=73368, n_bytes=198575001, idle_age=0, priority=50,ip,in_port=2 actions=ct(table=1,zone=64000) cookie=0xdeff105, duration=970.121s, table=0, n_packets=1, n_bytes=42, idle_age=969, priority=10,in_port=2,dl_dst=0e:5e:00:0a:14:47 actions=output:3,LOCAL cookie=0xdeff105, duration=970.121s, table=1, n_packets=0, n_bytes=0, idle_age=970, priority=13,udp,in_port=2,tp_dst=3784 actions=output:3,LOCAL sh-4.4# systemctl restart NetworkManager sh-4.4# sleep 5 sh-4.4# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 3(patch-br-ex_ip-): addr:5e:f1:2f:35:7d:ef config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 4(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 sh-4.4# ovs-ofctl dump-flows br-ex | egrep 'in_port=2|output:2|in_port=4|output:4' cookie=0xdeff105, duration=975.604s, table=0, n_packets=25139, n_bytes=23707979, idle_age=5, priority=205,udp,in_port=2,dl_dst=0e:5e:00:0a:14:47,tp_dst=6081 actions=LOCAL cookie=0xdeff105, duration=975.604s, table=0, n_packets=0, n_bytes=0, idle_age=975, priority=200,udp,in_port=2,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=975.604s, table=0, n_packets=27563, n_bytes=46450303, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:2 cookie=0x7d956a9aff1e82d9, duration=975.604s, table=0, n_packets=7, n_bytes=632, idle_age=377, priority=110,tcp,in_port=2,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=975.604s, table=0, n_packets=596, n_bytes=75460, idle_age=28, priority=110,tcp,in_port=2,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=975.604s, table=0, n_packets=6, n_bytes=3041, idle_age=377, priority=110,tcp,in_port=3,tp_src=30906 actions=output:2 cookie=0xa9f3b7acc211a828, duration=975.604s, table=0, n_packets=480, n_bytes=272857, idle_age=28, priority=110,tcp,in_port=3,tp_src=32648 actions=output:2 cookie=0xdeff105, duration=975.604s, table=0, n_packets=45063, n_bytes=4957973, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:2 cookie=0xdeff105, duration=975.604s, table=0, n_packets=73504, n_bytes=198862321, idle_age=4, priority=50,ip,in_port=2 actions=ct(table=1,zone=64000) cookie=0xdeff105, duration=975.604s, table=0, n_packets=1, n_bytes=42, idle_age=975, priority=10,in_port=2,dl_dst=0e:5e:00:0a:14:47 actions=output:3,LOCAL cookie=0xdeff105, duration=975.604s, table=1, n_packets=0, n_bytes=0, idle_age=975, priority=13,udp,in_port=2,tp_dst=3784 actions=output:3,LOCAL sh-4.4# for i in {0..10}; do sleep 5; date; ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110'; done Tue Feb 1 14:27:07 UTC 2022 cookie=0xa9f3b7acc211a828, duration=2.394s, table=0, n_packets=16, n_bytes=1676, idle_age=0, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=2.394s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=2.394s, table=0, n_packets=491, n_bytes=280481, idle_age=0, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=2.394s, table=0, n_packets=6, n_bytes=3041, idle_age=383, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=981.790s, table=0, n_packets=19768, n_bytes=12769972, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=2.394s, table=0, n_packets=45734, n_bytes=5036639, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=981.790s, table=1, n_packets=55338, n_bytes=161809601, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=981.790s, table=1, n_packets=18915, n_bytes=39471177, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=981.790s, table=1, n_packets=0, n_bytes=0, idle_age=981, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=981.790s, table=1, n_packets=0, n_bytes=0, idle_age=981, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL ^C ~~~ ~~~ ovs-ofctl show br-ex ovs-ofctl dump-flows br-ex | egrep 'in_port=4|output:4|in_port=5|output:5' systemctl restart NetworkManager sleep 5 ovs-ofctl show br-ex ovs-ofctl dump-flows br-ex | egrep 'in_port=4|output:4|in_port=5|output:5' for i in {0..10}; do sleep 5; date; ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110'; done ~~~ It depends a bit on the timing, but this can potentially last quite a long time: ~~~ sh-4.4# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 3(patch-br-ex_ip-): addr:5e:f1:2f:35:7d:ef config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 4(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 sh-4.4# ovs-ofctl dump-flows br-ex | egrep 'in_port=4|output:4|in_port=5|output:5' cookie=0xdeff105, duration=94.075s, table=0, n_packets=2815, n_bytes=2644550, idle_age=0, priority=205,udp,in_port=4,dl_dst=0e:5e:00:0a:14:47,tp_dst=6081 actions=LOCAL cookie=0xa9f3b7acc211a828, duration=94.075s, table=0, n_packets=112, n_bytes=13715, idle_age=5, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=94.075s, table=0, n_packets=0, n_bytes=0, idle_age=94, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=94.075s, table=0, n_packets=30780, n_bytes=52486447, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=94.075s, table=0, n_packets=0, n_bytes=0, idle_age=94, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=94.075s, table=0, n_packets=565, n_bytes=322054, idle_age=5, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=94.075s, table=0, n_packets=6, n_bytes=3041, idle_age=474, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=94.075s, table=0, n_packets=50248, n_bytes=5499600, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=94.075s, table=0, n_packets=9312, n_bytes=29858394, idle_age=0, priority=50,ip,in_port=4 actions=ct(table=1,zone=64000) cookie=0xdeff105, duration=94.075s, table=0, n_packets=0, n_bytes=0, idle_age=94, priority=10,in_port=4,dl_dst=0e:5e:00:0a:14:47 actions=output:3,LOCAL cookie=0xdeff105, duration=94.075s, table=1, n_packets=0, n_bytes=0, idle_age=94, priority=13,udp,in_port=4,tp_dst=3784 actions=output:3,LOCAL sh-4.4# systemctl restart NetworkManager sh-4.4# sleep 5 sh-4.4# ovs-ofctl show br-ex OFPT_FEATURES_REPLY (xid=0x2): dpid:00000e5e000a1447 n_tables:254, n_buffers:0 capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst 3(patch-br-ex_ip-): addr:5e:f1:2f:35:7d:ef config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max 5(ens5): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max LOCAL(br-ex): addr:0e:5e:00:0a:14:47 config: 0 state: 0 speed: 0 Mbps now, 0 Mbps max OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0 sh-4.4# ovs-ofctl dump-flows br-ex | egrep 'in_port=4|output:4|in_port=5|output:5' cookie=0xdeff105, duration=99.381s, table=0, n_packets=2817, n_bytes=2646678, idle_age=5, priority=205,udp,in_port=4,dl_dst=0e:5e:00:0a:14:47,tp_dst=6081 actions=LOCAL cookie=0xa9f3b7acc211a828, duration=99.381s, table=0, n_packets=112, n_bytes=13715, idle_age=10, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=99.381s, table=0, n_packets=0, n_bytes=0, idle_age=99, priority=200,udp,in_port=4,tp_dst=6081 actions=NORMAL cookie=0xdeff105, duration=99.381s, table=0, n_packets=30915, n_bytes=52537336, idle_age=0, priority=200,udp,in_port=LOCAL,tp_dst=6081 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=99.381s, table=0, n_packets=0, n_bytes=0, idle_age=99, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=99.381s, table=0, n_packets=565, n_bytes=322054, idle_age=10, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=99.381s, table=0, n_packets=6, n_bytes=3041, idle_age=480, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=99.381s, table=0, n_packets=50378, n_bytes=5521679, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=99.381s, table=0, n_packets=9387, n_bytes=29996178, idle_age=5, priority=50,ip,in_port=4 actions=ct(table=1,zone=64000) cookie=0xdeff105, duration=99.381s, table=0, n_packets=0, n_bytes=0, idle_age=99, priority=10,in_port=4,dl_dst=0e:5e:00:0a:14:47 actions=output:3,LOCAL cookie=0xdeff105, duration=99.381s, table=1, n_packets=0, n_bytes=0, idle_age=99, priority=13,udp,in_port=4,tp_dst=3784 actions=output:3,LOCAL sh-4.4# for i in {0..10}; do sleep 5; date; ovs-ofctl dump-flows br-ex | egrep 'priority=100|priority=110'; done Tue Feb 1 14:28:51 UTC 2022 cookie=0xa9f3b7acc211a828, duration=106.011s, table=0, n_packets=112, n_bytes=13715, idle_age=17, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=106.011s, table=0, n_packets=0, n_bytes=0, idle_age=106, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=106.011s, table=0, n_packets=565, n_bytes=322054, idle_age=17, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=106.011s, table=0, n_packets=6, n_bytes=3041, idle_age=486, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=1085.407s, table=0, n_packets=22996, n_bytes=14260812, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=106.011s, table=0, n_packets=50574, n_bytes=5562731, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=1085.407s, table=1, n_packets=60870, n_bytes=178000227, idle_age=11, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1085.407s, table=1, n_packets=21842, n_bytes=50374376, idle_age=11, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1085.407s, table=1, n_packets=0, n_bytes=0, idle_age=1085, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1085.407s, table=1, n_packets=0, n_bytes=0, idle_age=1085, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL Tue Feb 1 14:28:56 UTC 2022 cookie=0xa9f3b7acc211a828, duration=111.016s, table=0, n_packets=112, n_bytes=13715, idle_age=22, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=111.016s, table=0, n_packets=0, n_bytes=0, idle_age=111, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=111.016s, table=0, n_packets=565, n_bytes=322054, idle_age=22, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=111.016s, table=0, n_packets=6, n_bytes=3041, idle_age=491, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=1090.412s, table=0, n_packets=23167, n_bytes=14312909, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=111.016s, table=0, n_packets=50730, n_bytes=5612034, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=1090.412s, table=1, n_packets=60870, n_bytes=178000227, idle_age=16, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1090.412s, table=1, n_packets=21842, n_bytes=50374376, idle_age=16, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1090.412s, table=1, n_packets=0, n_bytes=0, idle_age=1090, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1090.412s, table=1, n_packets=0, n_bytes=0, idle_age=1090, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL Tue Feb 1 14:29:01 UTC 2022 cookie=0xa9f3b7acc211a828, duration=116.027s, table=0, n_packets=112, n_bytes=13715, idle_age=27, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=116.027s, table=0, n_packets=0, n_bytes=0, idle_age=116, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=116.027s, table=0, n_packets=565, n_bytes=322054, idle_age=27, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=116.027s, table=0, n_packets=6, n_bytes=3041, idle_age=496, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=1095.423s, table=0, n_packets=23240, n_bytes=14322781, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=116.027s, table=0, n_packets=50861, n_bytes=5650827, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=1095.423s, table=1, n_packets=60870, n_bytes=178000227, idle_age=21, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1095.423s, table=1, n_packets=21842, n_bytes=50374376, idle_age=21, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1095.423s, table=1, n_packets=0, n_bytes=0, idle_age=1095, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1095.423s, table=1, n_packets=0, n_bytes=0, idle_age=1095, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL Tue Feb 1 14:29:06 UTC 2022 cookie=0xa9f3b7acc211a828, duration=121.035s, table=0, n_packets=112, n_bytes=13715, idle_age=32, priority=110,tcp,in_port=4,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=121.035s, table=0, n_packets=0, n_bytes=0, idle_age=121, priority=110,tcp,in_port=4,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=121.035s, table=0, n_packets=570, n_bytes=322504, idle_age=0, priority=110,tcp,in_port=3,tp_src=32648 actions=output:4 cookie=0x7d956a9aff1e82d9, duration=121.035s, table=0, n_packets=6, n_bytes=3041, idle_age=501, priority=110,tcp,in_port=3,tp_src=30906 actions=output:4 cookie=0xdeff105, duration=1100.431s, table=0, n_packets=23362, n_bytes=14638804, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=121.035s, table=0, n_packets=50977, n_bytes=5684377, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:4 cookie=0xdeff105, duration=1100.431s, table=1, n_packets=60870, n_bytes=178000227, idle_age=26, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1100.431s, table=1, n_packets=21842, n_bytes=50374376, idle_age=26, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1100.431s, table=1, n_packets=0, n_bytes=0, idle_age=1100, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1100.431s, table=1, n_packets=0, n_bytes=0, idle_age=1100, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL Tue Feb 1 14:29:11 UTC 2022 cookie=0x7d956a9aff1e82d9, duration=3.653s, table=0, n_packets=0, n_bytes=0, idle_age=3, priority=110,tcp,in_port=5,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=3.653s, table=0, n_packets=1, n_bytes=54, idle_age=0, priority=110,tcp,in_port=5,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=3.653s, table=0, n_packets=6, n_bytes=3041, idle_age=506, priority=110,tcp,in_port=3,tp_src=30906 actions=output:5 cookie=0xa9f3b7acc211a828, duration=3.653s, table=0, n_packets=572, n_bytes=322684, idle_age=0, priority=110,tcp,in_port=3,tp_src=32648 actions=output:5 cookie=0xdeff105, duration=1105.436s, table=0, n_packets=23847, n_bytes=14715284, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=3.653s, table=0, n_packets=51520, n_bytes=5757170, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:5 cookie=0xdeff105, duration=1105.436s, table=1, n_packets=61499, n_bytes=180129102, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1105.436s, table=1, n_packets=21966, n_bytes=50723481, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1105.436s, table=1, n_packets=0, n_bytes=0, idle_age=1105, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1105.436s, table=1, n_packets=0, n_bytes=0, idle_age=1105, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL Tue Feb 1 14:29:16 UTC 2022 cookie=0x7d956a9aff1e82d9, duration=8.659s, table=0, n_packets=0, n_bytes=0, idle_age=8, priority=110,tcp,in_port=5,tp_dst=30906 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xa9f3b7acc211a828, duration=8.659s, table=0, n_packets=1, n_bytes=54, idle_age=5, priority=110,tcp,in_port=5,tp_dst=32648 actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0x7d956a9aff1e82d9, duration=8.659s, table=0, n_packets=6, n_bytes=3041, idle_age=511, priority=110,tcp,in_port=3,tp_src=30906 actions=output:5 cookie=0xa9f3b7acc211a828, duration=8.659s, table=0, n_packets=572, n_bytes=322684, idle_age=5, priority=110,tcp,in_port=3,tp_src=32648 actions=output:5 cookie=0xdeff105, duration=1110.442s, table=0, n_packets=23986, n_bytes=14735332, idle_age=0, priority=100,ip,in_port=LOCAL actions=ct(commit,zone=64000,exec(load:0x2->NXM_NX_CT_MARK[])),NORMAL cookie=0xdeff105, duration=8.659s, table=0, n_packets=51926, n_bytes=5803708, idle_age=0, priority=100,ip,in_port=3 actions=ct(commit,zone=64000,exec(load:0x1->NXM_NX_CT_MARK[])),output:5 cookie=0xdeff105, duration=1110.442s, table=1, n_packets=62019, n_bytes=181795696, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1110.442s, table=1, n_packets=22108, n_bytes=51405496, idle_age=0, priority=100,ct_state=+est+trk,ct_mark=0x2,ip actions=LOCAL cookie=0xdeff105, duration=1110.442s, table=1, n_packets=0, n_bytes=0, idle_age=1110, priority=100,ct_state=+rel+trk,ct_mark=0x1,ip actions=check_pkt_larger(8915)->NXM_NX_REG0[0],resubmit(,11) cookie=0xdeff105, duration=1110.442s, table=1, n_packets=0, n_bytes=0, idle_age=1110, priority=100,ct_state=+rel+trk,ct_mark=0x2,ip actions=LOCAL ^C ~~~ That timing might have to do with the ovnkube-node pod restarts on that node: ~~~ [akaris@linux ipi-us-east-1]$ oc get pods -n openshift-ovn-kubernetes -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ovnkube-master-b887s 6/6 Running 0 51m 10.0.164.163 ip-10-0-164-163.ec2.internal <none> <none> ovnkube-master-csttd 6/6 Running 0 53m 10.0.147.226 ip-10-0-147-226.ec2.internal <none> <none> ovnkube-master-pmvfc 6/6 Running 0 55m 10.0.132.217 ip-10-0-132-217.ec2.internal <none> <none> ovnkube-node-74pl5 5/5 Running 0 54m 10.0.135.122 ip-10-0-135-122.ec2.internal <none> <none> ovnkube-node-8cj8d 5/5 Running 0 53m 10.0.162.245 ip-10-0-162-245.ec2.internal <none> <none> ovnkube-node-bbx5q 5/5 Running 0 55m 10.0.147.226 ip-10-0-147-226.ec2.internal <none> <none> ovnkube-node-gl2xk 5/5 Running 14 (2m6s ago) 54m 10.0.155.59 ip-10-0-155-59.ec2.internal <none> <none> ovnkube-node-qfmbm 5/5 Running 0 55m 10.0.164.163 ip-10-0-164-163.ec2.internal <none> <none> ovnkube-node-swms6 5/5 Running 0 53m 10.0.132.217 ip-10-0-132-217.ec2.internal <none> <none> ~~~ ~~~ oc describe pod -n openshift-ovn-kubernetes ovnkube-node-gl2xk (...) I0201 14:27:05.433194 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::b89d:c4ff:fe14:65fa I0201 14:27:05.433199 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::4ad:4aff:fed6:e0c7 I0201 14:27:05.433204 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::5ce4:8dff:fe37:dad I0201 14:27:05.433209 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::a05f:40ff:fe01:2555 I0201 14:27:05.433213 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::ec5e:deff:fe0d:9b0a I0201 14:27:05.433219 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::1447:6dff:fe9f:ef0 I0201 14:27:05.433223 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::e4bd:f3ff:feac:3b97 I0201 14:27:05.433227 21059 node_ip_handler_linux.go:213] Skipping non-useable IP address for host: fe80::7b93:10e:313b:fa77 I0201 14:27:05.439621 21059 helper_linux.go:74] Found default gateway interface br-ex 10.0.144.1 I0201 14:27:05.464040 21059 ovn_db.go:382] Not registering OVN DB Metrics on this Node since OVN DBs are not running on this node. W0201 14:27:05.481993 21059 management-port_linux.go:309] missing management port nat rule in chain OVN-KUBE-SNAT-MGMTPORT, adding it I0201 14:28:39.649559 21059 node_ip_handler_linux.go:63] Removing IP: 10.0.155.59, from node IP manager I0201 14:28:39.649612 21059 kube.go:99] Setting annotations map[k8s.ovn.org/host-addresses:[]] on node ip-10-0-155-59.ec2.internal I0201 14:28:40.553095 21059 node_ip_handler_linux.go:49] Adding IP: 10.0.155.59, to node IP manager I0201 14:28:40.553122 21059 kube.go:99] Setting annotations map[k8s.ovn.org/host-addresses:["10.0.155.59"]] on node ip-10-0-155-59.ec2.internal E0201 14:28:50.622980 21059 healthcheck.go:364] Fatal error: phys port ens5 ofport changed from 4 to 5 ~~~ I also hit a case during my tests where NM disconnected ens5 from br-ex and would never reconnect it until I manually fixed the issue with: ~~~ ovs-vsctl add-port br-ex ens5 ~~~ And that then lead to ovnkube-node crash looping: ~~~ I0201 14:06:57.863647 214702 ovs.go:204] exec(14): /usr/bin/ovs-vsctl --timeout=15 br-get-external-id br-ex bridge-uplink I0201 14:06:57.869210 214702 ovs.go:207] exec(14): stdout: "" I0201 14:06:57.869230 214702 ovs.go:208] exec(14): stderr: "" I0201 14:06:57.869243 214702 ovs.go:204] exec(15): /usr/bin/ovs-vsctl --timeout=15 get interface -ex ofport I0201 14:06:57.874760 214702 ovs.go:207] exec(15): stdout: "" I0201 14:06:57.874779 214702 ovs.go:208] exec(15): stderr: "ovs-vsctl: no row \"-ex\" in table Interface\n" I0201 14:06:57.874785 214702 ovs.go:210] exec(15): err: exit status 1 F0201 14:06:57.874823 214702 ovnkube.go:133] Bridge for interface failed for br-ex: Failed to find intfName for br-ex: failed to get ofport of -ex, stderr: "ovs-vsctl: no row \"-ex\" in table Interface\n", error: exit status 1 ~~~
Fix upstream: https://github.com/ovn-org/ovn-kubernetes/pull/2785
I also created https://bugzilla.redhat.com/show_bug.cgi?id=2049103
Unable to reproduce the issue with simulated link flap (`ip l s ens3 down` on host or virtual network adapter disconnect/reconnect on vSphere). Have not tested link flap on baremetal. It looks like link flaps can't trigger the issue. I'm not sure what endogenous events could cause NetworkManager to stop or restart. This is a low probability, high impact issue. If it happens nodes will eventually fail connectivity health checks in things like AWS UI, so admin will see node as unhealthy and reboot in the rare case that it happens. We could document something like "if NetworkManager ever restarts the node will lose network connectivity and must be recovered." Customers should be advised to not manually restart NetworkManager without atomically restarting openvswitch as well, as indicated in Comment 9.
Removing "link flap" from summary, issue does not occur on link flap. Clarifying release note text, issue only occurs on OVNKubernetes. Doc Text verified.
Verified on 4.11.0-0.nightly-2022-03-04-063157 Starting pod/rbrattai-o411i53-j7xqh-master-0-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="2" ) Starting pod/rbrattai-o411i53-j7xqh-master-1-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="2" ) Starting pod/rbrattai-o411i53-j7xqh-master-2-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="2" ) Starting pod/rbrattai-o411i53-j7xqh-worker-1-q9lvk-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="1" ) Starting pod/rbrattai-o411i53-j7xqh-worker-2-2xqt4-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="1" ) Starting pod/rbrattai-o411i53-j7xqh-worker-3-9mf6k-debug ... /run/ofport_requests.br-ex:declare -A INTERFACES=([ens3]="2" ) Mar 04 17:52:55.273869 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: + INTERFACE_NAME=ens3 Mar 04 17:52:55.274703 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: ++ awk -F : '{if($1=="ens3" && $2!~/^ovs*/) print $NF}' Mar 04 17:52:56.257974 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: + ovs-vsctl list interface ens3 Mar 04 17:52:56.284102 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: declare -A INTERFACES=([ens3]="2" ) Mar 04 17:52:56.284341 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: ++ INTERFACES=([ens3]="2") Mar 04 17:52:56.284632 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[172425]: + ovs-vsctl set Interface ens3 ofport_request=2 Mar 04 18:13:34.749790 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: + INTERFACE_NAME=ens3 Mar 04 18:13:34.750666 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: ++ awk -F : '{if($1=="ens3" && $2!~/^ovs*/) print $NF}' Mar 04 18:13:35.780342 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: + ovs-vsctl list interface ens3 Mar 04 18:13:35.805120 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: declare -A INTERFACES=([ens3]="2" ) Mar 04 18:13:35.805293 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: ++ INTERFACES=([ens3]="2") Mar 04 18:13:35.805505 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[223669]: + ovs-vsctl set Interface ens3 ofport_request=2 Mar 04 18:30:45.785195 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: + INTERFACE_NAME=ens3 Mar 04 18:30:45.786035 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: ++ awk -F : '{if($1=="ens3" && $2!~/^ovs*/) print $NF}' Mar 04 18:30:46.804640 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: + ovs-vsctl list interface ens3 Mar 04 18:30:46.829820 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: declare -A INTERFACES=([ens3]="2" ) Mar 04 18:30:46.830011 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: ++ INTERFACES=([ens3]="2") Mar 04 18:30:46.830184 rbrattai-o411i53-j7xqh-master-0 nm-dispatcher[267177]: + ovs-vsctl set Interface ens3 ofport_request=2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069