Description of problem:

The setup is based on OCP 4.7.11 with local gateway mode and both ICNI 1.0 and ICNI 2.0 configured. There are 140 "loadbalancer" pods running in a namespace that establish BFD sessions with the ovn-controllers on the worker nodes. We then create 35 projects, each with 30 configmaps, 38 secrets, 38 pods and 38 services per namespace. All of the pods are endpoints of one of the services; the remaining 37 services have no endpoints.

We observe that pods take a long time to come up, with errors such as:

  Warning FailedCreatePodSandBox 27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-9-job-34_served-ns-34_1ac1c4be-2e1d-45fe-bf27-c33b7a5100b2_0(4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb): [served-ns-34/pod-served-1-9-job-34:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[served-ns-34/pod-served-1-9-job-34 4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb] [served-ns-34/pod-served-1-9-job-34 4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb] timed out waiting for annotations '

  Warning FailedCreatePodSandBox 5s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-9-job-34_served-ns-34_1ac1c4be-2e1d-45fe-bf27-c33b7a5100b2_0(d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1): [served-ns-34/pod-served-1-9-job-34:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[served-ns-34/pod-served-1-9-job-34 d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1] [served-ns-34/pod-served-1-9-job-34 d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows

In some cases pods never come up at all. We also notice that ovn-controller on the worker nodes is pegged and logs many warnings about unreasonably long poll intervals:
=============================================================================
2021-08-23T16:37:08Z|05144|timeval|WARN|Unreasonably long 17196ms poll interval (17113ms user, 8ms system)
2021-08-23T16:37:26Z|05152|timeval|WARN|Unreasonably long 17231ms poll interval (17147ms user, 8ms system)
2021-08-23T16:37:43Z|05158|timeval|WARN|Unreasonably long 17127ms poll interval (17046ms user, 7ms system)
2021-08-23T16:38:01Z|05167|timeval|WARN|Unreasonably long 17310ms poll interval (17224ms user, 10ms system)
2021-08-23T16:38:18Z|05174|timeval|WARN|Unreasonably long 17291ms poll interval (17207ms user, 8ms system)
2021-08-23T16:38:36Z|05181|timeval|WARN|Unreasonably long 17267ms poll interval (17179ms user, 12ms system)
2021-08-23T16:38:53Z|05190|timeval|WARN|Unreasonably long 17400ms poll interval (17320ms user, 3ms system)
2021-08-23T16:39:11Z|05197|timeval|WARN|Unreasonably long 17372ms poll interval (17289ms user, 7ms system)
2021-08-23T16:39:28Z|05204|timeval|WARN|Unreasonably long 17362ms poll interval (17282ms user, 3ms system)
2021-08-23T16:39:46Z|05211|timeval|WARN|Unreasonably long 17407ms poll interval (17318ms user, 12ms system)
2021-08-23T16:40:03Z|05220|timeval|WARN|Unreasonably long 17219ms poll interval (17133ms user, 11ms system)
2021-08-23T16:40:21Z|05227|timeval|WARN|Unreasonably long 17322ms poll interval (17241ms user, 5ms system)
2021-08-23T16:40:38Z|05234|timeval|WARN|Unreasonably long 17519ms poll interval (17438ms user, 3ms system)
2021-08-23T16:40:56Z|05244|timeval|WARN|Unreasonably long 17373ms poll interval (17286ms user, 10ms system)
2021-08-23T16:41:14Z|05250|timeval|WARN|Unreasonably long 17666ms poll interval (17581ms user, 5ms system)
2021-08-23T16:41:36Z|05259|timeval|WARN|Unreasonably long 17879ms poll interval (17792ms user, 7ms system)
2021-08-23T16:41:54Z|05268|timeval|WARN|Unreasonably long 17917ms poll interval (17828ms user, 8ms system)
2021-08-23T16:42:12Z|05275|timeval|WARN|Unreasonably long 18003ms poll interval (17916ms user, 7ms system)
2021-08-23T16:42:30Z|05282|timeval|WARN|Unreasonably long 18025ms poll interval (17937ms user, 9ms system)
2021-08-23T16:42:48Z|05289|timeval|WARN|Unreasonably long 17974ms poll interval (17881ms user, 13ms system)
2021-08-23T16:43:07Z|05298|timeval|WARN|Unreasonably long 18005ms poll interval (17914ms user, 11ms system)
2021-08-23T16:43:25Z|05305|timeval|WARN|Unreasonably long 18133ms poll interval (18048ms user, 4ms system)
2021-08-23T16:43:43Z|05312|timeval|WARN|Unreasonably long 17975ms poll interval (17883ms user, 14ms system)
2021-08-23T16:44:01Z|05321|timeval|WARN|Unreasonably long 18272ms poll interval (18179ms user, 12ms system)
====================================================================
POD                   CONTAINER         CPU     MEMORY
ovnkube-node-xkhpg    ovn-controller    559m    13761Mi
ovnkube-node-xkhpg    ovnkube-node      521m    151Mi
ovnkube-node-xp72n    kube-rbac-proxy   0m      33Mi
ovnkube-node-xp72n    ovn-controller    1300m   13777Mi
ovnkube-node-xp72n    ovnkube-node      465m    155Mi
ovnkube-node-xwk9h    kube-rbac-proxy   0m      33Mi
ovnkube-node-xwk9h    ovn-controller    624m    13717Mi
ovnkube-node-xwk9h    ovnkube-node      894m    150Mi
ovnkube-node-xzwk4    kube-rbac-proxy   0m      32Mi
ovnkube-node-xzwk4    ovn-controller    1075m   13771Mi
ovnkube-node-xzwk4    ovnkube-node      855m    150Mi
ovnkube-node-z4jw8    kube-rbac-proxy   0m      33Mi
ovnkube-node-z4jw8    ovn-controller    1096m   13752Mi
ovnkube-node-z4jw8    ovnkube-node      916m    139Mi
ovnkube-node-z5jq6    kube-rbac-proxy   0m      33Mi
ovnkube-node-z5jq6    ovn-controller    1208m   13747Mi
ovnkube-node-z5jq6    ovnkube-node      619m    148Mi
ovnkube-node-z64vn    kube-rbac-proxy   0m      28Mi
ovnkube-node-z64vn    ovn-controller    1138m   14300Mi
ovnkube-node-z64vn    ovnkube-node      2m      134Mi
ovnkube-node-zbsj9    kube-rbac-proxy   0m      35Mi
ovnkube-node-zbsj9    ovn-controller    757m    13771Mi
ovnkube-node-zbsj9    ovnkube-node      969m    147Mi
ovnkube-node-zgmzv    kube-rbac-proxy   0m      33Mi
ovnkube-node-zgmzv    ovn-controller    1034m   13756Mi
ovnkube-node-zgmzv    ovnkube-node      813m    154Mi
ovnkube-node-zsvks    kube-rbac-proxy   0m      34Mi
ovnkube-node-zsvks    ovn-controller    1218m   13704Mi
ovnkube-node-zsvks    ovnkube-node      779m    147Mi
=====================================================================

There are also several egress firewalls.

Version-Release number of selected component (if applicable):
4.7.11
ovn2.13-20.12.0-24.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run the cluster-density test with the kube-burner templates for ICNI 2.0 (a rough stand-in is sketched under Additional info below)
2. Observe pod creation

Actual results:
Several errors/warnings on pod creation that lead to high pod-ready latencies. Also, several pods never go into Running.

Expected results:
The test should complete successfully.

Additional info:
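As a rough stand-in for the kube-burner cluster-density templates referenced in the reproduce steps (the actual templates are not attached here), the following sketch creates the per-namespace object counts described above using the Python kubernetes client. The namespace/object names, labels and image are hypothetical and only illustrate the layout (one service with endpoints, 37 without); error handling and pacing are omitted.

# Illustrative sketch only -- not the actual kube-burner templates.
# Assumes the official "kubernetes" Python client and a kubeconfig with
# permission to create namespaces; names and image below are made up.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACES, CONFIGMAPS, SECRETS, PODS, SERVICES = 35, 30, 38, 38, 38

for ns_idx in range(1, NAMESPACES + 1):
    ns = f"served-ns-{ns_idx}"
    v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))

    for i in range(1, CONFIGMAPS + 1):
        v1.create_namespaced_config_map(ns, client.V1ConfigMap(
            metadata=client.V1ObjectMeta(name=f"cm-{i}"), data={"key": "value"}))

    for i in range(1, SECRETS + 1):
        v1.create_namespaced_secret(ns, client.V1Secret(
            metadata=client.V1ObjectMeta(name=f"secret-{i}"),
            string_data={"key": "value"}))

    for i in range(1, PODS + 1):
        v1.create_namespaced_pod(ns, client.V1Pod(
            metadata=client.V1ObjectMeta(name=f"pod-served-{i}",
                                         labels={"app": "served"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="app", image="registry.example.com/sample-app:latest",
                ports=[client.V1ContainerPort(container_port=8080)])])))

    # Only the first service selects the pods; the remaining 37 services
    # intentionally have no matching endpoints.
    for i in range(1, SERVICES + 1):
        selector = {"app": "served"} if i == 1 else {"app": f"unmatched-{i}"}
        v1.create_namespaced_service(ns, client.V1Service(
            metadata=client.V1ObjectMeta(name=f"svc-{i}"),
            spec=client.V1ServiceSpec(selector=selector,
                ports=[client.V1ServicePort(port=80, target_port=8080)])))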
We also observed several nbdb leader elections in this case.
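To quantify the symptoms while the test runs, something along the lines of the following sketch can be used; it is a diagnostic aid we did not use verbatim, assuming the Python kubernetes client and the served-ns-* namespace naming seen in the errors above. It counts FailedCreatePodSandBox events and reports creation-to-ready latency per pod.

# Minimal diagnostic sketch (assumptions: "kubernetes" Python client installed,
# namespaces named served-ns-*). Not part of the original test harness.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

sandbox_failures = 0
latencies = []

for ns in v1.list_namespace().items:
    name = ns.metadata.name
    if not name.startswith("served-ns-"):
        continue

    # Count FailedCreatePodSandBox warnings in this namespace.
    for ev in v1.list_namespaced_event(name).items:
        if ev.reason == "FailedCreatePodSandBox":
            sandbox_failures += 1

    # Pod-ready latency = Ready condition transition time - creation timestamp.
    for pod in v1.list_namespaced_pod(name).items:
        ready = next((c for c in (pod.status.conditions or [])
                      if c.type == "Ready" and c.status == "True"), None)
        if ready:
            latencies.append(
                (ready.last_transition_time - pod.metadata.creation_timestamp)
                .total_seconds())

print(f"FailedCreatePodSandBox events: {sandbox_failures}")
if latencies:
    print(f"pods ready: {len(latencies)}, "
          f"max ready latency: {max(latencies):.0f}s, "
          f"p50: {sorted(latencies)[len(latencies)//2]:.0f}s")

High values here correspond to the sandbox retries and the long ovn-controller poll intervals shown above.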
Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1962344
(In reply to Tim Rozet from comment #3)
> Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1962344

Tim,

It seems like the fix for that is only for shared gateway mode, not local gateway mode? We are using local gateway mode in our tests.
(In reply to Sai Sindhur Malleni from comment #4)
> (In reply to Tim Rozet from comment #3)
> > Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1962344
>
> Tim,
>
> It seems like the fix for that is only for shared gateway mode, not local
> gateway mode? We are using local gateway mode in our tests.

Tim has been talking to Han and team about doing the same for LGW; if not, we'll take it up ourselves. Btw, I know we have a lot of 4.7.z scale bugs related to staleness and/or pod latency and/or creation/deletion. I'll need to do some live debugging on the cluster to discover what's happening. At the very least, when you see such problems happening, could you grab a full must-gather (including gather_network_logs) and attach it to the BZ, please? I can try and go through that at least.

cc @smalleni and @jlema
This bug will be used to track the local gateway changes, which will be irrelevant for 4.9 and later.
Filed a bug in OVN to track the dependency there: https://bugzilla.redhat.com/show_bug.cgi?id=2007694
4.7.24 and later versions include OVN 20.12-140, which has a number of fixes that greatly reduce the number of logical flows, including for ICNIv2. Do we see an improvement in the tests with 4.7.24 and later?
Seeing this in 4.7.28 as well - provided the must-gather and DBs to Dan over slack.
The root cause of this was determined to be https://bugzilla.redhat.com/show_bug.cgi?id=1978605
The OVN version with the fix is already present in 4.10 and 4.9.
*** Bug 2011110 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days