Bug 1996886 - timed out waiting for flows during pod creation and ovn-controller pegged on worker nodes
Summary: timed out waiting for flows during pod creation and ovn-controller pegged on worker nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 2011110
Depends On: 1978605
Blocks: 2011385
 
Reported: 2021-08-23 21:57 UTC by Sai Sindhur Malleni
Modified: 2023-09-15 01:14 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2011385
Environment:
Last Closed: 2022-03-10 16:05:43 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:06:24 UTC)

Description Sai Sindhur Malleni 2021-08-23 21:57:21 UTC
Description of problem:
The setup is based on OCP 4.7.11, running local gateway mode with ICNI 1.0 and ICNI 2.0 configured. There are 140 "loadbalancer" pods running in a namespace; these establish BFD sessions with the ovn-controllers on the worker nodes. We then create 35 projects, each containing 30 configmaps, 38 secrets, 38 pods, and 38 services. In each namespace, all of the pods are endpoints of one of the services; the remaining 37 services have no endpoints.
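For reference, the per-namespace object mix can be approximated with a loop like the one below. This is only an illustrative sketch; the real objects come from the kube-burner templates, and the namespace/object names here are assumptions.

for ns in $(seq 1 35); do
  oc create namespace served-ns-$ns
  for i in $(seq 1 30); do oc -n served-ns-$ns create configmap cm-$i --from-literal=key=value; done
  for i in $(seq 1 38); do oc -n served-ns-$ns create secret generic secret-$i --from-literal=key=value; done
  # The 38 pods and 38 services per namespace come from templates; one service selects the pods,
  # while the other 37 services have no matching endpoints.
done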

We observe that pods take a long time to come up, with errors such as:

Warning FailedCreatePodSandBox 27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-9-job-34_served-ns-34_1ac1c4be-2e1d-45fe-bf27-c33b7a5100b2_0(4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb): [served-ns-34/pod-served-1-9-job-34:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[served-ns-34/pod-served-1-9-job-34 4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb] [served-ns-34/pod-served-1-9-job-34 4145ac66123a991c1502f9a0e401448d9e79d8e6aab94e20647ec3c8acd772eb] timed out waiting for annotations
'
 Warning FailedCreatePodSandBox 5s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-served-1-9-job-34_served-ns-34_1ac1c4be-2e1d-45fe-bf27-c33b7a5100b2_0(d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1): [served-ns-34/pod-served-1-9-job-34:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[served-ns-34/pod-served-1-9-job-34 d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1] [served-ns-34/pod-served-1-9-job-34 d071c130b033127aea00a3e2ded082df33a44bec41ccc5fea7bd5a45403e27f1] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows


In some cases, pods never come up at all.
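The sandbox failures above can be pulled from an affected namespace with something like the following (namespace taken from this run; the field selector is just one way to filter):

oc get events -n served-ns-34 --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp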

We notice that ovn-controller on the worker nodes is pegged and logs many warnings about unreasonably long poll intervals.

=============================================================================

2021-08-23T16:37:08Z|05144|timeval|WARN|Unreasonably long 17196ms poll interval (17113ms user, 8ms system)
2021-08-23T16:37:26Z|05152|timeval|WARN|Unreasonably long 17231ms poll interval (17147ms user, 8ms system)
2021-08-23T16:37:43Z|05158|timeval|WARN|Unreasonably long 17127ms poll interval (17046ms user, 7ms system)
2021-08-23T16:38:01Z|05167|timeval|WARN|Unreasonably long 17310ms poll interval (17224ms user, 10ms system)
2021-08-23T16:38:18Z|05174|timeval|WARN|Unreasonably long 17291ms poll interval (17207ms user, 8ms system)
2021-08-23T16:38:36Z|05181|timeval|WARN|Unreasonably long 17267ms poll interval (17179ms user, 12ms system)
2021-08-23T16:38:53Z|05190|timeval|WARN|Unreasonably long 17400ms poll interval (17320ms user, 3ms system)
2021-08-23T16:39:11Z|05197|timeval|WARN|Unreasonably long 17372ms poll interval (17289ms user, 7ms system)
2021-08-23T16:39:28Z|05204|timeval|WARN|Unreasonably long 17362ms poll interval (17282ms user, 3ms system)
2021-08-23T16:39:46Z|05211|timeval|WARN|Unreasonably long 17407ms poll interval (17318ms user, 12ms system)
2021-08-23T16:40:03Z|05220|timeval|WARN|Unreasonably long 17219ms poll interval (17133ms user, 11ms system)
2021-08-23T16:40:21Z|05227|timeval|WARN|Unreasonably long 17322ms poll interval (17241ms user, 5ms system)
2021-08-23T16:40:38Z|05234|timeval|WARN|Unreasonably long 17519ms poll interval (17438ms user, 3ms system)
2021-08-23T16:40:56Z|05244|timeval|WARN|Unreasonably long 17373ms poll interval (17286ms user, 10ms system)
2021-08-23T16:41:14Z|05250|timeval|WARN|Unreasonably long 17666ms poll interval (17581ms user, 5ms system)
2021-08-23T16:41:36Z|05259|timeval|WARN|Unreasonably long 17879ms poll interval (17792ms user, 7ms system)
2021-08-23T16:41:54Z|05268|timeval|WARN|Unreasonably long 17917ms poll interval (17828ms user, 8ms system)
2021-08-23T16:42:12Z|05275|timeval|WARN|Unreasonably long 18003ms poll interval (17916ms user, 7ms system)
2021-08-23T16:42:30Z|05282|timeval|WARN|Unreasonably long 18025ms poll interval (17937ms user, 9ms system)
2021-08-23T16:42:48Z|05289|timeval|WARN|Unreasonably long 17974ms poll interval (17881ms user, 13ms system)
2021-08-23T16:43:07Z|05298|timeval|WARN|Unreasonably long 18005ms poll interval (17914ms user, 11ms system)
2021-08-23T16:43:25Z|05305|timeval|WARN|Unreasonably long 18133ms poll interval (18048ms user, 4ms system)
2021-08-23T16:43:43Z|05312|timeval|WARN|Unreasonably long 17975ms poll interval (17883ms user, 14ms system)
2021-08-23T16:44:01Z|05321|timeval|WARN|Unreasonably long 18272ms poll interval (18179ms user, 12ms system)
====================================================================
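These warnings come from the ovn-controller container logs on the affected nodes; something like the following reproduces the extraction (pod name is a placeholder):

oc logs -n openshift-ovn-kubernetes ovnkube-node-<node-suffix> -c ovn-controller | grep 'Unreasonably long'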


ovnkube-node-xkhpg   ovn-controller    559m         13761Mi         
ovnkube-node-xkhpg   ovnkube-node      521m         151Mi           
ovnkube-node-xp72n   kube-rbac-proxy   0m           33Mi            
ovnkube-node-xp72n   ovn-controller    1300m        13777Mi         
ovnkube-node-xp72n   ovnkube-node      465m         155Mi           
ovnkube-node-xwk9h   kube-rbac-proxy   0m           33Mi            
ovnkube-node-xwk9h   ovn-controller    624m         13717Mi         
ovnkube-node-xwk9h   ovnkube-node      894m         150Mi           
ovnkube-node-xzwk4   kube-rbac-proxy   0m           32Mi            
ovnkube-node-xzwk4   ovn-controller    1075m        13771Mi         
ovnkube-node-xzwk4   ovnkube-node      855m         150Mi           
ovnkube-node-z4jw8   kube-rbac-proxy   0m           33Mi            
ovnkube-node-z4jw8   ovn-controller    1096m        13752Mi         
ovnkube-node-z4jw8   ovnkube-node      916m         139Mi           
ovnkube-node-z5jq6   kube-rbac-proxy   0m           33Mi            
ovnkube-node-z5jq6   ovn-controller    1208m        13747Mi         
ovnkube-node-z5jq6   ovnkube-node      619m         148Mi           
ovnkube-node-z64vn   kube-rbac-proxy   0m           28Mi            
ovnkube-node-z64vn   ovn-controller    1138m        14300Mi         
ovnkube-node-z64vn   ovnkube-node      2m           134Mi           
ovnkube-node-zbsj9   kube-rbac-proxy   0m           35Mi            
ovnkube-node-zbsj9   ovn-controller    757m         13771Mi         
ovnkube-node-zbsj9   ovnkube-node      969m         147Mi           
ovnkube-node-zgmzv   kube-rbac-proxy   0m           33Mi            
ovnkube-node-zgmzv   ovn-controller    1034m        13756Mi         
ovnkube-node-zgmzv   ovnkube-node      813m         154Mi           
ovnkube-node-zsvks   kube-rbac-proxy   0m           34Mi            
ovnkube-node-zsvks   ovn-controller    1218m        13704Mi         
ovnkube-node-zsvks   ovnkube-node      779m         147Mi    
=====================================================================
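The figures above are per-container CPU and memory usage for the ovnkube-node pods; output of this shape typically comes from something like:

oc adm top pods -n openshift-ovn-kubernetes --containers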

There are also several egress firewalls.
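They can be listed cluster-wide with (assuming the ovn-kubernetes EgressFirewall CRD resolves under its usual resource name):

oc get egressfirewall --all-namespaces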

Version-Release number of selected component (if applicable):
4.7.11
ovn2.13-20.12.0-24.el8fdp.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run the cluster-density test with kube-burner templates for ICNI 2.0 (see the illustrative invocation after these steps)
2. Observe pod creation
3.
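An illustrative kube-burner invocation is shown below; the config file name is an assumption, and the actual ICNI 2.0 templates come from the perf/scale test tooling.

kube-burner init -c cluster-density-icni2.yml --uuid $(uuidgen)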

Actual results:
Several errors/warnings during pod creation lead to high pod-ready latencies.
Also, several pods never reach the Running state.

Expected results:

The test should complete successfully.

Additional info:

Comment 2 Sai Sindhur Malleni 2021-08-23 22:28:53 UTC
We also observed several nbdb leader elections in this case.

Comment 4 Sai Sindhur Malleni 2021-08-25 18:53:51 UTC
(In reply to Tim Rozet from comment #3)
> Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1962344

Tim,

But it seems the fix for that applies only to shared gateway mode, not local gateway mode? We are using local gateway mode in our tests.

Comment 5 Surya Seetharaman 2021-08-26 20:51:22 UTC
(In reply to Sai Sindhur Malleni from comment #4)
> (In reply to Tim Rozet from comment #3)
> > Could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1962344
> 
> Tim,
> 
> But seems like the fix to that is only for shared gateway mode but not local
> gateway mode? We are using local gateway mode in our tests.

Tim has been talking to Han and team about doing the same for LGW; if not, we'll take it up ourselves.

Btw, I know we have a lot of 4.7.z scale bugs related to staleness and/or pod latency and/or pod creation/deletion. I'll need to do some live debugging on the cluster to discover what's happening. At the very least, when you see such problems could you grab a full must-gather (including gather_network_logs) and attach it to the BZ please? I can try to go through that at least.
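For reference, the requested data can be collected with something like the following and attached to the BZ:

oc adm must-gather
oc adm must-gather -- gather_network_logs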

cc @smalleni and @jlema

Comment 6 Tim Rozet 2021-09-14 19:04:50 UTC
This bug will be used to track the local gateway changes, which will be irrelevant for 4.9 and later.

Comment 7 Tim Rozet 2021-09-24 15:56:39 UTC
Filed a bug in OVN to track the dependency there: https://bugzilla.redhat.com/show_bug.cgi?id=2007694

Comment 8 Dan Williams 2021-09-24 17:57:07 UTC
4.7.24 and later versions include OVN 20.12-140, which has a number of fixes that greatly reduce logical flows, including for ICNI v2.

Do we see an improvement with 4.7.24 and later in the tests?
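One hedged way to compare logical flow counts before and after the upgrade is to query the southbound DB from an ovnkube-master pod; the pod and container names below are assumptions about the usual 4.7 layout:

oc exec -n openshift-ovn-kubernetes <ovnkube-master-pod> -c sbdb -- \
  ovn-sbctl --no-leader-only --columns=_uuid list Logical_Flow | grep -c _uuid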

Comment 9 Sai Sindhur Malleni 2021-10-04 13:52:10 UTC
Seeing this in 4.7.28 as well; provided the must-gather and DBs to Dan over Slack.

Comment 10 Tim Rozet 2021-10-06 14:14:29 UTC
The root cause of this was determined to be https://bugzilla.redhat.com/show_bug.cgi?id=1978605

Comment 11 Tim Rozet 2021-10-06 14:15:58 UTC
The OVN version with the fix is already present in 4.10 and 4.9.

Comment 12 Tim Rozet 2021-10-06 14:16:39 UTC
*** Bug 2011110 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2022-03-10 16:05:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 21 Red Hat Bugzilla 2023-09-15 01:14:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

