Bug 2028493

Summary: OVN-migration failed - ovnkube-node: error waiting for node readiness: timed out waiting for the condition
Product: OpenShift Container Platform
Component: Networking
Sub-component: ovn-kubernetes
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Yurii Prokulevych <yprokule>
Assignee: Tim Rozet <trozet>
QA Contact: Ross Brattain <rbrattai>
CC: achernet, anusaxen, cgoncalves, ctrautma, dceara, eglottma, fpaoline, jcaamano, jiji, tonyg, trozet
Bug Blocks: 2077129 (view as bug list)
Type: Bug
Last Closed: 2022-08-10 10:40:31 UTC

Description Yurii Prokulevych 2021-12-02 13:56:16 UTC
Description of problem:
-----------------------
OVN migration to a second interface failed because the ovnkube-node pod keeps failing its readiness probe.

oc get po -n openshift-ovn-kubernetes -o wide | awk /openshift-master-1/
ovnkube-master-5sw2m   6/6     Running   12               109m   10.1.208.21   openshift-master-1   <none>           <none>
ovnkube-node-b6lmg     3/4     Running   13 (3m57s ago)   109m   10.1.208.21   openshift-master-1   <none>           <none>

[kni ~]$ oc describe po ovnkube-node-b6lmg -n openshift-ovn-kubernetes
Name:                 ovnkube-node-b6lmg
Namespace:            openshift-ovn-kubernetes
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 openshift-master-1/10.1.208.21
Start Time:           Thu, 02 Dec 2021 06:46:47 -0500
Labels:               app=ovnkube-node
                      component=network
                      controller-revision-hash=5895968979
                      kubernetes.io/os=linux
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          networkoperator.openshift.io/ip-family-mode: single-stack
Status:               Running
IP:                   10.1.208.21
IPs:
  IP:           10.1.208.21
Controlled By:  DaemonSet/ovnkube-node
Containers:
  ovn-controller:
    Container ID:  cri-o://68889fc9266ce191823569e745c5001e37d9c7929726d6723422532764bf2c31
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -e
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi

      echo "$(date -Iseconds) - starting ovn-controller"
      exec ovn-controller unix:/var/run/openvswitch/db.sock -vfile:off \
        --no-chdir --pidfile=/var/run/ovn/ovn-controller.pid \
        --syslog-method="null" \
        --log-file=/var/log/ovn/acl-audit-log.log \
        -vFACILITY:"local0" \
        -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt \
        -vconsole:"${OVN_LOG_LEVEL}" -vconsole:"acl_log:off" \
        -vPATTERN:console:"%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" \
        -vsyslog:"acl_log:info" \
        -vfile:"acl_log:info"

    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:     10m
      memory:  300Mi
    Environment:
      OVN_LOG_LEVEL:  info
      K8S_NODE:        (v1:spec.nodeName)
    Mounts:
      /dev/log from log-socket (rw)
      /env from env-overrides (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn/ from run-ovn (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  ovn-acl-logging:
    Container ID:  cri-o://16092f57da4eb0329c0978a6a8ec905243e38d0622aae9c397ba1b2d026b16f7
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      set -euo pipefail

      # Rotate audit log files when they get to max size (in bytes)
      MAXFILESIZE=$(( "50"*1000000 ))
      LOGFILE=/var/log/ovn/acl-audit-log.log
      CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)

      # Redirect err to null so no messages are shown upon rotation
      tail -F ${LOGFILE} 2> /dev/null &

      while true
      do
        # Make sure ovn-controller's logfile exists, and get current size in bytes
        if [ -f "$LOGFILE" ]; then
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        else
          ovs-appctl -t /var/run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          file_size=`du -b ${LOGFILE} | tr -s '\t' ' ' | cut -d' ' -f1`
        fi

        if [ $file_size -gt $MAXFILESIZE ];then
          echo "Rotating OVN ACL Log File"
          timestamp=`date '+%Y-%m-%dT%H-%M-%S'`
          mv ${LOGFILE} /var/log/ovn/acl-audit-log.$timestamp.log
          ovs-appctl -t /run/ovn/ovn-controller.${CONTROLLERPID}.ctl vlog/reopen
          CONTROLLERPID=$(cat /run/ovn/ovn-controller.pid)
        fi

        # sleep for 30 seconds to avoid wasting CPU
        sleep 30
      done
    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /run/ovn/ from run-ovn (rw)
      /var/log/ovn from node-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://df3da787138c488d1dfa74c998d1102a62da28d25a4de93a55246dc3cf4e27a7
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc
    Port:          9103/TCP
    Host Port:     9103/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      TLS_PK=/etc/pki/tls/metrics-cert/tls.key
      TLS_CERT=/etc/pki/tls/metrics-cert/tls.crt
      # As the secret mount is optional we must wait for the files to be present.
      # The service is created in monitor.yaml and this is created in sdn.yaml.
      # If it isn't created there is probably an issue so we want to crashloop.
      retries=0
      TS=$(date +%s)
      WARN_TS=$(( ${TS} + $(( 20 * 60)) ))
      HAS_LOGGED_INFO=0

      log_missing_certs(){
          CUR_TS=$(date +%s)
          if [[ "${CUR_TS}" -gt "WARN_TS"  ]]; then
            echo $(date -Iseconds) WARN: ovn-node-metrics-cert not mounted after 20 minutes.
          elif [[ "${HAS_LOGGED_INFO}" -eq 0 ]] ; then
            echo $(date -Iseconds) INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
            HAS_LOGGED_INFO=1
          fi
      }
      while [[ ! -f "${TLS_PK}" ||  ! -f "${TLS_CERT}" ]] ; do
        log_missing_certs
        sleep 5
      done

      echo $(date -Iseconds) INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
      exec /usr/bin/kube-rbac-proxy \
        --logtostderr \
        --secure-listen-address=:9103 \
        --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 \
        --upstream=http://127.0.0.1:29103/ \
        --tls-private-key-file=${TLS_PK} \
        --tls-cert-file=${TLS_CERT}
    State:          Running
      Started:      Thu, 02 Dec 2021 07:36:17 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/pki/tls/metrics-cert from ovn-node-metrics-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
  ovnkube-node:
    Container ID:  cri-o://32cbd581af2f60dabb7cad4f8b46b26f3bd31c7ad1200ea9ddeb44b400f9d613
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e
    Port:          29103/TCP
    Host Port:     29103/TCP
    Command:
      /bin/bash
      -c
      set -xe
      if [[ -f "/env/${K8S_NODE}" ]]; then
        set -o allexport
        source "/env/${K8S_NODE}"
        set +o allexport
      fi
      echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db_ip addresses"
      cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
      ovn_config_namespace=openshift-ovn-kubernetes
      echo "I$(date "+%m%d %H:%M:%S.%N") - disable conntrack on geneve port"
      iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
      ip6tables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
      retries=0
      while true; do
        # TODO: change to use '--request-timeout=30s', if https://github.com/kubernetes/kubernetes/issues/49343 is fixed.
        db_ip=$(timeout 30 kubectl get ep -n ${ovn_config_namespace} ovnkube-db -o jsonpath='{.subsets[0].addresses[0].ip}')
        if [[ -n "${db_ip}" ]]; then
          break
        fi
        (( retries += 1 ))
        if [[ "${retries}" -gt 40 ]]; then
          echo "E$(date "+%m%d %H:%M:%S.%N") - db endpoint never came up"
          exit 1
        fi
        echo "I$(date "+%m%d %H:%M:%S.%N") - waiting for db endpoint"
        sleep 5
      done

      echo "I$(date "+%m%d %H:%M:%S.%N") - starting ovnkube-node db_ip ${db_ip}"
      if [ "shared" == "shared" ]; then
        gateway_mode_flags="--gateway-mode shared --gateway-interface br-ex"
      elif [ "shared" == "local" ]; then
        gateway_mode_flags="--gateway-mode local --gateway-interface br-ex"
      else
        echo "Invalid OVN_GATEWAY_MODE: \"shared\". Must be \"local\" or \"shared\"."
        exit 1
      fi

      export_network_flows_flags=
      if [[ -n "${NETFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="--netflow-targets ${NETFLOW_COLLECTORS}"
      fi
      if [[ -n "${SFLOW_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --sflow-targets ${SFLOW_COLLECTORS}"
      fi
      if [[ -n "${IPFIX_COLLECTORS}" ]] ; then
        export_network_flows_flags="$export_network_flows_flags --ipfix-targets ${IPFIX_COLLECTORS}"
      fi
      gw_interface_flag=
      # if br-ex1 is configured on the node, we want to use it for external gateway traffic
      if [ -d /sys/class/net/br-ex1 ]; then
        gw_interface_flag="--exgw-interface=br-ex1"
      fi
      exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
        --nb-address "ssl:10.1.208.20:9641,ssl:10.1.208.21:9641,ssl:10.1.208.22:9641" \
        --sb-address "ssl:10.1.208.20:9642,ssl:10.1.208.21:9642,ssl:10.1.208.22:9642" \
        --nb-client-privkey /ovn-cert/tls.key \
        --nb-client-cert /ovn-cert/tls.crt \
        --nb-client-cacert /ovn-ca/ca-bundle.crt \
        --nb-cert-common-name "ovn" \
        --sb-client-privkey /ovn-cert/tls.key \
        --sb-client-cert /ovn-cert/tls.crt \
        --sb-client-cacert /ovn-ca/ca-bundle.crt \
        --sb-cert-common-name "ovn" \
        --config-file=/run/ovnkube-config/ovnkube.conf \
        --loglevel "${OVN_KUBE_LOG_LEVEL}" \
        --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
        ${gateway_mode_flags} \
        --metrics-bind-address "127.0.0.1:29103" \
        --metrics-enable-pprof \
        ${export_network_flows_flags} \
        ${gw_interface_flag}

    State:       Running
      Started:   Thu, 02 Dec 2021 08:32:53 -0500
    Last State:  Terminated
      Reason:    Error
      Message:    stderr: ""
I1202 13:32:30.332571  162643 ovs.go:206] exec(3036): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex1_openshift-master-1-to-br-int ofport
I1202 13:32:30.335837  162643 ovs.go:209] exec(3036): stdout: "2\n"
I1202 13:32:30.335844  162643 ovs.go:210] exec(3036): stderr: ""
I1202 13:32:30.335848  162643 gateway.go:264] Gateway is ready
I1202 13:32:30.335866  162643 ovs.go:206] exec(3037): /usr/bin/ovs-appctl --timeout=15 -t /var/run/ovn/ovn-controller.9999.ctl connection-status
I1202 13:32:30.338144  162643 ovs.go:209] exec(3037): stdout: "connected\n"
I1202 13:32:30.338151  162643 ovs.go:210] exec(3037): stderr: ""
I1202 13:32:30.338154  162643 node.go:239] Node connection status = connected
I1202 13:32:30.338158  162643 ovs.go:206] exec(3038): /usr/bin/ovs-vsctl --timeout=15 -- br-exists br-int
I1202 13:32:30.341647  162643 ovs.go:209] exec(3038): stdout: ""
I1202 13:32:30.341654  162643 ovs.go:210] exec(3038): stderr: ""
I1202 13:32:30.341657  162643 ovs.go:206] exec(3039): /usr/bin/ovs-ofctl dump-aggregate br-int
I1202 13:32:30.344256  162643 ovs.go:209] exec(3039): stdout: "NXST_AGGREGATE reply (xid=0x4): packet_count=3056957 byte_count=815706824 flow_count=1929\n"
I1202 13:32:30.344263  162643 ovs.go:210] exec(3039): stderr: ""
I1202 13:32:30.344272  162643 ovs.go:206] exec(3040): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex_openshift-master-1-to-br-int ofport
I1202 13:32:30.347453  162643 ovs.go:209] exec(3040): stdout: ""
I1202 13:32:30.347459  162643 ovs.go:210] exec(3040): stderr: ""
I1202 13:32:30.347464  162643 ovs.go:206] exec(3041): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex1_openshift-master-1-to-br-int ofport
I1202 13:32:30.350651  162643 ovs.go:209] exec(3041): stdout: "2\n"
I1202 13:32:30.350659  162643 ovs.go:210] exec(3041): stderr: ""
I1202 13:32:30.350661  162643 gateway.go:264] Gateway is ready
F1202 13:32:30.350680  162643 ovnkube.go:132] error waiting for node readiness: timed out waiting for the condition

      Exit Code:    1
      Started:      Thu, 02 Dec 2021 08:27:29 -0500
      Finished:     Thu, 02 Dec 2021 08:32:30 -0500
    Ready:          False
    Restart Count:  10
    Requests:
      cpu:      10m
      memory:   300Mi
    Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:          6443
      KUBERNETES_SERVICE_HOST:          api-int.kni-qe-4.lab.eng.rdu2.redhat.com
      OVN_CONTROLLER_INACTIVITY_PROBE:  180000
      OVN_KUBE_LOG_LEVEL:               4
      K8S_NODE:                          (v1:spec.nodeName)
    Mounts:
      /cni-bin-dir from host-cni-bin (rw)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-netd (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /etc/systemd/system from systemd-units (ro)
      /host from host-slash (ro)
      /ovn-ca from ovn-ca (rw)
      /ovn-cert from ovn-cert (rw)
      /run/netns from host-run-netns (ro)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
      /run/ovn/ from run-ovn (rw)
      /run/ovnkube-config/ from ovnkube-config (rw)
      /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vk2gn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  systemd-units:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/systemd/system
    HostPathType:
  host-slash:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:
  var-lib-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/openvswitch/data
    HostPathType:
  etc-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/openvswitch
    HostPathType:
  run-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  run-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/ovn
    HostPathType:
  node-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/ovn
    HostPathType:
  log-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-run-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /run/ovn-kubernetes
    HostPathType:
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/bin
    HostPathType:
  host-cni-netd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/multus/cni/net.d
    HostPathType:
  host-var-lib-cni-networks-ovn-kubernetes:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks/ovn-k8s-cni-overlay
    HostPathType:
  ovnkube-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovnkube-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  ovn-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovn-ca
    Optional:  false
  ovn-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-cert
    Optional:    false
  ovn-node-metrics-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-node-metrics-cert
    Optional:    true
  kube-api-access-vk2gn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason        Age                  From               Message
  ----     ------        ----                 ----               -------
  Normal   Scheduled     109m                 default-scheduler  Successfully assigned openshift-ovn-kubernetes/ovnkube-node-b6lmg to openshift-master-1
  Warning  FailedMount   109m                 kubelet            MountVolume.SetUp failed for volume "ovn-cert" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount   109m                 kubelet            MountVolume.SetUp failed for volume "ovn-ca" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount   109m                 kubelet            MountVolume.SetUp failed for volume "env-overrides" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling       109m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e"
  Normal   Pulled        109m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" in 7.482031219s
  Normal   Pulled        109m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Started       109m                 kubelet            Started container ovn-controller
  Normal   Created       109m                 kubelet            Created container ovn-controller
  Normal   Created       109m                 kubelet            Created container ovn-acl-logging
  Normal   Started       109m                 kubelet            Started container ovn-acl-logging
  Normal   Pulling       109m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc"
  Normal   Pulled        109m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc" in 519.80815ms
  Normal   Created       109m                 kubelet            Created container kube-rbac-proxy
  Normal   Started       109m                 kubelet            Started container kube-rbac-proxy
  Normal   Pulled        109m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Created       109m                 kubelet            Created container ovnkube-node
  Normal   Started       109m                 kubelet            Started container ovnkube-node
  Warning  Unhealthy     108m (x8 over 109m)  kubelet            Readiness probe failed:
  Warning  NodeNotReady  64m                  node-controller    Node is not ready
  Normal   Pulled        60m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Started       60m                  kubelet            Started container kube-rbac-proxy
  Normal   Created       60m                  kubelet            Created container ovn-controller
  Normal   Started       60m                  kubelet            Started container ovn-controller
  Normal   Pulled        60m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Normal   Created       60m                  kubelet            Created container ovn-acl-logging
  Normal   Started       60m                  kubelet            Started container ovn-acl-logging
  Normal   Created       60m                  kubelet            Created container kube-rbac-proxy
  Normal   Pulled        60m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da2149a0469fb553d83418d6ff5f4430fafab71c594deaf789376167cfeed9cc" already present on machine
  Normal   Created       60m                  kubelet            Created container ovnkube-node
  Normal   Started       60m                  kubelet            Started container ovnkube-node
  Normal   Pulled        50m (x3 over 60m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7c1537ce54b4edf652c6b210d6e9e6344161a77088833074f0356ef5ebbf5f1e" already present on machine
  Warning  Unhealthy     45m                  kubelet            Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of fa319ba7878442f588697436d695599a6435e231f3213623611f99a1b9200e22 is running failed: container process not found
  Warning  BackOff       15m (x47 over 50m)   kubelet            Back-off restarting failed container
  Warning  Unhealthy     1s (x616 over 60m)   kubelet            Readiness probe failed:

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.10.0-0.nightly-2021-12-02-033910


Steps to Reproduce:
-------------------
1. Deploy BM IPI cluster
2. Start OVN migration to 2nd interface
3. Observe status of ovnkube pods in openshift-ovn-kubernetes

Actual results:
---------------
OVN migration fails. From the machine-config-daemon pod on the machine:

I1202 12:36:19.033534   10130 daemon.go:872] systemctl --failed:
NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
I1202 12:36:19.033546   10130 daemon.go:608] Starting MachineConfigDaemon
I1202 12:36:19.033624   10130 daemon.go:615] Enabling Kubelet Healthz Monitor
I1202 12:36:48.954555   10130 trace.go:205] Trace[1002495192]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (02-Dec-2021 12:36:18.953) (total time: 30000ms):
Trace[1002495192]: [30.000891916s] [30.000891916s] END
E1202 12:36:48.954596   10130 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp>
I1202 12:36:48.954846   10130 trace.go:205] Trace[2118766229]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (02-Dec-2021 12:36:18.953) >
Trace[2118766229]: [30.001136739s] [30.001136739s] END



Expected results:
-----------------
OVN migration succeeds

Additional info:
----------------
Baremetal cluster: 3masters + 4 workers - baremetal network IPv4 - provisioning network IPv6

Comment 2 Federico Paolinelli 2021-12-02 16:27:16 UTC
I think there's something odd in ovs. Here are my findings:

ovnkube-node complains about the missing patch port for br-ex:

I1202 15:04:22.319087  434538 ovs.go:206] exec(2095): /usr/bin/ovs-vsctl --timeout=15 --if-exists get interface patch-br-ex_openshift-master-1-to-br-int ofport
I1202 15:04:22.322518  434538 ovs.go:209] exec(2095): stdout: ""
I1202 15:04:22.322524  434538 ovs.go:210] exec(2095): stderr: ""

There's a check that verifies the result is different from "":
https://github.com/openshift/ovn-kubernetes/blob/29b2532df414f4af3519a9da205146380a8dac55/go-controller/pkg/node/gateway.go#L258
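
For reference, that check boils down to polling OVS for the patch port's OpenFlow port number. A minimal shell approximation (an illustrative sketch, not the actual Go code; the node name openshift-master-1 is taken from the output above):

# Sketch of the gateway readiness condition checked by ovnkube-node:
# the patch port between the external bridge and br-int must have an
# OpenFlow port number assigned.
node=openshift-master-1
for bridge in br-ex br-ex1; do
  ofport=$(ovs-vsctl --timeout=15 --if-exists \
    get interface "patch-${bridge}_${node}-to-br-int" ofport)
  if [ -n "${ofport}" ]; then
    echo "${bridge}: patch port ready (ofport=${ofport})"
  else
    echo "${bridge}: patch port missing"   # br-ex hits this branch here
  fi
done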


The patch port is missing from br-ex (br-ex1 has its own):
[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list-ports br-ex1
bond0.113
patch-br-ex1_openshift-master-1-to-br-int
[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list-ports br-ex
bond0

The network_name=physnet option is set on the localnet port:

_uuid               : 9201aac4-c512-42dd-9e46-4746c34296e1
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex_openshift-master-1
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : 0
type                : localnet
up                  : false

The mappings are in place:

[root@openshift-master-1 ~]# /usr/bin/ovs-vsctl list Open_vSwitch .
_uuid               : 0a6ea54b-e583-42eb-8d63-a9df6ab56eb0
bridges             : [0beb018b-4939-4a26-964d-530f2e433570, 45c2d5cd-3945-4250-9297-7f51a50e91dd, 7ac2fb5e-dd11-4d71-bbc7-a79b3e28b919, b3cc26fc-e209-486d-a959-1de6d33575f7]
cur_cfg             : 266
datapath_types      : [netdev, system]
datapaths           : {system=b272b3d3-2e73-4ca8-9257-b446c03cf558}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 20.11.1"
external_ids        : {hostname=openshift-master-1, ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-enable-lflow-cache="true", ovn-encap-ip="10.1.208.21", ovn-encap-type=geneve, ovn-limit-lflow-cache-kb="1048576", ovn-monitor-all="true", ovn-openflow-probe-interval="180", ovn-remote="ssl:10.1.208.20:9642,ssl:10.1.208.21:9642,ssl:10.1.208.22:9642", ovn-remote-probe-interval="180000", rundir="/var/run/openvswitch", system-id="0d2b1da5-a235-463a-8055-c71272cacbf6"}
iface_types         : [bareudp, erspan, geneve, gre, gtpu, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 266
other_config        : {vlan-limit="0"}
ovs_version         : "2.15.2"
ssl                 : e37ab694-b1d6-4367-992b-e9e0b8f714b5
statistics          : {}
system_type         : rhcos
system_version      : "4.10"


This should trigger ovn-controller to create the patch port in ovs, afaik.
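
For every localnet logical switch port whose options:network_name matches an entry in the chassis' ovn-bridge-mappings, ovn-controller is supposed to create a patch-port pair between br-int and the mapped bridge. A hedged way to cross-check the two sides from the node, using only standard ovs/ovn CLIs and the port names from the output above:

# Sketch: compare the chassis bridge mappings against each localnet
# port's network_name. If they line up but the patch port is still
# absent from the mapped bridge, the localnet LSP itself is suspect.
ovs-vsctl --if-exists get Open_vSwitch . external_ids:ovn-bridge-mappings
# expected here: "physnet:br-ex,exgwphysnet:br-ex1"
for lsp in br-ex_openshift-master-1 br-ex1_openshift-master-1; do
  echo -n "${lsp}: network_name="
  ovn-nbctl --if-exists get Logical_Switch_Port "${lsp}" options:network_name
done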

Comment 4 Tim Rozet 2022-04-06 17:35:20 UTC
OVN SB chassis:
_uuid               : 6143a1d8-e278-41af-a60e-d0d2f697906b
encaps              : [9a7af96b-db0a-4fe1-9c23-4e70767b259d]
external_ids        : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="true", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname            : openshift-master-1
name                : "0d2b1da5-a235-463a-8055-c71272cacbf6"
nb_cfg              : 0
other_config        : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet:br-ex,exgwphysnet:br-ex1", ovn-chassis-mac-mappings="", ovn-cms-options="", ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="true", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones     : []
vtep_logical_switches: []


NB switch ports:
[root@fedora ~]# ovn-nbctl list logical_switch_port br-ex1_openshift-master-1
_uuid               : 12ff7c78-5607-4f26-85a1-be2ddf3b4ee8
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex1_openshift-master-1
options             : {network_name=exgwphysnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : []
type                : localnet
up                  : false
[root@fedora ~]# ovn-nbctl list logical_switch_port 9201aac4-c512-42dd-9e46-4746c34296e1
_uuid               : 9201aac4-c512-42dd-9e46-4746c34296e1
addresses           : [unknown]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {}
ha_chassis_group    : []
name                : br-ex_openshift-master-1
options             : {network_name=physnet}
parent_name         : []
port_security       : []
tag                 : []
tag_request         : 0
type                : localnet
up                  : false

Comment 5 Dumitru Ceara 2022-04-06 17:56:49 UTC
Per offline discussion, this is a bug on the ovn-kubernetes side, incorrectly assigning the same LSP to multiple logical switches.

Comment 6 Tim Rozet 2022-04-06 18:03:26 UTC
[root@fedora ~]# ovn-nbctl show exgw-ext_openshift-master-1
switch 4825e419-f614-4ff2-b1e7-3dd506fa386a (exgw-ext_openshift-master-1)
    port br-ex1_openshift-master-1
        type: localnet
        addresses: ["unknown"]
    port etor-GR_openshift-master-1
        type: router
        addresses: ["0c:42:a1:ee:7f:12"]
        router-port: rtoe-GR_openshift-master-1
[root@fedora ~]# ovn-nbctl show ext_openshift-master-1
switch f361acea-b980-416b-b518-5fe24c6f3413 (ext_openshift-master-1)
    port etor-GR_openshift-master-1
        type: router
        addresses: ["0c:42:a1:ee:7f:12"]
        router-port: rtoe-GR_openshift-master-1
    port br-ex_openshift-master-1
        type: localnet
        addresses: ["unknown"]


The same router port is attached to multiple switches; this is a regression due to the libovsdb changes.
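
A hedged way to spot this state directly in the NB database: walk "ovn-nbctl show" and flag any port name that appears under more than one logical switch (a sketch; each LSP should belong to exactly one switch):

ovn-nbctl show | awk '
  $1 == "switch" { sw = $3 }     # remember the current logical switch
  $1 == "router" { sw = "" }     # ignore ports listed under logical routers
  $1 == "port" && sw != "" { owners[$2] = owners[$2] " " sw }
  END {
    for (p in owners)
      if (split(owners[p], a, " ") > 1)
        print "duplicated:", p, "on" owners[p]
  }
'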

Comment 10 Ross Brattain 2022-05-26 13:20:26 UTC
Verified on 4.11.0-0.nightly-2022-05-20-213928, BM IPI dual-stack.

I see distinct router ports on the different switches:

exgw-rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com


sh-4.4# for f in $(ovn-nbctl show | awk '/ext_/ {gsub("[()]", "", $3) ; print $3}' | sort ) ; do ovn-nbctl show $f ; done
switch 131438be-352a-46f5-a4ce-c3e00ef26ed6 (exgw-ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port br-ex1_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
    port exgw-etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:12:c1:38"]
        router-port: exgw-rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
switch 904e96f7-9362-4737-a490-25b8997c647f (exgw-ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port exgw-etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:64:68:ed"]
        router-port: exgw-rtoe-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex1_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch ccc43ce6-2a4a-49d8-8fa1-37fee2c32a2d (exgw-ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port exgw-etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:3a:b8:e2"]
        router-port: exgw-rtoe-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex1_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch 4d4789d1-c0b5-44c9-aa2e-6e4c90dee061 (ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:12:c1:38"]
        router-port: rtoe-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
switch 68fa87b5-0a8b-4e4a-83d1-3aa1ac3c7be7 (ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port br-ex_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]
    port etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:64:68:ed"]
        router-port: rtoe-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
switch d2b74a48-bdbf-4ba6-b632-4d6222cdd6ab (ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
    port etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: router
        addresses: ["52:54:00:3a:b8:e2"]
        router-port: rtoe-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
    port br-ex_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
        type: localnet
        addresses: ["unknown"]



sh-4.4# for f in $(ovn-nbctl show | awk '/ext_/ {gsub("[()]", "", $3) ; print $3}' | sort ) ; do echo -e "\n$f" ; ovn-nbctl lsp-list $f ; echo ; done

exgw-ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
42b0d986-d7c8-4b73-8af5-800b747172ea (br-ex1_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
ebea449e-a3d9-4780-88a9-f6830488c22b (exgw-etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)


exgw-ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
94d6b35e-ce13-4874-b7e9-1e94f0543b25 (br-ex1_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
7cfd0622-06f6-4d57-982a-460e6e861175 (exgw-etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)


exgw-ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
aa1368a6-3267-4620-9dca-16d71bbcd594 (br-ex1_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
7c0d1f38-9b14-4e49-8afc-46da64e9385d (exgw-etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)


ext_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com
8f0e1d40-d1fe-408c-8dce-ac71532cb7b7 (br-ex_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)
31e0c30a-bdb5-4ba4-ab7e-cfabb7e83bd5 (etor-GR_master-0-0.rbrattai-o411e1db-0.qe.lab.redhat.com)


ext_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com
103d0e52-f3a7-4f41-a843-98e4db199d2a (br-ex_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)
62e3f17a-e051-46b4-a723-67d7605f3bae (etor-GR_master-0-1.rbrattai-o411e1db-0.qe.lab.redhat.com)


ext_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com
8c6441c6-b8a9-452f-aabf-448d65d78001 (br-ex_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)
887e2721-5399-49b7-a704-37e151e863ab (etor-GR_master-0-2.rbrattai-o411e1db-0.qe.lab.redhat.com)


I don't have an F5, so I can't test that.

Comment 12 errata-xmlrpc 2022-08-10 10:40:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069