Bug 2028532 - noobaa-pg-db-0 pod stuck in Init:0/2
Summary: noobaa-pg-db-0 pod stuck in Init:0/2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Fabio Bertinatto
QA Contact: Penghao Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-02 15:15 UTC by Shirisha S Rao
Modified: 2023-09-15 01:50 UTC (History)
11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:40:31 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:40:47 UTC)

Description Shirisha S Rao 2021-12-02 15:15:33 UTC
Description of problem (please be as detailed as possible and provide log snippets):


Version of all relevant components (if applicable):
OCS 4.8
OCP/ROKS 4.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, OCS is not installing successfully, which is blocking further development


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install the ocs-operator with internal-attached devices
2. Use local disk discovery


Actual results:
OCS operator stuck in installing phase

Expected results:
Successful installation of the OCS operator

Additional info:

Pods:
```
% oc get pods -n openshift-storage                               
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-25pwz                                            3/3     Running     0          33h
csi-cephfsplugin-lh89l                                            3/3     Running     0          33h
csi-cephfsplugin-nbx92                                            3/3     Running     0          33h
csi-cephfsplugin-provisioner-864dbdf9bc-cckc5                     6/6     Running     1          33h
csi-cephfsplugin-provisioner-864dbdf9bc-xdzrn                     6/6     Running     0          33h
csi-rbdplugin-9wjgm                                               3/3     Running     0          33h
csi-rbdplugin-lsp6x                                               3/3     Running     0          33h
csi-rbdplugin-provisioner-fcd5849c4-fvz6v                         6/6     Running     0          33h
csi-rbdplugin-provisioner-fcd5849c4-kd9b4                         6/6     Running     0          33h
csi-rbdplugin-zcfhs                                               3/3     Running     0          33h
noobaa-core-0                                                     1/1     Running     0          33h
noobaa-db-pg-0                                                    0/1     Init:0/2    0          33h
noobaa-operator-746c58fc5d-fzv9h                                  1/1     Running     0          33h
ocs-metrics-exporter-6658fb78c-7kxdp                              1/1     Running     0          33h
ocs-operator-798fb49965-p7mp5                                     0/1     Running     1          33h
rook-ceph-crashcollector-10.195.88.169-5b95bb9d8d-qfv9k           1/1     Running     0          33h
rook-ceph-crashcollector-10.210.42.158-598b97c5cb-td9k5           1/1     Running     0          33h
rook-ceph-crashcollector-10.63.5.229-7b48d6667d-cxgc5             1/1     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65bbc8d8ctrsm   2/2     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58974bcfcd8xc   2/2     Running     0          33h
rook-ceph-mgr-a-654b74b775-4cf89                                  2/2     Running     0          33h
rook-ceph-mon-a-7c4fcdf7b-p7hxq                                   2/2     Running     0          33h
rook-ceph-mon-b-6567948565-rth5f                                  2/2     Running     0          33h
rook-ceph-mon-c-57cdf85ddd-9zmsw                                  2/2     Running     0          33h
rook-ceph-operator-ccf658b8b-2wbql                                1/1     Running     0          33h
rook-ceph-osd-0-7669c6b84c-rlwpb                                  2/2     Running     0          33h
rook-ceph-osd-1-546f79757f-h7k9h                                  2/2     Running     0          33h
rook-ceph-osd-2-68c6c589b7-7xjjb                                  2/2     Running     0          33h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0g4lgq-sqg59           0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-1-data-06khvv-lkqfc           0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-2-data-02k6nw-lplzd           0/1     Completed   0          33h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d6bfcdcwwn6   2/2     Running     0          33h
```

Comment 3 Sahina Bose 2021-12-03 05:47:48 UTC
From Shirisha:
I tried a sample pod using the RBD storage class and it is stuck in ContainerCreating, so I think that's the issue.
The PVC is created and bound successfully, but the pod is not able to mount it.

@muagarwa can someone from CSI team look at it?
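
For anyone reproducing this, a minimal sketch of that kind of test, assuming the default OCS RBD storage class name ocs-storagecluster-ceph-rbd (substitute the RBD class actually in use); all object names here are illustrative. On an affected cluster the PVC binds but the pod stays in ContainerCreating:

```
# Hypothetical repro: an RBD-backed PVC plus a pod that mounts it.
# Storage class and object names are assumptions, not from this cluster.
cat <<'EOF' | oc apply -n openshift-storage -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test-pod
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rbd-test-pvc
EOF

# The PVC should reach Bound; on an affected cluster the pod hangs in ContainerCreating.
oc get pvc rbd-test-pvc -n openshift-storage
oc get pod rbd-test-pod -n openshift-storage
```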

Comment 7 Shirisha S Rao 2021-12-06 08:10:03 UTC
OCP (ROKS) must-gather:
https://drive.google.com/file/d/1J6i4nwFpVuIhjSxoVuc95p9OjAsxr8Z7/view?usp=sharing

Based on my discussion with @mrajanna, it appears to be an OCP issue, so I am attaching the OCP must-gather of the cluster.

Comment 9 Jan Safranek 2021-12-07 14:50:46 UTC
Pretty-printed panic:

recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:
goroutine 34051057 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.RecoverFromPanic(0xc0032a4ed8)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:158 +0xbd
panic(0x42372a0, 0x763d780)
	/opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/panic.go:971 +0x499
k8s.io/kubernetes/pkg/volume/csi.(*csiAttacher).Attach(0xc00186cfc0, 0xc002c8bab8, 0x7fffc5e34e96, 0xe, 0x0, 0x0, 0x1000, 0x7f8c771fa700)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/csi/csi_attacher.go:81 +0x1b8
k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*operationGenerator).GenerateAttachVolumeFunc.func1(0x0, 0x0, 0x0, 0x0, 0x0)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/operationexecutor/operation_generator.go:342 +0x1fa
k8s.io/kubernetes/pkg/volume/util/types.(*GeneratedOperations).Run(0xc003c72fc0, 0x419ff20, 0xc0000fa060, 0x0, 0x0)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/types/types.go:79 +0xf9
k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run.func1(0xc001b75780, 0xc0031aeab0, 0x86, 0x0, 0x0, 0x0, 0x0, 0xc003c72fc0, 0x0, 0x0, ...)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:183 +0xfa
created by k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:178 +0x479

Comment 10 Jan Safranek 2021-12-07 14:56:28 UTC
OCP version: 4.8.18

Comment 12 Sahina Bose 2021-12-08 13:47:08 UTC
Hi Fabio,
Any update on this? If this is a configuration-related issue, what config should the IBM team be looking for?
The IBM ROKS team informs us that this issue is seen only on their Classic environment (which uses local storage).

Comment 13 akgunjal@in.ibm.com 2021-12-10 14:13:25 UTC
Please provide an update on this issue, given the error snippet posted above about a CSI driver issue with Ceph. If there is a kubelet issue, please let us know what to check: a customer is blocked by this, and it is now a blocker since ODF 4.8 cannot be deployed on Classic infrastructure.

Comment 14 akgunjal@in.ibm.com 2021-12-10 14:26:09 UTC
Adding the events of the LocalVolume created by the local-storage-operator. Not sure if it's related, but adding it here in case it helps the investigation.

I see "was defined in devicePaths, but expected a path in /dev/", but I am not sure it is causing any issue, since the persistent volume is created and bound to a PVC. However, the issue is reproduced only when using a local SSD disk.

Events:
  Type     Reason                    Age   From                            Message
  ----     ------                    ----  ----                            -------
  Normal   FoundMatchingDisk         24m   localvolume-symlink-controller  10.63.5.229 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600062b206bb59c028c05ce0e8455ea4
  Normal   FoundMatchingDisk         24m   localvolume-symlink-controller  10.210.42.158 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600605b00e510a0028c088630a6513ed
  Normal   FoundMatchingDisk         24m   localvolume-symlink-controller  10.195.88.169 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600062b206a0928028c05c67e787f76e
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.195.88.169 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.195.88.169 -  was defined in devicePaths, but expected a path in /dev/
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.210.42.158 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.210.42.158 -  was defined in devicePaths, but expected a path in /dev/
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.63.5.229 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m   localvolume-symlink-controller  10.63.5.229 -  was defined in devicePaths, but expected a path in /dev/
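
In case the flapping /dev/sdc resolution above is a factor: kernel device names like /dev/sdc can change across reboots and rescans, and the usual local-storage-operator guidance is to use stable /dev/disk/by-id paths in devicePaths. A hedged sketch only, reusing the by-id values from the FoundMatchingDisk events above; the CR name, namespace, and storage class name are assumptions, and editing an existing LocalVolume in place has its own caveats:

```
# Sketch of a LocalVolume that references the disks by stable by-id paths
# instead of /dev/sdc. Name, namespace and storageClassName are illustrative.
cat <<'EOF' | oc apply -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: openshift-local-storage
spec:
  storageClassDevices:
  - storageClassName: localblock
    volumeMode: Block
    devicePaths:
    - /dev/disk/by-id/scsi-3600062b206bb59c028c05ce0e8455ea4
    - /dev/disk/by-id/scsi-3600605b00e510a0028c088630a6513ed
    - /dev/disk/by-id/scsi-3600062b206a0928028c05c67e787f76e
EOF
```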

Comment 15 Fabio Bertinatto 2021-12-10 15:40:00 UTC
The panic is happening in the kubelet (as per kubelet logs), inside this function: https://github.com/bertinatto/kubernetes/blob/1b0d83f1d63b4e392e6d7e1845ad1554e28ad1ed/pkg/volume/csi/csi_attacher.go#L69

func (c *csiAttacher) Attach(spec *volume.Spec, nodeName types.NodeName) (string, error) {

However, this function is supposed to run in the controller-manager, not in the kubelet.

The kubelet *may* be used for attaching volumes (the code is there), but it is legacy code and not how volumes are attached in OCP.

Comment 16 Fabio Bertinatto 2021-12-10 15:47:39 UTC
I can see in the kubelet logs that the option enableControllerAttachDetach is set to false. Can you change it to true and try again?
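
For a live cluster, one hedged way to confirm the effective value without node access is the kubelet's configz endpoint exposed through the API server proxy (the node name below is one of the workers from the events above, used as a placeholder):

```
# Inspect the running kubelet configuration of a node via the API server proxy.
NODE=10.63.5.229   # placeholder: any affected worker node name
oc get --raw "/api/v1/nodes/${NODE}/proxy/configz" | grep -o '"enableControllerAttachDetach":[^,}]*'

# The suggested change is enableControllerAttachDetach: true in the node's
# KubeletConfiguration (or the kubelet flag --enable-controller-attach-detach=true).
# How that is applied depends on how the platform (ROKS here) manages kubelet
# configuration, so it is not shown here.
```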

Comment 17 Fabio Bertinatto 2021-12-10 17:24:02 UTC
$ cat quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6f386b0e4757af4e3d07c8dbd7593cc5c1f4463656ff9f476fcde262b5c7ca79/host_service_logs/masters/kubelet_service.log | grep EnableControllerAttachDetach

Dec 03 07:29:00.241378 kube-c5rq0lpd0c3dbc69t070-myclusterda-myworke-00000598.iks.ibm kubelet.service[423333]: I1203 01:29:00.213467  423333 server.go:288] "KubeletConfiguration" configuration={TypeMeta:{Kind: APIVersion:} EnableServer:true StaticPodPath:/etc/kubernetes/manifests SyncFrequency:{Duration:1m0s} FileCheckFrequency:{Duration:5s} HTTPCheckFrequency:{Duration:20s} StaticPodURL: StaticPodURLHeader:map[] Address:0.0.0.0 Port:10250 ReadOnlyPort:0 VolumePluginDir:/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ProviderID:ibm://14ad7320a22647f5b720aa354f506a34///c5rq0lpd0c3dbc69t070/kube-c5rq0lpd0c3dbc69t070-myclusterda-myworke-00000598 TLSCertFile:/etc/kubernetes/cert/kubelet-serving.crt TLSPrivateKeyFile:/etc/kubernetes/cert/kubelet-serving-key.pem TLSCipherSuites:[TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256] TLSMinVersion: RotateCertificates:false ServerTLSBootstrap:false Authentication:{X509:{ClientCAFile:/etc/kubernetes/kubelet-ca.crt} Webhook:{Enabled:true CacheTTL:{Duration:2m0s}} Anonymous:{Enabled:false}} Authorization:{Mode:Webhook Webhook:{CacheAuthorizedTTL:{Duration:5m0s} CacheUnauthorizedTTL:{Duration:30s}}} RegistryPullQPS:5 RegistryBurst:10 EventRecordQPS:0 EventBurst:10 EnableDebuggingHandlers:true EnableContentionProfiling:false HealthzPort:10248 HealthzBindAddress:127.0.0.1 OOMScoreAdj:-999 ClusterDomain:cluster.local ClusterDNS:[172.21.0.10] StreamingConnectionIdleTimeout:{Duration:30m0s} NodeStatusUpdateFrequency:{Duration:10s} NodeStatusReportFrequency:{Duration:5m0s} NodeLeaseDurationSeconds:40 ImageMinimumGCAge:{Duration:2m0s} ImageGCHighThresholdPercent:85 ImageGCLowThresholdPercent:80 VolumeStatsAggPeriod:{Duration:1m0s} KubeletCgroups:/podruntime.slice SystemCgroups: CgroupRoot: CgroupsPerQOS:true CgroupDriver:systemd CPUManagerPolicy:none CPUManagerReconcilePeriod:{Duration:10s} MemoryManagerPolicy:None TopologyManagerPolicy:none TopologyManagerScope:container QOSReserved:map[] RuntimeRequestTimeout:{Duration:15m0s} HairpinMode:promiscuous-bridge MaxPods:250 PodCIDR: PodPidsLimit:229553 ResolverConfig:/etc/resolv.conf RunOnce:false CPUCFSQuota:true CPUCFSQuotaPeriod:{Duration:100ms} MaxOpenFiles:1000000 NodeStatusMaxImages:50 ContentType:application/vnd.kubernetes.protobuf KubeAPIQPS:20 KubeAPIBurst:40 SerializeImagePulls:true EvictionHard:map[imagefs.available:5% imagefs.inodesFree:5% memory.available:100Mi nodefs.available:5% nodefs.inodesFree:5%] EvictionSoft:map[imagefs.available:10% imagefs.inodesFree:10% memory.available:100Mi nodefs.available:10% nodefs.inodesFree:10%] EvictionSoftGracePeriod:map[imagefs.available:10m imagefs.inodesFree:10m memory.available:10m nodefs.available:10m nodefs.inodesFree:10m] EvictionPressureTransitionPeriod:{Duration:5m0s} EvictionMaxPodGracePeriod:0 EvictionMinimumReclaim:map[] PodsPerCore:0 EnableControllerAttachDetach:false ProtectKernelDefaults:false MakeIPTablesUtilChains:true IPTablesMasqueradeBit:14 IPTablesDropBit:15 FeatureGates:map[CRIContainerLogRotation:true DownwardAPIHugePages:true ExpandInUsePersistentVolumes:true LegacyNodeRoleBehavior:false RotateKubeletServerCertificate:true] FailSwapOn:true ContainerLogMaxSize:100Mi ContainerLogMaxFiles:3 ConfigMapAndSecretChangeDetectionStrategy:Watch AllowedUnsafeSysctls:[] KernelMemcgNotification:false SystemReserved:map[cpu:108m ephemeral-storage:26390503096 memory:3307Mi pid:12752] KubeReserved:map[cpu:72m ephemeral-storage:26390503096 
memory:2204Mi pid:12752] SystemReservedCgroup:/system.slice KubeReservedCgroup:/podruntime.slice EnforceNodeAllocatable:[pods] ReservedSystemCPUs: ShowHiddenMetricsForVersion: Logging:{Format:text Sanitization:false} EnableSystemLogHandler:true ShutdownGracePeriod:{Duration:0s} ShutdownGracePeriodCriticalPods:{Duration:0s} ReservedMemory:[] EnableProfilingHandler:true EnableDebugFlagsHandler:true}

Comment 18 akgunjal@in.ibm.com 2021-12-10 18:48:53 UTC
I had a discussion with Fabio and after changing the parameter EnableControllerAttachDetach to true as per the suggestion it fixed my issue. Thanks for the help. The issue can be closed now.

Comment 19 Fabio Bertinatto 2021-12-10 19:09:08 UTC
We might need to do something to prevent this from happening again in the future. A few options:

1. If attaching from the kubelet is *not* supported in OCP, we should document that.
2. If that's something we should support in OCP, we should fix the panic and introduce proper testing.
3. If that's something OCP *and* upstream k8s don't support, we should deprecate the behavior (in k8s) and remove the code.

I'll discuss this with the team and update this ticket accordingly.

Comment 20 akgunjal@in.ibm.com 2021-12-13 11:53:38 UTC
@Fabio: I have a query on this. enableControllerAttachDetach was always false, and it only failed with ODF 4.8.
Even now we have a 4.7 cluster where it works fine with the value set to false.

Can you explain why ODF works fine in 4.7 and older versions with the same configuration, while it failed in 4.8 and we had to change it to get it working?

Comment 21 Fabio Bertinatto 2021-12-13 19:34:57 UTC
@akgunjal, the code that is causing the panic is not present in OCP 4.7:

<snip>
$ git diff ocp/release-4.7..ocp/release-4.8 -- pkg/volume/csi/csi_attacher.go
(...)

-	var vaSrc storage.VolumeAttachmentSource
-	if spec.InlineVolumeSpecForCSIMigration {
-		// inline PV scenario - use PV spec to populate VA source.
-		// The volume spec will be populated by CSI translation API
-		// for inline volumes. This allows fields required by the CSI
-		// attacher such as AccessMode and MountOptions (in addition to
-		// fields in the CSI persistent volume source) to be populated
-		// as part of CSI translation for inline volumes.
-		vaSrc = storage.VolumeAttachmentSource{
-			InlineVolumeSpec: &spec.PersistentVolume.Spec,
-		}
-	} else {
-		// regular PV scenario - use PV name to populate VA source
-		pvName := spec.PersistentVolume.GetName()
-		vaSrc = storage.VolumeAttachmentSource{
-			PersistentVolumeName: &pvName,
-		}
+	attachment, err := c.plugin.volumeAttachmentLister.Get(attachID)
+	if err != nil && !apierrors.IsNotFound(err) {
+		return "", errors.New(log("failed to get volume attachment from lister: %v", err))
 	}

</snip>
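
A minimal standalone Go sketch (not the actual kubelet code) of the failure pattern this diff introduces: the VolumeAttachment lister is only wired up where the attach/detach controller runs, so when Attach() runs inside the kubelet (enableControllerAttachDetach=false) the lister is nil and the Get call hits the nil-pointer panic seen in comment #9.

```
package main

import "fmt"

// volumeAttachmentLister stands in for the real VolumeAttachment lister interface.
type volumeAttachmentLister interface {
	Get(name string) (string, error)
}

// csiPlugin stands in for the CSI volume plugin; its lister is populated only
// when the plugin is initialized with informers, i.e. in the controller-manager
// (attach/detach controller) path, never in the kubelet path.
type csiPlugin struct {
	lister volumeAttachmentLister
}

type csiAttacher struct {
	plugin *csiPlugin
}

// Attach mirrors the shape of the 4.8 code in the diff above: it dereferences
// the lister unconditionally, so a nil lister means a nil-pointer panic.
func (c *csiAttacher) Attach(attachID string) (string, error) {
	return c.plugin.lister.Get(attachID)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from panic:", r) // "invalid memory address or nil pointer dereference"
		}
	}()
	a := &csiAttacher{plugin: &csiPlugin{}} // lister left nil, as in the kubelet
	a.Attach("csi-123")
}
```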

Comment 22 Shirisha S Rao 2021-12-14 15:04:52 UTC
@fbertina 

Could we know the reason for this code change? As it didn't lead to any issues earlier

Comment 23 Sahina Bose 2021-12-16 07:22:55 UTC
(In reply to Shirisha S Rao from comment #22)
> @fbertina 
> 
> Could we know the reason for this code change? As it didn't lead to any
> issues earlier

I think this may have introduced it - https://github.com/kubernetes/kubernetes/pull/97834

Comment 24 Jan Safranek 2021-12-17 12:59:23 UTC
Since we know the root cause and it is basically a kubelet misconfiguration, can we lower the severity? It does not look urgent any longer.

Comment 25 Fabio Bertinatto 2021-12-27 14:22:09 UTC
(In reply to Sahina Bose from comment #23)
> (In reply to Shirisha S Rao from comment #22)
> > @fbertina 
> > 
> > Could we know the reason for this code change? As it didn't lead to any
> > issues earlier
> 
> I think this may have introduced it -
> https://github.com/kubernetes/kubernetes/pull/97834

It seems so, and the reason for the code change is stated there.

This change indeed broke the "attaching from the kubelet" use case, but that is something we don't use in OCP. The supported behavior is to use the Attach Detach Controller for that.

We do want to prevent this from happening in the future, so I'm leaving this ticket open to further discuss the options from comment #19.

I'm lowering the severity for the reason stated in comment #24. Feel free to change it if you disagree.
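
As a quick sanity check that attach/detach is now handled by the controller rather than the kubelet, the VolumeAttachment objects it creates for CSI volumes can be listed (the object name below is a placeholder):

```
# With enableControllerAttachDetach=true, the attach-detach controller (in
# kube-controller-manager) creates VolumeAttachment objects for CSI volumes.
oc get volumeattachments
oc describe volumeattachment <name>   # shows the attacher, node and PV involved
```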

Comment 29 Fabio Bertinatto 2022-05-30 14:28:19 UTC
Moving to MODIFIED because the kube v1.24 rebase (which contains the proposed fix [1]) landed.

[1] https://github.com/kubernetes/kubernetes/pull/107887

Comment 35 errata-xmlrpc 2022-08-10 10:40:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 36 Red Hat Bugzilla 2023-09-15 01:50:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

