Description of problem (please be as detailed as possible and provide log snippets):

Version of all relevant components (if applicable):
OCS 4.8
OCP/ROKS 4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, OCS is not getting installed successfully and is blocking further development.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the ocs-operator with internal-attached devices
2. Use local disk discovery

Actual results:
OCS operator stuck in the Installing phase

Expected results:
Successful installation of the OCS operator

Additional info:

Pods:
```
% oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-25pwz                                            3/3     Running     0          33h
csi-cephfsplugin-lh89l                                            3/3     Running     0          33h
csi-cephfsplugin-nbx92                                            3/3     Running     0          33h
csi-cephfsplugin-provisioner-864dbdf9bc-cckc5                     6/6     Running     1          33h
csi-cephfsplugin-provisioner-864dbdf9bc-xdzrn                     6/6     Running     0          33h
csi-rbdplugin-9wjgm                                               3/3     Running     0          33h
csi-rbdplugin-lsp6x                                               3/3     Running     0          33h
csi-rbdplugin-provisioner-fcd5849c4-fvz6v                         6/6     Running     0          33h
csi-rbdplugin-provisioner-fcd5849c4-kd9b4                         6/6     Running     0          33h
csi-rbdplugin-zcfhs                                               3/3     Running     0          33h
noobaa-core-0                                                     1/1     Running     0          33h
noobaa-db-pg-0                                                    0/1     Init:0/2    0          33h
noobaa-operator-746c58fc5d-fzv9h                                  1/1     Running     0          33h
ocs-metrics-exporter-6658fb78c-7kxdp                              1/1     Running     0          33h
ocs-operator-798fb49965-p7mp5                                     0/1     Running     1          33h
rook-ceph-crashcollector-10.195.88.169-5b95bb9d8d-qfv9k           1/1     Running     0          33h
rook-ceph-crashcollector-10.210.42.158-598b97c5cb-td9k5           1/1     Running     0          33h
rook-ceph-crashcollector-10.63.5.229-7b48d6667d-cxgc5             1/1     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65bbc8d8ctrsm   2/2     Running     0          33h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58974bcfcd8xc   2/2     Running     0          33h
rook-ceph-mgr-a-654b74b775-4cf89                                  2/2     Running     0          33h
rook-ceph-mon-a-7c4fcdf7b-p7hxq                                   2/2     Running     0          33h
rook-ceph-mon-b-6567948565-rth5f                                  2/2     Running     0          33h
rook-ceph-mon-c-57cdf85ddd-9zmsw                                  2/2     Running     0          33h
rook-ceph-operator-ccf658b8b-2wbql                                1/1     Running     0          33h
rook-ceph-osd-0-7669c6b84c-rlwpb                                  2/2     Running     0          33h
rook-ceph-osd-1-546f79757f-h7k9h                                  2/2     Running     0          33h
rook-ceph-osd-2-68c6c589b7-7xjjb                                  2/2     Running     0          33h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0g4lgq-sqg59           0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-1-data-06khvv-lkqfc           0/1     Completed   0          33h
rook-ceph-osd-prepare-ocs-deviceset-2-data-02k6nw-lplzd           0/1     Completed   0          33h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d6bfcdcwwn6   2/2     Running     0          33h
```
From Shirisha:
I tried a sample pod using the RBD storage class and it's stuck in ContainerCreating, so I think that's the issue. The PVC is created successfully and bound, but the pod is not able to mount it.
@muagarwa, can someone from the CSI team look at it?
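For anyone trying to reproduce the symptom, a reproducer along the lines Shirisha describes would presumably look like the sketch below. The object names and the storage class name are assumptions (not taken from the affected cluster), so substitute whatever RBD class the cluster actually exposes:

```yaml
# Hypothetical reproducer: an RBD-backed PVC plus a pod that mounts it.
# Names and the storage class are assumptions; adjust to the cluster's RBD class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-test-pod
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rbd-test-pvc
```

With the bug present, the PVC reports Bound while the pod stays in ContainerCreating, matching the behaviour described above.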
OCP (ROKS) must-gather: https://drive.google.com/file/d/1J6i4nwFpVuIhjSxoVuc95p9OjAsxr8Z7/view?usp=sharing

Based on my discussion with @mrajanna, this appears to be an OCP issue, so I am attaching the OCP must-gather of the cluster.
Pretty-printed panic:

recovered from panic "runtime error: invalid memory address or nil pointer dereference". (err=<nil>) Call stack:
goroutine 34051057 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.RecoverFromPanic(0xc0032a4ed8)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:158 +0xbd
panic(0x42372a0, 0x763d780)
	/opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/panic.go:971 +0x499
k8s.io/kubernetes/pkg/volume/csi.(*csiAttacher).Attach(0xc00186cfc0, 0xc002c8bab8, 0x7fffc5e34e96, 0xe, 0x0, 0x0, 0x1000, 0x7f8c771fa700)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/csi/csi_attacher.go:81 +0x1b8
k8s.io/kubernetes/pkg/volume/util/operationexecutor.(*operationGenerator).GenerateAttachVolumeFunc.func1(0x0, 0x0, 0x0, 0x0, 0x0)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/operationexecutor/operation_generator.go:342 +0x1fa
k8s.io/kubernetes/pkg/volume/util/types.(*GeneratedOperations).Run(0xc003c72fc0, 0x419ff20, 0xc0000fa060, 0x0, 0x0)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/types/types.go:79 +0xf9
k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run.func1(0xc001b75780, 0xc0031aeab0, 0x86, 0x0, 0x0, 0x0, 0x0, 0xc003c72fc0, 0x0, 0x0, ...)
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:183 +0xfa
created by k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations.(*nestedPendingOperations).Run
	/builddir/build/BUILD/openshift-4.8.0/_output/local/go/src/k8s.io/kubernetes/pkg/volume/util/nestedpendingoperations/nestedpendingoperations.go:178 +0x479
OCP version: 4.8.18
Hi Fabio, any update on this? If this is a configuration-related issue, what config should the IBM team be looking for? The IBM ROKS team informs us that this issue is seen only on their Classic environment (which uses local storage).
Please provide an update on this issue, given the error snippet posted about the CSI driver issue with Ceph. If there is a kubelet issue, let us know what to check: a customer is blocked by this, and it is now a blocker because ODF 4.8 cannot be deployed on Classic infrastructure.
Adding the events of the LocalVolume created by the local-storage-operator. Not sure if it's related, but adding it here in case it helps the investigation. I see "was defined in devicePaths, but expected a path in /dev/", but I'm not sure it is causing any issue, since the persistent volume is created and bound to a PVC. The issue is reproduced only when using a local SSD disk.

Events:
  Type     Reason                    Age  From                            Message
  ----     ------                    ---  ----                            -------
  Normal   FoundMatchingDisk         24m  localvolume-symlink-controller  10.63.5.229 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600062b206bb59c028c05ce0e8455ea4
  Normal   FoundMatchingDisk         24m  localvolume-symlink-controller  10.210.42.158 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600605b00e510a0028c088630a6513ed
  Normal   FoundMatchingDisk         24m  localvolume-symlink-controller  10.195.88.169 - found matching disk /dev/sdc with id /dev/disk/by-id/scsi-3600062b206a0928028c05c67e787f76e
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.195.88.169 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.195.88.169 - was defined in devicePaths, but expected a path in /dev/
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.210.42.158 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.210.42.158 - was defined in devicePaths, but expected a path in /dev/
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.63.5.229 - unable to find matching disk /dev/sdc
  Warning  ErrorFindingMatchingDisk  23m  localvolume-symlink-controller  10.63.5.229 - was defined in devicePaths, but expected a path in /dev/
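For context, the LocalVolume that produced these events presumably carries a devicePaths list; a minimal sketch of that shape is below. The names and device paths are illustrative only (not copied from the cluster). Stable /dev/disk/by-id/ paths are generally preferred over /dev/sdX names, which can change across reboots, though nothing here confirms that is what triggered the warnings:

```yaml
# Sketch of a LocalVolume CR of the kind the events above come from.
# Names and device paths are illustrative, not taken from the affected cluster.
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: openshift-local-storage
spec:
  storageClassDevices:
    - storageClassName: localblock
      volumeMode: Block
      devicePaths:
        # by-id symlinks are stable across reboots, unlike /dev/sdc
        - /dev/disk/by-id/scsi-3600062b206bb59c028c05ce0e8455ea4
```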
The panic is happening in the kubelet (per the kubelet logs), inside this function: https://github.com/bertinatto/kubernetes/blob/1b0d83f1d63b4e392e6d7e1845ad1554e28ad1ed/pkg/volume/csi/csi_attacher.go#L69

func (c *csiAttacher) Attach(spec *volume.Spec, nodeName types.NodeName) (string, error) {

However, this function is supposed to run in the controller-manager, not in the kubelet. The kubelet *may* be used for attaching volumes, and the code is there, but it is legacy code and not how we attach volumes in OCP.
I can see in the kubelet logs that the option enableControllerAttachDetach is set to false. Can you change it to true and try again?
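For reference, a minimal sketch of the relevant kubelet configuration stanza is below. How the kubelet config file is managed and rolled out on ROKS workers is outside the scope of this report, so treat this purely as an illustration of the field being discussed; the equivalent kubelet command-line flag is --enable-controller-attach-detach.

```yaml
# Fragment of a KubeletConfiguration showing only the field under discussion;
# everything else is omitted. Where this file lives and how it is applied on
# ROKS-managed workers is left to the node management tooling (an assumption).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# true  = the attach/detach controller in kube-controller-manager handles volume
#         attach/detach (the path OCP expects);
# false = the kubelet tries to do it itself (the path that panics here).
enableControllerAttachDetach: true
```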
$ cat quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6f386b0e4757af4e3d07c8dbd7593cc5c1f4463656ff9f476fcde262b5c7ca79/host_service_logs/masters/kubelet_service.log | grep EnableControllerAttachDetach Dec 03 07:29:00.241378 kube-c5rq0lpd0c3dbc69t070-myclusterda-myworke-00000598.iks.ibm kubelet.service[423333]: I1203 01:29:00.213467 423333 server.go:288] "KubeletConfiguration" configuration={TypeMeta:{Kind: APIVersion:} EnableServer:true StaticPodPath:/etc/kubernetes/manifests SyncFrequency:{Duration:1m0s} FileCheckFrequency:{Duration:5s} HTTPCheckFrequency:{Duration:20s} StaticPodURL: StaticPodURLHeader:map[] Address:0.0.0.0 Port:10250 ReadOnlyPort:0 VolumePluginDir:/usr/libexec/kubernetes/kubelet-plugins/volume/exec/ ProviderID:ibm://14ad7320a22647f5b720aa354f506a34///c5rq0lpd0c3dbc69t070/kube-c5rq0lpd0c3dbc69t070-myclusterda-myworke-00000598 TLSCertFile:/etc/kubernetes/cert/kubelet-serving.crt TLSPrivateKeyFile:/etc/kubernetes/cert/kubelet-serving-key.pem TLSCipherSuites:[TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256] TLSMinVersion: RotateCertificates:false ServerTLSBootstrap:false Authentication:{X509:{ClientCAFile:/etc/kubernetes/kubelet-ca.crt} Webhook:{Enabled:true CacheTTL:{Duration:2m0s}} Anonymous:{Enabled:false}} Authorization:{Mode:Webhook Webhook:{CacheAuthorizedTTL:{Duration:5m0s} CacheUnauthorizedTTL:{Duration:30s}}} RegistryPullQPS:5 RegistryBurst:10 EventRecordQPS:0 EventBurst:10 EnableDebuggingHandlers:true EnableContentionProfiling:false HealthzPort:10248 HealthzBindAddress:127.0.0.1 OOMScoreAdj:-999 ClusterDomain:cluster.local ClusterDNS:[172.21.0.10] StreamingConnectionIdleTimeout:{Duration:30m0s} NodeStatusUpdateFrequency:{Duration:10s} NodeStatusReportFrequency:{Duration:5m0s} NodeLeaseDurationSeconds:40 ImageMinimumGCAge:{Duration:2m0s} ImageGCHighThresholdPercent:85 ImageGCLowThresholdPercent:80 VolumeStatsAggPeriod:{Duration:1m0s} KubeletCgroups:/podruntime.slice SystemCgroups: CgroupRoot: CgroupsPerQOS:true CgroupDriver:systemd CPUManagerPolicy:none CPUManagerReconcilePeriod:{Duration:10s} MemoryManagerPolicy:None TopologyManagerPolicy:none TopologyManagerScope:container QOSReserved:map[] RuntimeRequestTimeout:{Duration:15m0s} HairpinMode:promiscuous-bridge MaxPods:250 PodCIDR: PodPidsLimit:229553 ResolverConfig:/etc/resolv.conf RunOnce:false CPUCFSQuota:true CPUCFSQuotaPeriod:{Duration:100ms} MaxOpenFiles:1000000 NodeStatusMaxImages:50 ContentType:application/vnd.kubernetes.protobuf KubeAPIQPS:20 KubeAPIBurst:40 SerializeImagePulls:true EvictionHard:map[imagefs.available:5% imagefs.inodesFree:5% memory.available:100Mi nodefs.available:5% nodefs.inodesFree:5%] EvictionSoft:map[imagefs.available:10% imagefs.inodesFree:10% memory.available:100Mi nodefs.available:10% nodefs.inodesFree:10%] EvictionSoftGracePeriod:map[imagefs.available:10m imagefs.inodesFree:10m memory.available:10m nodefs.available:10m nodefs.inodesFree:10m] EvictionPressureTransitionPeriod:{Duration:5m0s} EvictionMaxPodGracePeriod:0 EvictionMinimumReclaim:map[] PodsPerCore:0 EnableControllerAttachDetach:false ProtectKernelDefaults:false MakeIPTablesUtilChains:true IPTablesMasqueradeBit:14 IPTablesDropBit:15 FeatureGates:map[CRIContainerLogRotation:true DownwardAPIHugePages:true ExpandInUsePersistentVolumes:true LegacyNodeRoleBehavior:false RotateKubeletServerCertificate:true] FailSwapOn:true ContainerLogMaxSize:100Mi ContainerLogMaxFiles:3 
ConfigMapAndSecretChangeDetectionStrategy:Watch AllowedUnsafeSysctls:[] KernelMemcgNotification:false SystemReserved:map[cpu:108m ephemeral-storage:26390503096 memory:3307Mi pid:12752] KubeReserved:map[cpu:72m ephemeral-storage:26390503096 memory:2204Mi pid:12752] SystemReservedCgroup:/system.slice KubeReservedCgroup:/podruntime.slice EnforceNodeAllocatable:[pods] ReservedSystemCPUs: ShowHiddenMetricsForVersion: Logging:{Format:text Sanitization:false} EnableSystemLogHandler:true ShutdownGracePeriod:{Duration:0s} ShutdownGracePeriodCriticalPods:{Duration:0s} ReservedMemory:[] EnableProfilingHandler:true EnableDebugFlagsHandler:true}
I had a discussion with Fabio, and after changing the parameter EnableControllerAttachDetach to true as suggested, my issue was fixed. Thanks for the help. The issue can be closed now.
We might need to do something to prevent this from happening again in the future. A few options:

1. If attaching from the kubelet is *not* supported in OCP, we should document that.
2. If it is something we should support in OCP, we should fix the panic and introduce proper testing.
3. If it is something neither OCP *nor* upstream Kubernetes supports, we should deprecate the behavior (in Kubernetes) and remove the code.

I'll discuss this with the team and update this ticket accordingly.
@Fabio: I have a query on this. The point is that enableControllerAttachDetach was always false, and it only failed in ODF 4.8. Even now we have a 4.7 cluster where it works fine with the value set to false. Can you explain why ODF works fine in 4.7 and older versions with the same configuration, while it failed in 4.8 and we had to change the setting to get it working?
@akgunjal, the code that is causing the panic is not present in OCP 4.7:

<snip>
$ git diff ocp/release-4.7..ocp/release-4.8 -- pkg/volume/csi/csi_attacher.go
(...)
-	var vaSrc storage.VolumeAttachmentSource
-	if spec.InlineVolumeSpecForCSIMigration {
-		// inline PV scenario - use PV spec to populate VA source.
-		// The volume spec will be populated by CSI translation API
-		// for inline volumes. This allows fields required by the CSI
-		// attacher such as AccessMode and MountOptions (in addition to
-		// fields in the CSI persistent volume source) to be populated
-		// as part of CSI translation for inline volumes.
-		vaSrc = storage.VolumeAttachmentSource{
-			InlineVolumeSpec: &spec.PersistentVolume.Spec,
-		}
-	} else {
-		// regular PV scenario - use PV name to populate VA source
-		pvName := spec.PersistentVolume.GetName()
-		vaSrc = storage.VolumeAttachmentSource{
-			PersistentVolumeName: &pvName,
-		}
+	attachment, err := c.plugin.volumeAttachmentLister.Get(attachID)
+	if err != nil && !apierrors.IsNotFound(err) {
+		return "", errors.New(log("failed to get volume attachment from lister: %v", err))
	}
</snip>
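A plausible reading of that diff (an assumption based only on the snippet above, not a review of the full 4.8 tree): volumeAttachmentLister is only wired up where a VolumeAttachment informer is running, so when Attach() executes inside the kubelet the lister is nil and the Get() call dereferences it, which would match the panic at csi_attacher.go:81. A self-contained Go sketch of that failure mode, using stand-in types rather than the real kubelet ones:

```go
package main

import "fmt"

// attachmentLister is a stand-in for the VolumeAttachment lister referenced in the diff.
type attachmentLister interface {
	Get(name string) (string, error)
}

// csiPlugin mimics a plugin whose lister field is only populated when the
// corresponding informer is started (e.g. in the controller-manager).
type csiPlugin struct {
	volumeAttachmentLister attachmentLister // left nil on the kubelet-like path
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Prints: runtime error: invalid memory address or nil pointer dereference
			fmt.Println("recovered from panic:", r)
		}
	}()

	p := &csiPlugin{}                                 // lister never initialized
	_, _ = p.volumeAttachmentLister.Get("csi-123abc") // method call on a nil interface -> panic
}
```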
@fbertina Could we know the reason for this code change? It didn't lead to any issues earlier.
(In reply to Shirisha S Rao from comment #22)
> @fbertina
>
> Could we know the reason for this code change? As it didn't lead to any
> issues earlier

I think this may have introduced it: https://github.com/kubernetes/kubernetes/pull/97834
Since we know the root cause and it is basically a kubelet misconfiguration, can we lower the severity? It does not look urgent any longer.
(In reply to Sahina Bose from comment #23)
> (In reply to Shirisha S Rao from comment #22)
> > @fbertina
> >
> > Could we know the reason for this code change? As it didn't lead to any
> > issues earlier
>
> I think this may have introduced it -
> https://github.com/kubernetes/kubernetes/pull/97834

It seems so, and the reason for the code change is stated there. This change indeed broke the "attaching from the kubelet" use case, but that is something we don't use in OCP. The supported behavior is to use the attach/detach controller for that.

We do want to prevent this from happening in the future, so I'm leaving this ticket open to further discuss the options from comment #19.

I'm lowering the severity for the reason stated in comment #24. Feel free to change either if you disagree.
Moving to MODIFIED because the kube v1.24 rebase (which contains the proposed fix [1]) landed.

[1] https://github.com/kubernetes/kubernetes/pull/107887
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days