rook: CSI based RBD pods go into CrashLoopBackOff upon deployment

Description

The recent changes in the ceph/csi/rbd/templates folder and the addition of ceph-csi-config (PR#3271) appear to affect the rook CSI deployment. The CSI-based RBD pods are unable to reach Running state on a fresh deployment.

Steps performed for creating the setup:

  1. Cloned the rook repo

  2. Navigated to rook/cluster/examples/kubernetes/ceph

  3. Ran common.yaml: $ oc create -f common.yaml

  4. Applied the RBAC rules from the csi folder: $ oc apply -f csi/rbac/rbd/ and $ oc apply -f csi/rbac/cephfs/

  5. Created the operator pod: $ oc create -f operator-openshift-with-csi.yaml

  6. Checked the pods. The csi-rbdplugin-* pods are stuck in ContainerCreating state.

  7. Created the Ceph cluster: $ oc create -f cluster.yaml

  8. Re-checked the pod status. The csi-rbdplugin-* pods go into CrashLoopBackOff and eventually end up in an error state. (The full command sequence is consolidated below for reference.)
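
For reference, the sequence above consolidated in one place. The commands and file paths are exactly those from the steps; only the clone of the upstream rook repo is spelled out:

$ git clone https://github.com/rook/rook.git
$ cd rook/cluster/examples/kubernetes/ceph
$ oc create -f common.yaml
$ oc apply -f csi/rbac/rbd/
$ oc apply -f csi/rbac/cephfs/
$ oc create -f operator-openshift-with-csi.yaml
$ oc get pods -n rook-ceph    # csi-rbdplugin-* pods stuck in ContainerCreating
$ oc create -f cluster.yaml
$ oc get pods -n rook-ceph    # csi-rbdplugin-* pods go into CrashLoopBackOff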


Note: The same steps were tried on ocs-ci based setups as well, with the same result.

Until cluster.yaml is run to create the Ceph cluster, the csi-rbdplugin pods remain in ContainerCreating state:

$ oc get pods -n rook-ceph

csi-rbdplugin-f999c           0/2   ContainerCreating   0   53m   10.0.158.162   ip-10-0-158-162.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-pkzx7           0/2   ContainerCreating   0   53m   10.0.168.120   ip-10-0-168-120.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-provisioner-0   0/4   ContainerCreating   0   53m   <none>         ip-10-0-158-162.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-st5jp           0/2   ContainerCreating   0   53m   10.0.135.141   ip-10-0-135-141.us-east-2.compute.internal   <none>   <none>

Pods in error state:

csi-rbdplugin-f999c                           1/2     CrashLoopBackOff    4          66m
csi-rbdplugin-pkzx7                           1/2     CrashLoopBackOff    4          66m
csi-rbdplugin-provisioner-0                   3/4     CrashLoopBackOff    5          66m
csi-rbdplugin-st5jp                           1/2     CrashLoopBackOff    4          66m

oc describe output for one of the failing pods:

Events:
  Type     Reason       Age                   From                                                 Message
  ----     ------       ----                  ----                                                 -------
  Normal   Scheduled    45m                   default-scheduler                                    Successfully assigned rook-ceph/csi-rbdplugin-provisioner-0 to ip-10-0-158-162.us-east-2.compute.internal
  Warning  FailedMount  15m (x23 over 45m)    kubelet, ip-10-0-158-162.us-east-2.compute.internal  MountVolume.SetUp failed for volume "ceph-csi-config" : configmaps "rook-ceph-mon-endpoints" not found
  Warning  FailedMount  5m26s (x18 over 43m)  kubelet, ip-10-0-158-162.us-east-2.compute.internal  Unable to mount volumes for pod "csi-rbdplugin-provisioner-0_rook-ceph(49342068-90c4-11e9-906d-0ac7385058d4)": timeout expired waiting for volumes to attach or mount for pod "rook-ceph"/"csi-rbdplugin-provisioner-0". list of unmounted volumes=[ceph-csi-config]. list of unattached volumes=[host-dev host-rootfs host-sys lib-modules socket-dir ceph-csi-config rook-csi-rbd-provisioner-sa-token-htkg5]
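
To confirm which ConfigMap the ceph-csi-config volume actually resolves to (and therefore what the pod is waiting on), a jsonpath query against the pod spec can be used. This is only a diagnostic sketch, using the provisioner pod from the events above:

$ oc -n rook-ceph get pod csi-rbdplugin-provisioner-0 \
    -o jsonpath='{.spec.volumes[?(@.name=="ceph-csi-config")].configMap.name}'

Per the FailedMount event, this should print rook-ceph-mon-endpoints, which does not exist until cluster.yaml is applied.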

$ oc get cm -n rook-ceph

NAME                                                      DATA   AGE
local-device-ip-10-0-135-141.us-east-2.compute.internal   1      68m
local-device-ip-10-0-158-162.us-east-2.compute.internal   1      69m
local-device-ip-10-0-168-120.us-east-2.compute.internal   1      68m
rook-ceph-config                                          1      4m36s
rook-ceph-mon-endpoints                                   4      4m36s
rook-config-override                                      1      4m36s
rook-crush-config                                         1      3m39s
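
Once the ConfigMap does show up (as in the listing above) but the CSI pods are already in CrashLoopBackOff, it can help to inspect what the pods actually receive from it and why the container keeps restarting. Two diagnostic commands, with the caveat that the container name csi-rbdplugin inside the plugin pod is an assumption not confirmed from this issue:

$ oc -n rook-ceph get cm rook-ceph-mon-endpoints -o yaml
$ oc -n rook-ceph logs csi-rbdplugin-f999c -c csi-rbdplugin --previous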

$ oc get nodes

NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-19.us-east-2.compute.internal    Ready    master   66m   v1.13.4+cb455d664
ip-10-0-135-141.us-east-2.compute.internal   Ready    worker   62m   v1.13.4+cb455d664
ip-10-0-149-108.us-east-2.compute.internal   Ready    master   66m   v1.13.4+cb455d664
ip-10-0-158-162.us-east-2.compute.internal   Ready    worker   62m   v1.13.4+cb455d664
ip-10-0-168-120.us-east-2.compute.internal   Ready    worker   62m   v1.13.4+cb455d664
ip-10-0-173-77.us-east-2.compute.internal    Ready    master   66m   v1.13.4+cb455d664

oc get pods -n rook-ceph

NAME                                          READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-fj7w6                        2/2     Running            1          70m
csi-cephfsplugin-lsc7f                        2/2     Running            0          70m
csi-cephfsplugin-provisioner-0                3/3     Running            0          70m
csi-cephfsplugin-xz5wx                        2/2     Running            0          70m
csi-rbdplugin-f999c                           0/2     CrashLoopBackOff   8          70m
csi-rbdplugin-pkzx7                           0/2     CrashLoopBackOff   8          70m
csi-rbdplugin-provisioner-0                   2/4     CrashLoopBackOff   12         70m
csi-rbdplugin-st5jp                           0/2     CrashLoopBackOff   8          70m
rook-ceph-agent-b5wdl                         1/1     Running            0          70m
rook-ceph-agent-w62wh                         1/1     Running            0          70m
rook-ceph-agent-zbjkg                         1/1     Running            0          70m
rook-ceph-mgr-a-7c9b5494dd-8lpbw              1/1     Running            0          4m43s
rook-ceph-mon-a-84bb86696c-t4hx4              1/1     Running            0          5m41s
rook-ceph-mon-b-79db689db5-wsmwt              1/1     Running            0          5m29s
rook-ceph-mon-c-5c995b74bb-5jrp9              1/1     Running            0          5m6s
rook-ceph-operator-5b6856f864-bm2rq           1/1     Running            0          71m
rook-ceph-osd-0-7d59d5b7b5-bghhd              1/1     Running            0          3m24s
rook-ceph-osd-1-677fdb8f9f-q77v5              1/1     Running            0          3m19s
rook-ceph-osd-2-78f5f56fd5-lfjgz              1/1     Running            0          3m14s
rook-ceph-osd-prepare-ip-10-0-135-141-8cj6b   0/2     Completed          0          4m12s
rook-ceph-osd-prepare-ip-10-0-158-162-c5qz9   0/2     Completed          0          4m12s
rook-ceph-osd-prepare-ip-10-0-168-120-5vspd   0/2     Completed          0          4m12s
rook-discover-c7qbb                           1/1     Running            0          70m
rook-discover-dg4ck                           1/1     Running            0          70m
rook-discover-g75qd                           1/1     Running            0          70m

Deviation from expected behavior: With the recent changes, the csi-rbdplugin pods fail to come up on a fresh deployment; the changes should not affect their deployment.

Expected behavior: All csi-rbdplugin pods should come up in Running state.

Environment:

  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 410.8.20190520.0 (Ootpa)

  • Kernel (e.g. uname -a): 4.18.0-80.1.2.el8_0.x86_64

  • Cloud provider or hardware configuration: AWS

  • Rook version (use rook version inside of a Rook Pod): rook: v1.0.0-122.g488fe64

  • Openshift version (use kubectl version):

Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0-201905091432+4910781-dirty", GitCommit:"4910781", GitTreeState:"dirty", BuildDate:"2019-05-09T19:19:42Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+838b4fa", GitCommit:"838b4fa", GitTreeState:"clean", BuildDate:"2019-05-19T23:51:04Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Openshift

oc get clusterversion

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0     True        False         83m     Cluster version is 4.1.0

  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): ceph health returns HEALTH_OK

Most upvoted comments

@j-griffith hope your issue is resolved and we can close this one

closing this as it's fixed

The issue seems to be due to the missing ceph-csi-config configmap when bringing up the Ceph-CSI related pods. @phlogistonjohn, could the step that runs oc create -f operator-openshift-with-csi.yaml create an empty config file so that the pods can start successfully? Is that feasible?
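
A rough sketch of what such an empty placeholder could look like is below. The names here are assumptions: the comment refers to a ceph-csi-config ConfigMap, while the events above show the volume resolving to rook-ceph-mon-endpoints, so the actual ConfigMap name and key depend on the csi templates at the revision in use; the csi-cluster-config-json key with an empty JSON list follows the ceph-csi cluster-config convention and is not confirmed from this issue.

$ cat <<EOF | oc -n rook-ceph create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  # Assumed name; it must match whatever ConfigMap the csi templates mount
  # as the ceph-csi-config volume.
  name: ceph-csi-config
data:
  # Assumed key; an empty cluster list would let the CSI pods start before a
  # CephCluster (and its mon endpoints) exists.
  csi-cluster-config-json: '[]'
EOF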