rook: External Ceph Cluster stuck in "Connecting"

  • Bug Report

Expected behavior:

CephCluster state is “CONNECTED”

Actual behavior:

CephCluster state is “CONNECTING”

How to reproduce it (minimal and precise):

kubectl create -f common.yaml
kubectl create -f operator.yaml
kubectl create -f common-external.yaml
kubectl create -f cluster-external.yaml
bash cluster/examples/kubernetes/ceph/import-external-cluster.sh

File(s) to submit:

2019-09-24 02:09:00.811884 W | op-cluster: waiting for the connection info of the external cluster. retrying in 1m0s.
2019-09-24 02:10:00.816375 W | op-cluster: waiting for the connection info of the external cluster. retrying in 1m0s.
2019-09-24 02:11:00.819592 W | op-cluster: waiting for the connection info of the external cluster. retrying in 1m0s.
2019-09-24 02:12:00.823023 W | op-cluster: waiting for the connection info of the external cluster. retrying in 1m0s.

Environment:

  • OS: RancherOS
  • Kernel (e.g. uname -a): 4.14.85-rancher
  • Cloud provider or hardware configuration: vSphere Virtual Machines
  • Rook version (use rook version inside of a Rook Pod): v1.1.1
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 12.2.12-48.el7cp (26388d73d88602005946d4381cc5796d42904858) luminous (stable)
  • Kubernetes version (use kubectl version): v1.14.6
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Rancher managed
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

Other:

I have an external Ceph cluster and would like to dynamically provision RBD volumes for Pod PVCs via a StorageClass with Rook, using Ceph CSI.

I think I have missed a step somewhere, as I have been unable to get Rook to connect to an external Ceph cluster. I’m not running any internal Ceph cluster, so I followed the documentation from

Ceph External Cluster

which I interpreted as:

  • set the environment variables and run the shell script
  • verify that a ConfigMap and Secret have been created (see the check after this list)
  • inject common.yaml
  • inject operator.yaml
  • inject common-external.yaml
  • modify cluster-external.yaml, changing the namespace to “rook-ceph”
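
A quick way to verify that step (a hedged check; the resource names follow the Rook v1.1 external-cluster docs and may differ by version):

# both should exist after import-external-cluster.sh runs
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints
kubectl -n rook-ceph get secret rook-ceph-mon

If either is missing, the operator has no connection info to pick up.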

At this stage, the Pods were in a CrashLoopBackOff, and I had to perform the following.

In the operator YAML I modified:

# RancherOS requires a different directory
- name: FLEXVOLUME_DIR_PATH
  value: "/var/lib/kubelet/volumeplugins"

# RancherOS requires a specific value for the Kubelet
- name: ROOK_CSI_KUBELET_DIR_PATH
  value: "/opt/rke/var/lib/kubelet"

I also had to modify the daemonset, changing the flag to:

- --containerized=false

With this in place, the CephCluster state is stuck in “CONNECTING”.

I can see the following error when I look at the csi-rbdplugin container logs within the operator pod:

E0924 02:44:39.175178       1 utils.go:123] ID: 25 GRPC error: rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (rook-ceph): missing configuration for cluster ID (rook-ceph)

So where should the Ceph CSI RBD plugin be reading this information from?

If I look at the following documentation:

Ceph CSI

It seems to indicate that a ConfigMap with the clusterID and monitors is required. Is the Rook Operator supposed to create this from the information provided in the Secret and ConfigMap created by the bash script?
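
For reference, a sketch of what a populated rook-ceph-csi-config should hold, going by the Ceph CSI docs (monitor addresses are placeholders):

kubectl -n rook-ceph get cm rook-ceph-csi-config -o jsonpath='{.data.csi-cluster-config-json}'
# expected output, roughly:
# [{"clusterID":"rook-ceph","monitors":["<mon-ip-1>:6789","<mon-ip-2>:6789","<mon-ip-3>:6789"]}]

If that JSON array is empty, the RBD plugin cannot resolve clusterID “rook-ceph” to a monitor list, which matches the InvalidArgument error above.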

The samples given within the Ceph CSI repo indicate there is another ConfigMap

Ceph CSI sample configmap

I’ve definitely missed something here, but I can’t figure it out at this point.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 27 (11 by maintainers)

Most upvoted comments

Getting a similar issue here.

storage01 ~]# ceph -v
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)

Procedure followed:

kubectl create -f crds.yaml -f common.yaml -f operator.yaml
# on the ceph cluster, run:
./create-external-cluster-resources.sh
# export all the keys output by create-external-cluster-resources.sh on the kubectl CLI host:
export ROOK_EXTERNAL_USER_SECRET=<>
export ROOK_EXTERNAL_USERNAME=client.healthchecker
export CSI_RBD_NODE_SECRET_SECRET=<>
export CSI_RBD_PROVISIONER_SECRET=<>
export CSI_CEPHFS_NODE_SECRET=<>
export CSI_CEPHFS_PROVISIONER_SECRET=<>
export NAMESPACE="rook-ceph"
export ROOK_EXTERNAL_FSID="<>"
export ROOK_EXTERNAL_CEPH_MON_DATA="IP01:6789,IP02:6789,IP03:6789"
./import-external-cluster.sh
sed -i 's/name: rook-ceph-external/name: rook-ceph/' cluster-external.yaml
sed -i 's/namespace: rook-ceph-external/namespace: rook-ceph/' cluster-external.yaml
kubectl create -f cluster-external.yaml

Results:

control01 ~]# kubectl get CephCluster -A
NAMESPACE   NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE        MESSAGE                 HEALTH
rook-ceph   rook-ceph                                2d22h   Connecting   Cluster is connecting   
control01 ~]# kubectl logs -n rook-ceph -l app=rook-ceph-operator
2020-12-07 17:31:43.104445 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:32:43.107853 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:33:43.111118 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:34:43.113949 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:35:43.116510 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:36:43.119298 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:37:43.121864 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:38:43.124678 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:39:43.127947 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
2020-12-07 17:40:43.131406 W | op-mon: waiting for the csi connection info of the external cluster. retrying in 1m0s.
...
control01 ~]# kubectl logs -n rook-ceph csi-rbdplugin-9k4nm csi-rbdplugin
W1204 18:50:59.252835 1370737 driver.go:171] EnableGRPCMetrics is deprecated
[root@control01 ~]# kubectl logs -n rook-ceph csi-rbdplugin-provisioner-65f85f795c-dss2f csi-provisioner
I1204 18:50:59.379229       1 csi-provisioner.go:121] Version: v2.0.0
I1204 18:50:59.379283       1 csi-provisioner.go:135] Building kube configs for running in cluster...
I1204 18:50:59.384295       1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I1204 18:51:00.384920       1 common.go:111] Probing CSI driver for readiness
W1204 18:51:00.389090       1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I1204 18:51:00.395272       1 leaderelection.go:243] attempting to acquire leader lease  rook-ceph/rook-ceph-rbd-csi-ceph-com...
I1204 18:51:00.401884       1 leaderelection.go:253] successfully acquired lease rook-ceph/rook-ceph-rbd-csi-ceph-com
I1204 18:51:00.502320       1 clone_controller.go:66] Starting CloningProtection controller
I1204 18:51:00.502331       1 volume_store.go:97] Starting save volume queue
I1204 18:51:00.502242       1 controller.go:820] Starting provisioner controller rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65f85f795c-dss2f_d5fdc667-5dc6-4973-98a9-eb52386097a7!
I1204 18:51:00.502375       1 clone_controller.go:84] Started CloningProtection controller
I1204 18:51:00.602870       1 controller.go:869] Started provisioner controller rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-65f85f795c-dss2f_d5fdc667-5dc6-4973-98a9-eb52386097a7!
control01 ~]# kubectl -n rook-ceph get cm
NAME                        DATA   AGE
rook-ceph-csi-config        1      2d23h
rook-ceph-operator-config   6      2d23h
control01 ~]# kubectl -n rook-ceph get secret
NAME                                         TYPE                                  DATA   AGE
default-token-5qvz2                          kubernetes.io/service-account-token   3      2d23h
rook-ceph-admission-controller-token-fmfzc   kubernetes.io/service-account-token   3      2d23h
rook-ceph-cmd-reporter-token-dcwft           kubernetes.io/service-account-token   3      2d23h
rook-ceph-mgr-token-cgxcv                    kubernetes.io/service-account-token   3      2d23h
rook-ceph-osd-token-bxphp                    kubernetes.io/service-account-token   3      2d23h
rook-ceph-system-token-z959t                 kubernetes.io/service-account-token   3      2d23h
rook-csi-cephfs-plugin-sa-token-fb5xx        kubernetes.io/service-account-token   3      2d23h
rook-csi-cephfs-provisioner-sa-token-t6bpj   kubernetes.io/service-account-token   3      2d23h
rook-csi-rbd-plugin-sa-token-qrcrg           kubernetes.io/service-account-token   3      2d23h
rook-csi-rbd-provisioner-sa-token-zkn8x      kubernetes.io/service-account-token   3      2d23h
control01 ~]# cat < /dev/tcp/IP01/6789
ceph v027�
���x
��
control01 ~]# cat < /dev/tcp/IP02/6789
ceph v027�
���p
��
control01 ~]# cat < /dev/tcp/IP03/6789
ceph v027�
��˴
��
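
Not part of the original comment, but a hedged diagnostic that would narrow this down (resource names per the Rook external-cluster docs of that era): check whether import-external-cluster.sh actually created the expected resources.

# the CSI config should contain a non-empty JSON array
kubectl -n rook-ceph get cm rook-ceph-csi-config -o jsonpath='{.data.csi-cluster-config-json}'
# the import script is supposed to inject mon and CSI secrets
kubectl -n rook-ceph get secret | grep -E 'rook-ceph-mon|rook-csi'

Notably, the ConfigMap listing above lacks rook-ceph-mon-endpoints, and the Secret listing shows only service-account tokens, none of the rook-csi-* or rook-ceph-mon entries the import script is supposed to inject.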

I have got the same error.

Environment:

  • OS: RHCOS 4.1
  • Cloud provider or hardware configuration: Proxmox/KVM
  • Rook version: 1.1.1
  • Storage backend version: 12.2.12
  • Kubernetes version : v1.13.4
  • Kubernetes cluster type : Openshift 4.1
  • Storage backend status : HEALTH_OK

I just want to use Rook to connect to my external Ceph cluster, so there is no internal Ceph.

How to reproduce it (minimal and precise):

oc create -f common.yaml
oc create -f operator-openshift.yaml
export NAMESPACE=rook-ceph
export ROOK_EXTERNAL_FSID=92c7747d-240b-49e8-9182-d7d86340e987
export ROOK_EXTERNAL_ADMIN_SECRET=XXXXXXXXXXXXXXXXXXXXXXXXX==
export ROOK_EXTERNAL_CEPH_MON_DATA=a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
bash import-external-cluster.sh
oc create -f cluster-external.yaml
oc create -n rook-ceph -f csi/rbd/storageclass.yaml
oc create -n rook-ceph -f csi/rbd/pvc.yaml

cluster-external.yaml:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  external:
    enable: true
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: ceph/ceph:v12.2.11-20190830 # MUST match external cluster version

storageclass.yaml (I changed the pool name to ‘rbd’, which is the pool name of my external Ceph pool):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: rbd
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    # If you change this namespace, also change the namespace below where the secret namespaces are defined
    clusterID: rook-ceph

    # Ceph pool into which the RBD image shall be created
    pool: rbd

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
    imageFeatures: layering

    # The secrets contain Ceph admin credentials. These are generated automatically by the operator
    # in the same namespace as the cluster.
    csi.storage.k8s.io/provisioner-secret-name: rook-ceph-csi
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-ceph-csi
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`.
    csi.storage.k8s.io/fstype: ext4
    # uncomment the following to use rbd-nbd as mounter on supported nodes
    #mounter: rbd-nbd
reclaimPolicy: Delete
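
Since the operator log further down reports “created csi secret for cluster rook-ceph”, one sanity check (a hedged suggestion; the secret name is taken from that log line) is whether the secret referenced by the StorageClass exists:

# the StorageClass above points its provisioner/node-stage secrets at rook-ceph-csi
oc -n rook-ceph get secret rook-ceph-csi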

The csi-cluster-config-json in cm/rook-ceph-csi-config is empty.

# kubectl -n rook-ceph get cm -o yaml
apiVersion: v1
items:
- apiVersion: v1
  data:
    csi-cluster-config-json: '[]'
  kind: ConfigMap
  metadata:
    creationTimestamp: "2019-09-27T02:19:43Z"
    name: rook-ceph-csi-config
    namespace: rook-ceph
    resourceVersion: "3117688"
    selfLink: /api/v1/namespaces/rook-ceph/configmaps/rook-ceph-csi-config
    uid: 44db79d9-e0cd-11e9-bd3d-8266f8d305ca
- apiVersion: v1
  data:
    data: a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
    mapping: '{}'
    maxMonId: "2"
  kind: ConfigMap
  metadata:
    creationTimestamp: "2019-09-27T02:17:54Z"
    name: rook-ceph-mon-endpoints
    namespace: rook-ceph
    resourceVersion: "3117107"
    selfLink: /api/v1/namespaces/rook-ceph/configmaps/rook-ceph-mon-endpoints
    uid: 03dcdcc6-e0cd-11e9-90f3-a28e3205bd00
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Part of the error log:

# oc logs pod/csi-rbdplugin-provisioner-0 -c csi-provisioner
I0927 06:36:32.229037       1 controller.go:1196] provision "rook-ceph/rbd-pvc" class "rook-ceph-block": started
I0927 06:36:32.233617       1 controller.go:471] CreateVolumeRequest {Name:pvc-37944939-e0d1-11e9-90f3-a28e3205bd00 CapacityRange:required_bytes:1073741824  VolumeCapabilities:[mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[clusterID:rook-ceph csi.storage.k8s.io/node-stage-secret-name:rook-ceph-csi csi.storage.k8s.io/node-stage-secret-namespace:rook-ceph csi.storage.k8s.io/provisioner-secret-name:rook-ceph-csi csi.storage.k8s.io/provisioner-secret-namespace:rook-ceph imageFeatures:layering imageFormat:2 pool:rbd] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0927 06:36:32.233746       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"rbd-pvc", UID:"37944939-e0d1-11e9-90f3-a28e3205bd00", APIVersion:"v1", ResourceVersion:"3127259", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "rook-ceph/rbd-pvc"
I0927 06:36:32.237352       1 connection.go:180] GRPC call: /csi.v1.Controller/CreateVolume
I0927 06:36:32.237387       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-37944939-e0d1-11e9-90f3-a28e3205bd00","parameters":{"clusterID":"rook-ceph","imageFeatures":"layering","imageFormat":"2","pool":"rbd"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0927 06:36:32.244201       1 connection.go:183] GRPC response: {}
I0927 06:36:32.244848       1 connection.go:184] GRPC error: rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (rook-ceph): missing configuration for cluster ID (rook-ceph)
I0927 06:36:32.244954       1 controller.go:979] Final error received, removing PVC 37944939-e0d1-11e9-90f3-a28e3205bd00 from claims in progress
W0927 06:36:32.244978       1 controller.go:886] Retrying syncing claim "37944939-e0d1-11e9-90f3-a28e3205bd00", failure 54
E0927 06:36:32.245030       1 controller.go:908] error syncing claim "37944939-e0d1-11e9-90f3-a28e3205bd00": failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (rook-ceph): missing configuration for cluster ID (rook-ceph)
I0927 06:36:32.245117       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"rook-ceph", Name:"rbd-pvc", UID:"37944939-e0d1-11e9-90f3-a28e3205bd00", APIVersion:"v1", ResourceVersion:"3127259", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = failed to fetch monitor list using clusterID (rook-ceph): missing configuration for cluster ID (rook-ceph)

I found some errors in the ceph-operator log:

# oc log pod/rook-ceph-operator-b84844bd-nk6fr
2019-09-27 02:19:43.961566 I | op-cluster: starting cluster in namespace rook-ceph
2019-09-27 02:19:44.184075 I | operator: successfully started Ceph CSI driver(s)
2019-09-27 02:19:44.184125 I | op-cluster: CephCluster rook-ceph status: Connecting. 
2019-09-27 02:19:44.281851 I | op-mon: created csi secret for cluster rook-ceph
2019-09-27 02:19:44.299143 I | op-mon: parsing mon endpoints: a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
2019-09-27 02:19:44.299222 I | op-mon: loaded: maxMonID=2, mons=map[a:0xc001188f00 b:0xc001188f40 c:0xc001188f80], mapping=&{Node:map[]}
2019-09-27 02:19:44.299240 I | op-cluster: found the cluster info to connect to the external cluster. mons=map[b:0xc001188f40 c:0xc001188f80 a:0xc001188f00]
2019-09-27 02:19:44.299250 I | op-cluster: detecting the ceph image version for image ceph/ceph:v12.2.11-20190830...
2019-09-27 02:19:54.556734 I | op-cluster: Detected ceph image version: 12.2.11 <unknown version>
2019-09-27 02:19:54.574620 I | op-mon: created csi secret for cluster rook-ceph
2019-09-27 02:19:54.583662 I | op-mon: parsing mon endpoints: a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
2019-09-27 02:19:54.583735 I | op-mon: loaded: maxMonID=2, mons=map[a:0xc001747ee0 b:0xc001747f20 c:0xc001747f60], mapping=&{Node:map[]}
2019-09-27 02:19:54.584231 I | cephconfig: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2019-09-27 02:19:54.584423 I | cephconfig: generated admin config in /var/lib/rook/rook-ceph
2019-09-27 02:19:54.584651 I | exec: Running command: ceph version --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/458248653
2019-09-27 02:19:55.238011 E | op-cluster: failed to configure external ceph cluster. failed to detect and validate ceph version. failed to validate ceph version between external and local. unsupported ceph version 12.2.12 <unknown version>, need at least nautilus, delete your cluster CR and create a new one with a correct ceph version
2019-09-27 02:19:55.238106 I | op-cluster: Update event for uninitialized cluster rook-ceph. Initializing...
2019-09-27 02:19:55.238127 I | op-cluster: CephCluster rook-ceph status: Connecting. 
2019-09-27 02:19:55.268198 I | op-mon: created csi secret for cluster rook-ceph
2019-09-27 02:19:55.271797 I | op-mon: parsing mon endpoints: a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
2019-09-27 02:19:55.271851 I | op-mon: loaded: maxMonID=2, mons=map[c:0xc001691bc0 a:0xc001691b40 b:0xc001691b80], mapping=&{Node:map[]}
2019-09-27 02:19:55.271868 I | op-cluster: found the cluster info to connect to the external cluster. mons=map[a:0xc001691b40 b:0xc001691b80 c:0xc001691bc0]
2019-09-27 02:19:55.272037 I | op-cluster: detecting the ceph image version for image ceph/ceph:v12.2.11-20190830...
2019-09-27 02:19:55.291655 I | op-k8sutil: Removing previous job rook-ceph-detect-version to start a new one
2019-09-27 02:19:55.310412 I | op-k8sutil: batch job rook-ceph-detect-version still exists
2019-09-27 02:19:57.315039 I | op-k8sutil: batch job rook-ceph-detect-version deleted
2019-09-27 02:20:06.581720 I | op-cluster: Detected ceph image version: 12.2.11 <unknown version>
2019-09-27 02:20:06.617589 I | op-mon: created csi secret for cluster rook-ceph
2019-09-27 02:20:06.622756 I | op-mon: parsing mon endpoints: a=10.0.22.11:6789,b=10.0.22.12:6789,c=10.0.22.13:6789
2019-09-27 02:20:06.622851 I | op-mon: loaded: maxMonID=2, mons=map[a:0xc000fd77e0 b:0xc000fd7860 c:0xc000fd78e0], mapping=&{Node:map[]}
2019-09-27 02:20:06.623398 I | cephconfig: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2019-09-27 02:20:06.623597 I | cephconfig: generated admin config in /var/lib/rook/rook-ceph
2019-09-27 02:20:06.623763 I | exec: Running command: ceph version --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/636105928
2019-09-27 02:20:07.288383 E | op-cluster: failed to configure external ceph cluster. failed to detect and validate ceph version. failed to validate ceph version between external and local. unsupported ceph version 12.2.12 <unknown version>, need at least nautilus, delete your cluster CR and create a new one with a correct ceph version
W0927 02:26:21.720553       8 reflector.go:289] github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:169: watch of *v1.ConfigMap ended with: too old resource version: 3116809 (3119945)
W0927 02:36:19.727366       8 reflector.go:289] github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:169: watch of *v1.ConfigMap ended with: too old resource version: 3120105 (3123225)
W0927 02:44:28.735496       8 reflector.go:289] github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:169: watch of *v1.ConfigMap ended with: too old resource version: 3123398 (3125926)

Does this mean that Rook 1.1.1 doesn’t support my external Ceph Luminous cluster?

RHCOS nodes don’t support a StorageClass with the “kubernetes.io/rbd” provisioner, so Rook is my only solution.

I don’t know whether the failure is caused by a misconfiguration or because Rook doesn’t support Ceph 12.2.12.
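
For what it’s worth, the operator log above already states the answer (“need at least nautilus”). To double-check what version the operator detects, the rook-ceph-detect-version job named in that log can be inspected while it exists (a hedged example):

# the job is short-lived; the operator deletes and recreates it on each retry
oc -n rook-ceph logs job/rook-ceph-detect-version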

I should also add that I am able to use a StorageClass with the “kubernetes.io/rbd” provisioner successfully, so it would seem to be something I am missing on the Rook / Ceph CSI configuration side.