rook: MountDevice failed for volume pvc-f631... An operation with the given Volume ID already exists
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: Kubernetes tries to attach the PVC to a pod and fails:
Normal SuccessfulAttachVolume 25m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d"
Warning FailedMount 23m kubelet, node3 MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 4m59s (x5 over 18m) kubelet, node3 Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[volume default-token-4dbg8]: timed out waiting for the condition
Warning FailedMount 2m41s (x5 over 23m) kubelet, node3 Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[default-token-4dbg8 volume]: timed out waiting for the condition
Warning FailedMount 32s (x18 over 23m) kubelet, node3 MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-3e7b0d61-5335-11ea-a0a0-3e8b30a597e0 already exists
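The error above is returned by the RBD CSI plugin on the node, so its logs on the affected node usually show which operation is stuck. A minimal debugging sketch, assuming the default rook-ceph namespace and the app=csi-rbdplugin label used by the Rook examples:

# Find the csi-rbdplugin pod scheduled on the failing node (node3 here)
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide | grep node3

# Grep its logs for the stuck volume ID (substitute the pod name from above)
kubectl -n rook-ceph logs <csi-rbdplugin-pod-on-node3> -c csi-rbdplugin | grep 0001-0009-rook-ceph-0000000000000001-3e7b0d61-5335-11ea-a0a0-3e8b30a597e0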
On other nodes in the cluster, the attach and mount work fine, as expected. How to reproduce it (minimal and precise):
Create an example cluster with an RBD CSI StorageClass. Create a PVC and a pod that attaches the PVC (see the sketch below). I think the issue lies somewhere in mismatched configuration, software, kernel modules, etc.
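A minimal reproduction sketch, assuming the rook-ceph-block StorageClass name from the Rook examples (adjust names and the target node for your cluster):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  nodeName: node3            # pin to the failing node to reproduce
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: volume
      mountPath: /data
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: test-pvc
EOF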
Environment (of the node trying to mount):
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Kernel (e.g. uname -a): Linux lb-173 4.15.0-88-generic #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Baremetal
- Rook version (use rook version inside of a Rook Pod):
rook: v1.1.2-44.g2c195d7
go: go1.11
- Storage backend version (e.g. for ceph do ceph -v): ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"archive", BuildDate:"2020-01-25T21:52:51Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Baremetal cluster with kubeadm
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: open
- Created 4 years ago
- Reactions: 8
- Comments: 79 (12 by maintainers)
I read the kubelet’s log and solved this problem.
There were some logs like the one below:
then
Problem solved!
Warning: That’s true!
rm -rf sometimes removes persistent data if that umount is not executed. By the way: my first intention was to describe the problem. Sorry for all…
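To make the warning concrete, a hypothetical cleanup sketch (the paths follow the Kubernetes 1.17-era CSI staging layout and the PV name is only an example): always unmount and verify before removing anything.

# Check whether the volume is still mounted anywhere on the node
mount | grep pvc-f631ef53-35d6-438b-a496-d2ba77adb57d

# Unmount the stale staging path first
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d/globalmount

# Only remove the directory once it is no longer a mountpoint
mountpoint -q /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d/globalmount || rm -rf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d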
Any update on this issue?
It seems that users are having to disable host networking on the CSI pods to get CSI working. But as @Madhu-1 pointed out, “Running CSI daemonset pods on pod networking is not suggested as it’s having another issue.” @Madhu-1, would you mind linking to or clarifying that issue with pod networking for users?
What are the risks of not using host networking, so users might better make their own decision about the trade-offs when using CNI overlay/pod networking?
For users: I strongly suspect this is a networking issue in most cases, and we would like to collect more information so that we can include helpful guidance about it in a Rook “common issues” document.
Firstly, it seems that firewalls may play a part for some users: https://github.com/rook/rook/issues/4896#issuecomment-756152009
I don’t believe port conflicts are likely to cause this behavior, but I would encourage users to look into the possibility if it isn’t a firewall issue. Ceph mons use ports 6789 and 3300.
@NicolaiSchmid reported here (https://github.com/rook/rook/issues/4896#issuecomment-600649666) that their breaking node was physically separate from their working nodes. For this case, I suspect that the node may be unable to reach the Ceph mon services running in Kubernetes. (Is the firewall configured differently on that node?) So, for all users experiencing the issue: on every node, check whether the node is able to access Kubernetes Service IPs. Does this fail on the non-working nodes but succeed on the working ones?
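A quick check along those lines, assuming the default rook-ceph namespace and the app=rook-ceph-mon service label from the Rook examples; run it from a working node and from the failing node and compare the results:

# List the mon service ClusterIPs as Kubernetes sees them
kubectl -n rook-ceph get svc -l app=rook-ceph-mon

# From each node, test both mon ports (3300 = msgr2, 6789 = legacy msgr1)
nc -vz -w 5 <mon-service-clusterip> 3300
nc -vz -w 5 <mon-service-clusterip> 6789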
Any information you are able to give around this issue will be helpful for updating the documentation. And thanks to everyone who has suggested fixes and given us information about the issue.
Running CSI daemonset pods on pod networking is not suggested as it’s having another issue.
Hitting the same issue
Rook operator logs
csi-cephfsplugin-provisioner -c csi-provisioner logs
csi-cephfsplugin-provisioner -c csi-cephfsplugin logs
Ceph Status
PVC stuck in the pending state forever
Rook version: v1.4.3
Ceph version: v15.2.4-20200630
I think I am hitting this too.
It eventually mounts but takes forever to get there.
Hi, in my case the problem was… the firewall! To be more precise: the CSI plugin uses the v2 protocol on port TCP/3300 instead of the legacy protocol on TCP/6789. This took me a while to understand, since all the other clients were using the legacy protocol and working smoothly. I was not using Rook but an external Ceph cluster, and got the error… Revelation when I looked at the firewall logs 😃
I appear to have this issue on a newly deployed cluster after wiping it and configuring it with hostNetwork true. The deployed kubernetes cluster is created with kubespray. All 3 worker nodes appear to be able to reach kubernetes services, and volumes fail on each of them (daemonset with volume attached fails to deploy and each volume remains pending).
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/arm64"}
Using rook helm chart v1.8.3 with ceph image v1.8.3-31.g8fc67f7db
Let me know what debug logs or actions I can provide to help find out more about this issue.
Strange… I resolved this issue by restarting the MDS deployment:
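For reference, a sketch of what restarting the MDS deployments can look like, assuming the default rook-ceph namespace (the exact deployment names depend on the CephFilesystem name):

# List the MDS deployments, then restart them one by one
kubectl -n rook-ceph get deployments | grep mds
kubectl -n rook-ceph rollout restart deployment <rook-ceph-mds-deployment-name>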
I think I resolved the problem. Just stop and disable firewalld:
systemctl stop firewalld
systemctl disable firewalld
Now everything is fine. Hope this could be helpful for someone.
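If disabling the firewall outright is not acceptable, a narrower sketch (assuming firewalld) is to open only the Ceph mon ports mentioned earlier in the thread, 3300 (msgr2) and 6789 (legacy):

firewall-cmd --permanent --add-port=3300/tcp
firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --reload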
My kubelets are running as root, with /dev having
drwxr-xr-x. 17 root root 3700 Nov 25 22:34 dev
so this is not the issue in my case. Here's a related Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/60987
Maybe you need to configure the firewall for port 6789 on the host machine? The cephcsi daemonset runs with host networking.
Thank you @Madhu-1.
You're right, something is wrong.
Could you recommend another check?
Some additional information which may help: node3 is the only node where this issue occurs; node1 and node2 are working properly. node3 is at a different physical location.