rook: MountDevice failed for volume pvc-f631... An operation with the given Volume ID already exists
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: Kubernetes tries to attach the PVC to a pod and fails:
Normal SuccessfulAttachVolume 25m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d"
Warning FailedMount 23m kubelet, node3 MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 4m59s (x5 over 18m) kubelet, node3 Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[volume default-token-4dbg8]: timed out waiting for the condition
Warning FailedMount 2m41s (x5 over 23m) kubelet, node3 Unable to attach or mount volumes: unmounted volumes=[volume], unattached volumes=[default-token-4dbg8 volume]: timed out waiting for the condition
Warning FailedMount 32s (x18 over 23m) kubelet, node3 MountVolume.MountDevice failed for volume "pvc-f631ef53-35d6-438b-a496-d2ba77adb57d" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-3e7b0d61-5335-11ea-a0a0-3e8b30a597e0 already exists
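The error above is returned by the RBD CSI plugin on the node, so its logs on the affected node usually show which operation is stuck. A minimal debugging sketch, assuming the default rook-ceph namespace and the app=csi-rbdplugin label used by the Rook examples:

# Find the csi-rbdplugin pod scheduled on the failing node (node3 here)
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide | grep node3

# Grep its logs for the stuck volume ID (substitute the pod name from above)
kubectl -n rook-ceph logs <csi-rbdplugin-pod-on-node3> -c csi-rbdplugin | grep 0001-0009-rook-ceph-0000000000000001-3e7b0d61-5335-11ea-a0a0-3e8b30a597e0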
On other nodes in the cluster, the attach and mount work fine, as expected. How to reproduce it (minimal and precise):
Create an example cluster with an RBD CSI StorageClass. Create a PVC and a pod that attaches the PVC (see the sketch below). I think the issue lies somewhere in mismatched configuration, software, kernel modules, etc.
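A minimal reproduction sketch, assuming the rook-ceph-block StorageClass name from the Rook examples (adjust names and the target node for your cluster):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  nodeName: node3            # pin to the failing node to reproduce
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: volume
      mountPath: /data
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: test-pvc
EOF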
Environment (of the node trying to mount):
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Kernel (e.g. uname -a): Linux lb-173 4.15.0-88-generic #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Baremetal
- Rook version (use rook version inside of a Rook Pod):
rook: v1.1.2-44.g2c195d7
go: go1.11
- Storage backend version (e.g. for ceph do ceph -v): ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"archive", BuildDate:"2020-01-25T21:52:51Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:12:17Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Baremetal cluster with kubeadm
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
About this issue
- State: open
- Created 4 years ago
- Reactions: 8
- Comments: 79 (12 by maintainers)
I read the kubelet’s log and solved this problem.
There were some logs like the one below:
then
Problem solved!
Warning: That’s true!
rm -rf sometimes removes persistent data if that umount is not executed. By the way: my first intention was to describe the problem. Sorry for all…
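To make the warning concrete, a hypothetical cleanup sketch (the paths follow the Kubernetes 1.17-era CSI staging layout and the PV name is only an example): always unmount and verify before removing anything.

# Check whether the volume is still mounted anywhere on the node
mount | grep pvc-f631ef53-35d6-438b-a496-d2ba77adb57d

# Unmount the stale staging path first
umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d/globalmount

# Only remove the directory once it is no longer a mountpoint
mountpoint -q /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d/globalmount || rm -rf /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-f631ef53-35d6-438b-a496-d2ba77adb57d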
Any update on this issue?
It seems that users are having to disable host networking on the CSI pods to get CSI working. But as @Madhu-1 pointed out, “Running CSI daemonset pods on pod networking is not suggested as it’s having another issue.” @Madhu-1, would you mind linking to or clarifying that issue with pod networking for users?
What are the risks of not using host networking, so users might better make their own decision about the trade-offs when using CNI overlay/pod networking?
For users: I strongly suspect this is a networking issue in most cases, and we would like to collect more information so that we can include helpful guidance about it in a Rook “common issues” document.
Firstly, it seems that firewalls may play a part for some users: https://github.com/rook/rook/issues/4896#issuecomment-756152009
I don’t believe port conflicts are likely to cause this behavior, but I would encourage users to look into the possibility if it isn’t a firewall issue. Ceph mons use ports 6789 and 3300.
@NicolaiSchmid reported here (https://github.com/rook/rook/issues/4896#issuecomment-600649666) that their breaking node was physically separate from their working nodes. For this case, I suspect that the node may be unable to reach the Ceph mon services running in Kubernetes. (Is the firewall configured differently on that node?) So, for all users experiencing the issue: on every node, check whether the node is able to access Kubernetes Service IPs. Does this fail on the non-working nodes but succeed on the working ones?
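A quick check along those lines, assuming the default rook-ceph namespace and the app=rook-ceph-mon service label from the Rook examples; run it from a working node and from the failing node and compare the results:

# List the mon service ClusterIPs as Kubernetes sees them
kubectl -n rook-ceph get svc -l app=rook-ceph-mon

# From each node, test both mon ports (3300 = msgr2, 6789 = legacy msgr1)
nc -vz -w 5 <mon-service-clusterip> 3300
nc -vz -w 5 <mon-service-clusterip> 6789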
Any information you are able to give around this issue will be helpful for updating the documentation. And thanks to everyone who has suggested fixes and given us information about the issue.
Running CSI daemonset pods on pod networking is not suggested as it’s having another issue.
Hitting the same issue
Rook operator logs
csi-cephfsplugin-provisioner -c csi-provisioner logs
csi-cephfsplugin-provisioner -c csi-cephfsplugin logs
Ceph Status
PVC stuck in the pending state forever
Rook version: v1.4.3
Ceph version: v15.2.4-20200630
I think I am hitting this too.
It eventually mounts but takes forever to get there.
Hi, in my case the problem was… the firewall! To be more precise: the CSI plugin uses the v2 protocol on port TCP/3300 instead of the legacy protocol on TCP/6789. This took me a while to understand, since all the other clients were using the legacy protocol and working smoothly. I was not using Rook but an external Ceph cluster, and got the error… Revelation when I looked at the firewall logs 😃
I appear to have this issue on a newly deployed cluster after wiping it and configuring it with hostNetwork true. The deployed kubernetes cluster is created with kubespray. All 3 worker nodes appear to be able to reach kubernetes services, and volumes fail on each of them (daemonset with volume attached fails to deploy and each volume remains pending).
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/arm64"}
Using rook helm chart v1.8.3 with ceph image v1.8.3-31.g8fc67f7db
Let me know what debug logs or actions I can provide to help find out more about this issue.
Strange… I resolved this issue by restarting the MDS deployment:
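For reference, a sketch of what restarting the MDS deployments can look like, assuming the default rook-ceph namespace (the exact deployment names depend on the CephFilesystem name):

# List the MDS deployments, then restart them one by one
kubectl -n rook-ceph get deployments | grep mds
kubectl -n rook-ceph rollout restart deployment <rook-ceph-mds-deployment-name>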
I think I resolved the problem. Just stop and disable firewalld:
systemctl stop firewalld
systemctl disable firewalld
Now everything is fine. Hope this could be helpful for someone.
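If disabling the firewall outright is not acceptable, a narrower sketch (assuming firewalld) is to open only the Ceph mon ports mentioned earlier in the thread, 3300 (msgr2) and 6789 (legacy):

firewall-cmd --permanent --add-port=3300/tcp
firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --reload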
My kubelets are running as root, with /dev having
drwxr-xr-x. 17 root root 3700 Nov 25 22:34 dev
so this is not the issue in my case. Here's a related Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/60987
Maybe you need to configure the firewall for port 6789 on the host machine? The cephcsi daemonset runs with host networking.
Thank you @Madhu-1.
You're right, something is wrong.
Could you recommend another check?
Some additional information which may help: node3 is the only node where this issue occurs; node1 and node2 are working properly. node3 is at a different physical location.