rook: Multi-Attach error
Is this a bug report or feature request?
- Bug Report: shutting down a node that runs a pod with a PV mount requires the PV to be re-mounted by the replacement pod; however, the re-mount takes a very long time (20 minutes at best; in other cases it is still stuck after several hours).
Expected behavior: the PV is re-mounted successfully on the new pod shortly after the node failure is detected.
Specifically, this procedure works fine on Rook 1.0.5 with Ceph 14.2.1 (FlexVolume driver), although we experienced other issues with that setup. On a cluster upgraded to 1.1.7/14.2.4 (CSI driver), after shutting down the node, the new pod takes a very long time to start because it is waiting to mount a volume that is still attached to the old node.
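As a starting point for diagnosis (a sketch only, not part of the original report), these commands show which node Kubernetes still believes holds the attachment and the state of the CSI pods that perform the attach/mount; the label selectors assume the default Rook 1.1 CSI deployment:

# Which node the cluster still considers the volume attached to
kubectl get volumeattachment -o wide | grep pvc-6508c028-36aa-11ea-8679-000d3aad2fb7
# Events for the pod that cannot start (namespace/pod names taken from this report)
kubectl -n adc-controller-application describe pod prometheus-0
# State of the Ceph CSI driver pods (label values assume the default Rook 1.1 CSI deployment)
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide
kubectl -n rook-ceph get pods -l app=csi-rbdplugin-provisioner -o wide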
How to reproduce it (minimal and precise):
Reproducing this exact cluster is not straightforward because many components are involved in its creation. In short, we create a 3-node cluster on Azure (non-AKS); the complete deployment also installs Istio, Postgres, EFK, and more.
File(s) to submit: event listing for the pod that fails to start:
Warning FailedAttachVolume 50m attachdetach-controller Multi-Attach error for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" Volume is already exclusively attached to one node and can't be attached to another
Normal Scheduled 50m default-scheduler Successfully assigned adc-controller-application/prometheus-0 to zmha122vm1
Normal SuccessfulAttachVolume 45m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7"
Warning FailedMount 43m kubelet, zmha122vm1 MountVolume.MountDevice failed for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 42m (x7 over 43m) kubelet, zmha122vm1 MountVolume.MountDevice failed for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-c88f4d72-36ab-11ea-8cca-0a580ae9400c already exists
Warning FailedMount 32m (x8 over 48m) kubelet, zmha122vm1 Unable to mount volumes for pod "prometheus-0_adc-controller-application(6437da13-3853-11ea-8c9c-000d3aad2fb7)": timeout expired waiting for volumes to attach or mount for pod "adc-controller-application"/"prometheus-0". list of unmounted volumes=[prometheus-data-volume]. list of unattached volumes=[prometheus-data-volume config scraping-files configmap-global-settings default-token-nxjnh istio-envoy sds-uds-path istio-token]
Normal Pulled 31m kubelet, zmha122vm1 Container image "alpine:3.9" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "docker.io/istio/proxyv2:1.4.2" already present on machine
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "prom/prometheus:v2.10.0" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "reg.radware.com:18443/adcc/adcc_stan_prometheus/dev:100" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
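For reference, this is the manual recovery we would attempt while the issue is investigated (a sketch only, assuming the old node is confirmed down; the pool and image names are placeholders guessed from the Volume ID in the events above):

# From the rook-ceph toolbox: check whether the RBD image still has a watcher
# registered by the node that was shut down (pool/image names are placeholders)
rbd status replicapool/csi-vol-c88f4d72-36ab-11ea-8cca-0a580ae9400c
# If the old node is confirmed gone, remove the stale VolumeAttachment so the
# attachdetach-controller can re-attach the volume to the new node
kubectl get volumeattachment | grep pvc-6508c028-36aa-11ea-8679-000d3aad2fb7
kubectl delete volumeattachment <name-from-previous-output>
# If the original pod is stuck in Terminating on the dead node, force-delete it
kubectl -n adc-controller-application delete pod prometheus-0 --grace-period=0 --force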
Environment:
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- Kernel (e.g. uname -a): Linux rook-ceph-operator-778bd6f4c9-5khqs 5.0.0-1028-azure #30~18.04.1-Ubuntu SMP Fri Dec 6 11:47:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Azure, non-managed (non-AKS)
- Rook version (use rook version inside of a Rook Pod): v1.1.7
- Storage backend version (e.g. for ceph do ceph -v): 14.2.4
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (6 by maintainers)
Some ideas are being discussed here: https://github.com/rook/rook/issues/1507#issuecomment-1122965274