rook: Multi-Attach error
Is this a bug report or feature request?
- Bug Report: shutting down a node that runs a pod with a PV mount requires the PV to be re-mounted by the replacement pod; however, the re-mount takes a very long time (20 minutes at best; in other cases it is still stuck after several hours).
Expected behavior: the PV is re-mounted successfully on the new pod shortly after the node failure is detected.
Specifically, this procedure works fine on Rook 1.0.5 with Ceph 14.2.1 (FlexVolume driver), although we experienced other issues with that setup. On a cluster upgraded to 1.1.7/14.2.4 (CSI driver), after shutting down the node, the new pod takes a very long time to start because it is waiting to mount a volume that is still attached to the old node.
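As a starting point for diagnosis (a sketch only, not part of the original report), these commands show which node Kubernetes still believes holds the attachment and the state of the CSI pods that perform the attach/mount; the label selectors assume the default Rook 1.1 CSI deployment:

# Which node the cluster still considers the volume attached to
kubectl get volumeattachment -o wide | grep pvc-6508c028-36aa-11ea-8679-000d3aad2fb7
# Events for the pod that cannot start (namespace/pod names taken from this report)
kubectl -n adc-controller-application describe pod prometheus-0
# State of the Ceph CSI driver pods (label values assume the default Rook 1.1 CSI deployment)
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide
kubectl -n rook-ceph get pods -l app=csi-rbdplugin-provisioner -o wide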
How to reproduce it (minimal and precise):
Reproducing this exact cluster is not straightforward because many components are involved in its creation. In short, we create a 3-node cluster on Azure (non-AKS); the complete deployment also installs Istio, Postgres, EFK, and more.
File(s) to submit: event listing for the pod that fails to start:
Warning FailedAttachVolume 50m attachdetach-controller Multi-Attach error for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" Volume is already exclusively attached to one node and can't be attached to another
Normal Scheduled 50m default-scheduler Successfully assigned adc-controller-application/prometheus-0 to zmha122vm1
Normal SuccessfulAttachVolume 45m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7"
Warning FailedMount 43m kubelet, zmha122vm1 MountVolume.MountDevice failed for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 42m (x7 over 43m) kubelet, zmha122vm1 MountVolume.MountDevice failed for volume "pvc-6508c028-36aa-11ea-8679-000d3aad2fb7" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-c88f4d72-36ab-11ea-8cca-0a580ae9400c already exists
Warning FailedMount 32m (x8 over 48m) kubelet, zmha122vm1 Unable to mount volumes for pod "prometheus-0_adc-controller-application(6437da13-3853-11ea-8c9c-000d3aad2fb7)": timeout expired waiting for volumes to attach or mount for pod "adc-controller-application"/"prometheus-0". list of unmounted volumes=[prometheus-data-volume]. list of unattached volumes=[prometheus-data-volume config scraping-files configmap-global-settings default-token-nxjnh istio-envoy sds-uds-path istio-token]
Normal Pulled 31m kubelet, zmha122vm1 Container image "alpine:3.9" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "docker.io/istio/proxyv2:1.4.2" already present on machine
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "prom/prometheus:v2.10.0" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
Normal Pulled 31m kubelet, zmha122vm1 Container image "reg.radware.com:18443/adcc/adcc_stan_prometheus/dev:100" already present on machine
Normal Created 31m kubelet, zmha122vm1 Created container
Normal Started 31m kubelet, zmha122vm1 Started container
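For reference, this is the manual recovery we would attempt while the issue is investigated (a sketch only, assuming the old node is confirmed down; the pool and image names are placeholders guessed from the Volume ID in the events above):

# From the rook-ceph toolbox: check whether the RBD image still has a watcher
# registered by the node that was shut down (pool/image names are placeholders)
rbd status replicapool/csi-vol-c88f4d72-36ab-11ea-8cca-0a580ae9400c
# If the old node is confirmed gone, remove the stale VolumeAttachment so the
# attachdetach-controller can re-attach the volume to the new node
kubectl get volumeattachment | grep pvc-6508c028-36aa-11ea-8679-000d3aad2fb7
kubectl delete volumeattachment <name-from-previous-output>
# If the original pod is stuck in Terminating on the dead node, force-delete it
kubectl -n adc-controller-application delete pod prometheus-0 --grace-period=0 --force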
Environment:
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- Kernel (e.g. uname -a): Linux rook-ceph-operator-778bd6f4c9-5khqs 5.0.0-1028-azure #30~18.04.1-Ubuntu SMP Fri Dec 6 11:47:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Azure, non-managed (non-AKS)
- Rook version (use rook version inside of a Rook Pod): v1.1.7
- Storage backend version (e.g. for ceph do ceph -v): 14.2.4
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (6 by maintainers)
Some ideas are being discussed here: https://github.com/rook/rook/issues/1507#issuecomment-1122965274