vsphere-csi-driver: PV cannot attach to new node if the previous node is deleted
/kind bug
What happened: We deployed a Kubernetes cluster in TKGI (PKS) and installed CSI 2.0 in the cluster, then deployed a StatefulSet. When the worker node on which the pod/PV was registered went into an error state, the BOSH resurrector created a new worker node to replace the failed one. After a while, the StatefulSet pods/PV were rescheduled to other nodes, but the mount hung. The error said:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 16m (x11 over 114m) kubelet, 23752abb-c3d8-4a4f-9058-404638deeab3 Unable to attach or mount volumes: unmounted volumes=[postgredb], unattached volumes=[default-token-dthk9 postgredb]: timed out waiting for the condition
Warning FailedMount <invalid> (x41 over 111m) kubelet, 23752abb-c3d8-4a4f-9058-404638deeab3 Unable to attach or mount volumes: unmounted volumes=[postgredb], unattached volumes=[postgredb default-token-dthk9]: timed out waiting for the condition
Checking the logs in csi-attacher.log showed the output below. Node 5844930c-f7d5-4b8d-a618-0337da1b14c7 was the worker node that had been replaced by the BOSH resurrector.
I0915 01:22:56.209752 1 csi_handler.go:428] Saving detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368257 1 csi_handler.go:439] Saved detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368303 1 csi_handler.go:99] Error processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"5844930c-f7d5-4b8d-a618-0337da1b14c7". Error: node wasn't found
I0915 01:22:56.368356 1 controller.go:175] Started VA processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368391 1 csi_handler.go:89] CSIHandler: processing VA "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368397 1 csi_handler.go:140] Starting detach operation for "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368416 1 csi_handler.go:147] Detaching "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.368428 1 csi_handler.go:550] Can't get CSINode 5844930c-f7d5-4b8d-a618-0337da1b14c7: csinode.storage.k8s.io "5844930c-f7d5-4b8d-a618-0337da1b14c7" not found
I0915 01:22:56.508694 1 csi_handler.go:428] Saving detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.512934 1 csi_handler.go:439] Saved detach error to "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:56.512969 1 csi_handler.go:99] Error processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"5844930c-f7d5-4b8d-a618-0337da1b14c7". Error: node wasn't found
I0915 01:22:57.368578 1 controller.go:175] Started VA processing "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:57.368615 1 csi_handler.go:89] CSIHandler: processing VA "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
I0915 01:22:57.368622 1 csi_handler.go:140] Starting detach operation for "csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c"
What you expected to happen: the pods/PV should attach successfully to the new node.
How to reproduce it (as minimally and precisely as possible):
- create k8s cluster in PKS env
- install CSI 2.0
- create statefulset workload with 1 PV
- shut down the node on which the PV/pod is registered.
I also executed the same steps with VCP, and there the pods were able to run on another worker node.
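For reference when reproducing this, a small diagnostic sketch to confirm the stuck state from the VolumeAttachment side (the VolumeAttachment name below is the one from the csi-attacher logs above; substitute your own):

```
# List VolumeAttachments and the node each one targets; the stale entry will
# still point at the worker node that no longer exists.
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached

# Inspect the detach error recorded on the stale VolumeAttachment.
kubectl get volumeattachment csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c \
  -o jsonpath='{.status.detachError.message}'
```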
Anything else we need to know?: Before the test, the pod-to-node mapping was as below. The csi-controller pod is not on the same worker node as the StatefulSet workload.
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-75dd94fdf-bx8hd 1/1 Running 0 14h 10.10.30.162 5844930c-f7d5-4b8d-a618-0337da1b14c7 <none> <none>
coredns-75dd94fdf-jndhd 1/1 Running 0 14h 10.10.30.164 239ff390-3824-478f-a644-2e71d55c1b93 <none> <none>
coredns-75dd94fdf-jrj8c 1/1 Running 0 14h 10.10.30.163 af1372f5-3763-4e69-8eb0-d146355d84d1 <none> <none>
metrics-server-6967cb5487-s67ww 1/1 Running 0 14h 10.10.30.165 239ff390-3824-478f-a644-2e71d55c1b93 <none> <none>
vsphere-csi-controller-6dbc64cb-hqkhr 6/6 Running 0 14h 10.10.30.166 226b7b06-0319-4adb-9230-080e05f4e4b4 <none> <none>
vsphere-csi-node-6d57j 3/3 Running 0 14h 10.10.30.170 af1372f5-3763-4e69-8eb0-d146355d84d1 <none> <none>
vsphere-csi-node-875qh 3/3 Running 0 14h 10.10.30.171 30db53a7-5e3c-433a-a763-9d88dfb9057c <none> <none>
vsphere-csi-node-r8l7c 3/3 Running 0 14h 10.10.30.168 5844930c-f7d5-4b8d-a618-0337da1b14c7 <none> <none>
vsphere-csi-node-shcr6 3/3 Running 0 14h 10.10.30.169 239ff390-3824-478f-a644-2e71d55c1b93 <none> <none>
vsphere-csi-node-xd72x 3/3 Running 0 14h 10.10.30.167 226b7b06-0319-4adb-9230-080e05f4e4b4 <none> <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
postgres-wl-0 2/2 Running 0 19m 10.10.31.34 5844930c-f7d5-4b8d-a618-0337da1b14c7 <none> <none>
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
Environment:
- csi-vsphere version: gcr.io/cloud-provider-vsphere/csi/release/syncer:v2.0.0 gcr.io/cloud-provider-vsphere/csi/release/driver:v2.0.0
- vsphere-cloud-controller-manager version:
- Kubernetes version: 1.18.5
- vSphere version: 7.0
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 34 (13 by maintainers)
This one-liner removes the finalizers from VolumeAttachments that have a detachError (see the sketch below).

This issue is fixed with the PR - https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/1879
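A sketch of such a one-liner, assuming jq is installed and that force-removing the finalizer is acceptable in your cluster:

```
# Strip the finalizers from every VolumeAttachment that carries a detachError,
# so the API server can garbage-collect the stale attachment objects.
kubectl get volumeattachments -o json \
  | jq -r '.items[] | select(.status.detachError != null) | .metadata.name' \
  | xargs -r -I{} kubectl patch volumeattachment {} \
      --type=merge -p '{"metadata":{"finalizers":null}}'
```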
We ran into this issue when performing resiliency testing on our Kubernetes cluster nodes. Steps to reproduce:
Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[config data]: timed out waiting for the condition. And the csi-controller had these errors:
We were able to recover from this state by deleting the corresponding VolumeAttachment resource.
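A minimal sketch of that recovery step (the VolumeAttachment name is taken from the csi-attacher logs above; use the one reported in your cluster):

```
# Find the VolumeAttachment that still references the deleted node...
kubectl get volumeattachments
# ...and delete it so the attacher stops retrying against the missing VM.
kubectl delete volumeattachment csi-909e60899167c6190538987f558be23e3f3998a5a779542af4e9798b1fbc6a4c
# If the object hangs in Terminating because of its finalizer, the
# finalizer-removal one-liner shown earlier can be applied to it as well.
```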
This is great news, thank you! Is there a release version you can share which is targeting #1879 for inclusion?
After you drain the node, all the pods are deleted but the VolumeAttachments still remain, because pods are removed from the API server as soon as the unmount succeeds, while detach happens afterwards. So wait until the VolumeAttachments are deleted and only then delete the node. If the node is deleted before the VolumeAttachments are deleted, the volume remains attached to the VM in the backend, and there is no way for the new pods coming up on the new node to attach this volume, since it has not yet been detached from the old node.
So the sequence of steps would be:
- drain the node
- wait until all VolumeAttachments referencing that node are deleted
- delete the node
This way, the pods on the new node would come up properly without any issue.
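A minimal sketch of that ordering, assuming kubectl access and jq (the node name is a placeholder):

```
NODE=5844930c-f7d5-4b8d-a618-0337da1b14c7   # placeholder: the node being replaced

# 1. Drain the node so its pods are rescheduled and their volumes get unmounted.
kubectl drain "$NODE" --ignore-daemonsets

# 2. Wait until no VolumeAttachment references the drained node any more.
while kubectl get volumeattachments -o json \
        | jq -e --arg n "$NODE" '[.items[] | select(.spec.nodeName == $n)] | length > 0' >/dev/null; do
  echo "waiting for VolumeAttachments on $NODE to be deleted..."
  sleep 10
done

# 3. Only now delete the node object (and the underlying VM).
kubectl delete node "$NODE"
```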
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I just ran into this same issue. The node was deleted ungracefully (the VM was deleted and the node removed from Kubernetes). All pods running on that node were automatically rescheduled to new nodes once it was deleted from Kubernetes. The CSI driver shows the error “Error: node wasn’t found”, which makes sense because the node was deleted completely.
We are on 1.19.9, and the “fix” mentioned above (#96617) was merged in 1.19.7, which shows that it does not resolve this issue.
We would like the volume to be detachable (i.e., the VolumeAttachment object removed) if the VM no longer exists, so that the disk can be moved to another VM. We cannot expect all of our nodes to be removed gracefully (drained, then waiting for the VolumeAttachment to be removed).
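Until the driver handles this automatically, a rough cleanup sketch along those lines (a hypothetical script, not part of the driver; it force-detaches, so use with care):

```
# Remove VolumeAttachments whose target node no longer exists in the cluster.
for va in $(kubectl get volumeattachments -o jsonpath='{.items[*].metadata.name}'); do
  node=$(kubectl get volumeattachment "$va" -o jsonpath='{.spec.nodeName}')
  if ! kubectl get node "$node" >/dev/null 2>&1; then
    echo "node $node is gone, removing VolumeAttachment $va"
    kubectl delete volumeattachment "$va" --wait=false
    # Detach keeps failing against the missing VM, so drop the finalizer too.
    kubectl patch volumeattachment "$va" --type=merge -p '{"metadata":{"finalizers":null}}'
  fi
done
```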
Here are the kube-controller logs from when it tries to do this operation:
@SandeepPissay Thanks.
We’re seeing the same behavior, with pods stuck in the “ContainerCreating” state even when the node was drained before shutting it down. VolumeAttachment deletion started a few minutes after the new node was launched, but it never finished, and the VolumeAttachment has been stuck there ever since.
Tried restarting the CSI controller. The controller detects that the node is gone, but it never deletes the VolumeAttachment.
Which logs will help you troubleshoot this, and how can I upload them? I saw the support section, but I’m not sure which product category to choose to file a ticket.
@tgelter can you file a ticket with VMware GSS? We would need logs for all containers in the CSI pod, and a VC support bundle as well. Also, please describe what steps were performed to repro the issue. Thanks.