longhorn: [BUG] Kubelet cannot finish terminating a pod that uses a PVC with volumeMode: Block when restarting the node
Describe the bug (🐛 if you encounter this issue)
This bug has an upstream ticket: https://github.com/kubernetes/kubernetes/issues/119608
Opening this downstream ticket to keep track of:
- Reproducing steps
- Investigation effort
- Possible workaround with the Longhorn specific implementation
- QA testing effort
To Reproduce
- Deploy Kubernetes v1.25.13+rke2r1
- Deploy Longhorn v1.4.3
- Deploy a block volume and a pod that uses the block volume
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-block-vol
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-volume-test
  namespace: default
spec:
  containers:
    - name: block-volume-test
      image: nginx:stable-alpine
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - devicePath: /dev/longhorn/testblk
          name: block-vol
      ports:
        - containerPort: 80
  volumes:
    - name: block-vol
      persistentVolumeClaim:
        claimName: longhorn-block-vol
```
- Find the worker node where the pod is running and SSH into that node
- Find the PID of the kubelet process: `ps aux | grep kubelet`
- Freeze the kubelet process: `kill -STOP <kubelet-pid>`. This step is needed so that kubelet cannot tear down the volume mount in the next step (a consolidated shell sketch of these steps follows this list)
- Delete the pod: `kubectl delete pod block-volume-test`. The pod will be stuck in terminating because kubelet is currently frozen
- Reboot this worker node
- When the worker node is back, follow the kubelet log: `tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log`. Observe this error repeatedly:
```
W1019 01:11:18.316567     967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
W1019 01:11:18.316582     967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
E1019 01:11:18.316662     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
```
- Observe that pod `block-volume-test` can never finish terminating
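The steps above, condensed into a minimal shell sketch. It assumes the standard RKE2 log path shown above and that each command is run in the right place (kubectl wherever a kubeconfig is available, the rest on the worker node); adjust to your environment.
```bash
# From a machine with kubeconfig access: find the worker node running the pod.
kubectl get pod block-volume-test -o wide

# On that worker node: find and freeze the kubelet process so it cannot
# tear down the block-volume mapping when the pod is deleted.
KUBELET_PID=$(pgrep -f kubelet | head -n 1)   # adjust if this matches more than one process
kill -STOP "$KUBELET_PID"

# From the kubeconfig machine: delete the pod; it stays in Terminating
# because kubelet is frozen.
kubectl delete pod block-volume-test --wait=false

# On the worker node: reboot while the teardown is still pending.
reboot

# After the node is back up, watch kubelet repeatedly fail UnmapBlockVolume.
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log
```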
Expected behavior
Pod block-volume-test is able to finish terminating
Environment
- Longhorn version: v1.4.3
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.13+rke2r1
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 3
- Comments: 29 (24 by maintainers)
Upstream created (not yet merged) PRs to backport the fix to Kubernetes >= 1.26 (https://github.com/kubernetes/kubernetes/pull/122398). Harvester can upgrade to Kubernetes >= v1.26 in the next release to get the fix.
Upstream is merged. Moving this one to ready-for-testing. Note that QA can test only when a fixed Kubernetes version is released.
Currently, there is only one backport PR, into Kubernetes v1.29; we are asking for backports into v1.26, v1.27, and v1.28: https://github.com/kubernetes/kubernetes/pull/122211#issuecomment-1858372893
Verified pass on Longhorn master (longhorn-manager a42748) and K3s distro v1.29.0+k3s1.
Following the test steps, I can reproduce the pod-stuck-in-terminating symptom on v1.28.2+k3s1. On v1.29.0+k3s1 with Longhorn master, the pod terminated successfully after the node reboot completed.
It looks like upstream proposed a new PR, https://github.com/kubernetes/kubernetes/pull/122211, and this one is moving fast. Hopefully it will be backported to v1.26 or at least v1.27.
Additional context
To get familiar with the attach/detach (AD) flow in Kubernetes, these are the steps I took:
How to build and deploy a customized Kubernetes component with RKE2:
```bash
TAG=v1.26.9-rke2r1-build20231017 ORG=phanle1010 make
```
Here `v1.26.9-rke2r1` is the version that you are patching, `build20231017` is a random string to keep track of the build time, and `phanle1010` is an account on Docker Hub or a private registry.
It seems the Kubernetes sig-storage team proposes a refactoring of the volume manager's volume reconstruction behind the feature gate `NewVolumeManagerReconstruction`, which can solve this kind of issue naturally: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3756-volume-reconstruction
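For anyone who wants to experiment with that gate on an RKE2 node, a rough sketch. The config path and `kubelet-arg` syntax follow common RKE2 conventions but should be verified against the RKE2 docs for your version, and whether the gate exists and its default depend on the Kubernetes minor version.
```bash
# Append a kubelet feature-gate flag to the RKE2 node config (merge with any
# existing kubelet-arg entries) and restart the agent; use rke2-server
# instead of rke2-agent on server nodes.
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
kubelet-arg:
  - "feature-gates=NewVolumeManagerReconstruction=true"
EOF
systemctl restart rke2-agent
```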
Looks like RKE2 ships a downstream patched version. Notice the suffix `r1` as in `v1.25.9+rke2r1`. Sometimes the suffix is increased to `r2`, for example https://github.com/rancher/rke2/releases/tag/v1.26.10%2Brke2r2
Log Analysis:
Context:
- Pod `block-volume-test` with UUID `e1edc8bb-b4b3-4c2f-83dc-096f78119be0` is pending in terminating state.
Log and code analysis:
- After kubelet restarts, it reconstructs volume `pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503` from the data on disk (link) and from the `node.Status.VolumesAttached` field (link).
- Kubelet calls `SetUpDevice()`. This operation prepares the file/directory for the stagingPath and calls the Longhorn CSI plugin's `NodeStageVolume()`: link. However, at this moment the Longhorn CSI plugin has not yet re-registered with the newly restarted kubelet, so `SetUpDevice()` fails.
- Kubelet marks the device as unmounted and clears the `deviceMountPath` (which is also the `globalMapPath`) when `blockMapper.SetUpDevice` fails, because the volume device is currently in `DeviceMountUncertain` as set in the step above: link.
- The subsequent unmap operation uses the `globalMapPath` (which is the `deviceMountPath` that was cleared in the `SetUpDevice()` step above). This function will always fail because the `globalMapPath` is now empty: link. (See the shell sketch after this list for inspecting these paths on the node.)
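A hedged way to look at those paths on the node while the pod is stuck. The per-pod symlink path is taken from the kubelet log above; the global map path location under `/var/lib/kubelet/plugins` is an assumption about the usual CSI block-volume layout.
```bash
# Global map path (deviceMountPath) for CSI block volumes, one directory per PV.
ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/

# Per-pod symlink that the "Unmap skipped because symlink does not exist"
# warning in the kubelet log refers to.
ls -l /var/lib/kubelet/pods/<pod-uid>/volumeDevices/kubernetes.io~csi/
```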
Proposal
This issue looks like a race bug in kubelet to me. Between step 8 and step 9, if the Longhorn CSI plugin re-registers quickly enough, kubelet successfully reruns `SetUpDevice()` and sets a non-empty value for the `deviceMountPath` (which is also the `globalMapPath`) of the volume, and the subsequent `UnmapBlockVolume()` call succeeds. To avoid this race condition, I agree with @ZhangShuaiyi that we should not mark the device as unmounted (and thus not clean up the `deviceMountPath`) if the volume was previously reconstructed: PR link
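A hedged way to check, after the reboot, whether the Longhorn CSI plugin has re-registered with kubelet yet, i.e. whether the race window above is still open. The registration-socket directory and the `app=longhorn-csi-plugin` label are assumptions about a default Longhorn install.
```bash
# Registration socket that the node-driver-registrar sidecar creates for kubelet.
ls -l /var/lib/kubelet/plugins_registry/ | grep -i longhorn

# The longhorn-csi-plugin pod on this node must be Running again before
# NodeStageVolume() can succeed.
kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o wide
```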
Changed the priority to 1, because this is related to an unresolved issue upstream.
Hi @PhanLe1010 @ZhangShuaiyi, I tried the same reproduce steps on Kubernetes v1.26.0. It did show the error messages several times, but the pod got deleted successfully in the end. Can you help to verify that? Thanks!
@PhanLe1010 Yes. Looking forward to your feedback.
In a Harvester recreate I did NOT hit @PhanLe1010’s log of:
I DID hit multiple logs similar to @ZhangShuaiyi's:
This was followed by many of the expected:
While investigating a similar scenario in https://github.com/longhorn/longhorn/issues/6961, I wondered why Longhorn never recovered, even though the pod consuming it on the rebooted node was (presumably --force) terminated. An orphaned volume attachment remained, and Longhorn was never instructed to detach the volume from the old node.
The attach/detach controller has a 6 minute timeout. If it thinks a volume should be detached but kubelet on a node is still reporting it as attached and/or in use (`kubectl get node -oyaml`), it will wait 6 minutes and then just force detach it (call `ControllerUnpublishVolume`). HOWEVER, this timeout is only effective as long as the node in question is unhealthy. So, in the reboot scenario, if kubelet comes back quickly enough and hits this GitHub issue, the volume is never force detached and is stuck (it can never be attached to a different node, and, if it is a migratable volume, it gets into a weird migration situation).
I think if the node never comes back, the volume will be force detached and the associated volume attachment will go away. This at least allows the Longhorn volume to be used by another node.
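A few commands for inspecting the state the attach/detach controller reconciles against; `<node-name>` is a placeholder.
```bash
# What kubelet on the node is still reporting as in use / attached.
kubectl get node <node-name> -o jsonpath='{.status.volumesInUse}'; echo
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}'; echo

# The VolumeAttachment objects the controller created for the Longhorn CSI driver.
kubectl get volumeattachments.storage.k8s.io | grep driver.longhorn.io
```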
Relevant code:
https://github.com/kubernetes/kubernetes/blob/a1a87a0a2bcd605820920c6b0e618a8ab7d117d4/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L214C1-L235C5
Confirming logs (with kube-controller-manager logs turned up to 5):
Deleting the volume attachment manually unsticks the volume and allows it to be used elsewhere. However, the kubelet on the old node continues to report it as attached and in use and continues to log these errors.
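For reference, a sketch of that manual workaround; the attachment name is a placeholder, and deleting a VolumeAttachment out from under a node that still reports the volume as in use should be done with care.
```bash
# Find the VolumeAttachment pinning the PV to the old node ...
kubectl get volumeattachments.storage.k8s.io \
  -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName

# ... and delete it so the Longhorn volume can be attached elsewhere.
kubectl delete volumeattachment <attachment-name>
```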
@PhanLe1010 Yes, I am experiencing the same problem. I think there is no need to call `actualStateOfWorld.MarkDeviceAsUnmounted` in `markDeviceErrorState` if the volume was previously reconstructed. I opened a PR to do this.
@PhanLe1010
Does kubelet have the error `Error: MapVolume.SetUpDevice failed for volume ... : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name ... not found in the list of registered CSI drivers`? In my environment, `csi.(*csiBlockMapper).SetUpDevice` can succeed sometimes.
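A quick way to check for that on an RKE2 node, reusing the kubelet log path from the reproduce steps above; the grep pattern is just the distinctive substring of the quoted error.
```bash
# Count occurrences of SetUpDevice failing because the driver had not
# re-registered yet, then show the most recent SetUpDevice-related lines.
grep -c "failed to get CSI client" /var/lib/rancher/rke2/agent/logs/kubelet.log
grep "SetUpDevice" /var/lib/rancher/rke2/agent/logs/kubelet.log | tail -n 5
```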