longhorn: [BUG] Kubelet cannot finish terminating a pod that uses a PVC with volumeMode: Block when restarting the node

Describe the bug (🐛 if you encounter this issue)

This bug is tracked upstream at https://github.com/kubernetes/kubernetes/issues/119608

This downstream ticket is opened to keep track of:

  • Reproduction steps
  • Investigation effort
  • Possible workarounds with a Longhorn-specific implementation
  • QA testing effort

To Reproduce

  1. Deploy Kubernetes v1.25.13+rke2r1
  2. Deploy Longhorn v1.4.3
  3. Deploy a block volume and a pod that uses the block volume
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: longhorn-block-vol
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Block
      storageClassName: longhorn
      resources:
        requests:
          storage: 2Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: block-volume-test
      namespace: default
    spec:
      containers:
        - name: block-volume-test
          image: nginx:stable-alpine
          imagePullPolicy: IfNotPresent
          volumeDevices:
            - devicePath: /dev/longhorn/testblk
              name: block-vol
          ports:
            - containerPort: 80
      volumes:
        - name: block-vol
          persistentVolumeClaim:
            claimName: longhorn-block-vol
    
  4. Find the worker node where the pod is running and SSH into that node (steps 4-9 are condensed into a shell sketch after this list)
  5. Find the PID of the kubelet process with ps aux | grep kubelet
  6. Freeze the kubelet process with kill -STOP <kubelet-pid>. This step is needed so that kubelet cannot tear down the volume mount in the next step
  7. Run kubectl delete pod block-volume-test. The pod will be stuck in Terminating because kubelet is currently frozen
  8. Reboot this worker node
  9. When the worker node is back, follow the kubelet log with tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log. Observe this error repeatedly:
    W1019 01:11:18.316567     967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
    W1019 01:11:18.316582     967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
    E1019 01:11:18.316662     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
    
  10. Observe that pod block-volume-test can never finish terminating
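
For convenience, here is a rough shell sketch of steps 4-9 above (the kubelet log path is the RKE2 default; pgrep is used as an alternative to ps | grep, and --wait=false just returns immediately while the pod stays stuck):

# Step 5 (on the worker node): find the kubelet PID
KUBELET_PID=$(pgrep -o -x kubelet)

# Step 6: freeze kubelet so it cannot tear down the volume mount
kill -STOP "$KUBELET_PID"

# Step 7 (from a machine with kubectl access): the pod gets stuck in Terminating
kubectl delete pod block-volume-test --wait=false

# Step 8: reboot the worker node
reboot

# Step 9: after the node is back, watch kubelet for the repeating UnmapBlockVolume error
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log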

Expected behavior

Pod block-volume-test is able to finish terminating

Environment

  • Longhorn version: v1.4.3
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.13+rke2r1

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 3
  • Comments: 29 (24 by maintainers)

Most upvoted comments

Upstream has created (not yet merged) PRs to backport the fix to >= v1.26 (https://github.com/kubernetes/kubernetes/pull/122398)! Harvester can upgrade to Kubernetes >= v1.26 in its next release to get the fix.

The upstream fix is merged. Moving this one to ready-for-testing. Note that QA can only test once a Kubernetes version containing the fix is released.

Currently there is only one backport PR, into Kubernetes v1.29; we are asking for backports into v1.26, v1.27, and v1.28: https://github.com/kubernetes/kubernetes/pull/122211#issuecomment-1858372893

Verified passed on Longhorn master (longhorn-manager a42748) and K3s distro v1.29.0+k3s1

Following the test steps, I can reproduce the pod-stuck-in-terminating symptom on v1.28.2+k3s1. On v1.29.0+k3s1 with Longhorn master, the pod terminates successfully after the node reboot completes.

Progress on the upstream PR seems slow.

It looks like upstream has proposed a new PR, https://github.com/kubernetes/kubernetes/pull/122211, and this one is moving fast. Hopefully it will be backported to v1.26, or at least v1.27.

Additional context

To get familiar with the attach/detach (AD) flow in Kubernetes, these are the steps I took:

How to build and deploy customized kubernetes component with RKE2:

  1. Clone https://github.com/rancher/image-build-kubernetes
  2. Modify https://github.com/rancher/image-build-kubernetes/blob/47efab06987f6d0a51424dcf5e56ba7dc86d5b23/Dockerfile#L27 to clone your fork of kubernetes
  3. Build the image with TAG=v1.26.9-rke2r1-build20231017 ORG=phanle1010 make, where v1.26.9-rke2r1 is the version you are patching, build20231017 is an arbitrary string to keep track of the build time, and phanle1010 is your account on Docker Hub or a private registry
  4. Build the hardened-kubernetes image and push it to your Docker Hub account or another private registry
  5. Clone https://github.com/rancher/rke2
  6. Modify https://github.com/rancher/rke2/blob/master/Dockerfile#L116 to point at your hardened-kubernetes image
  7. Build rke2 by running make
  8. Push the resulting rke2-runtime image to your Docker Hub account or another private registry
  9. Run the resulting rke2 binary (a condensed shell sketch of this build flow follows the list)
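
A condensed shell sketch of the build flow above (image names, tags, and the phanle1010 org are illustrative placeholders carried over from the steps; the Dockerfile line numbers referenced in steps 2 and 6 may have moved since):

# Steps 1-4: build and push a patched hardened-kubernetes image
git clone https://github.com/rancher/image-build-kubernetes && cd image-build-kubernetes
# edit the Dockerfile so it clones your Kubernetes fork instead of upstream
TAG=v1.26.9-rke2r1-build20231017 ORG=phanle1010 make
docker push phanle1010/hardened-kubernetes:v1.26.9-rke2r1-build20231017   # illustrative image name

# Steps 5-8: build RKE2 against the patched image
cd .. && git clone https://github.com/rancher/rke2 && cd rke2
# edit the Dockerfile to reference the hardened-kubernetes image pushed above
make
docker push phanle1010/rke2-runtime:v1.26.9-rke2r1-build20231017          # illustrative image name

# Step 9: copy the resulting rke2 binary to the test node and run it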

It seems the Kubernetes sig-storage team proposes a refactoring of the volume manager's volume reconstruction behind the feature gate NewVolumeManagerReconstruction, which could solve this kind of issue naturally.

Feature gate                      Default  Stage  Since  Until
NewVolumeManagerReconstruction    false    Beta   1.27   1.27
NewVolumeManagerReconstruction    true     Beta   1.28

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3756-volume-reconstruction
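
If the gate is not yet default-on (for example on v1.27), it can be enabled through the kubelet feature-gates flag. A sketch assuming the standard RKE2 config.yaml mechanism (file path and service name are the RKE2 defaults):

# append (or merge) a kubelet-arg entry into /etc/rancher/rke2/config.yaml
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
kubelet-arg:
  - "feature-gates=NewVolumeManagerReconstruction=true"
EOF

# restart the node agent so kubelet picks up the flag
systemctl restart rke2-agent   # rke2-server on server nodes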

Does RKE2 ship any downstream customizations? If so, maybe we can patch RKE2 and have Harvester pick up the patch?

Looks like rke2 ships a downstream patched version. Notice the suffix r1 as in v1.25.9+rke2r1. Sometimes the suffix is increased to r2. For example, https://github.com/rancher/rke2/releases/tag/v1.26.10%2Brke2r2

Log Analysis:

Context:

  • Kubelet v1.26.10 with log level turned up to level 5
  • Pod block-volume-test with UID e1edc8bb-b4b3-4c2f-83dc-096f78119be0 is stuck in the Terminating state
  • Node restart

Log and code analysis:

  1. Kubelet wakes up
    I1205 22:38:51.835728    1085 server.go:1186] "Started kubelet"
    
  2. The volume is added to the desired state of the world (DSOW) in the first run of the desiredStateOfWorldPopulator (link and link) because kubelet sees that there are traces of the pod on the node
    I1205 22:39:02.237739    1085 desired_state_of_world_populator.go:494] "Found bound PV for PVC" PVC="default/longhorn-block-vol" PVCUID=0efd77fa-d7b8-4896-8b5e-6f44b95a0503 PVName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
    I1205 22:39:03.837036    1085 desired_state_of_world_populator.go:505] "Extracted volumeSpec from bound PV and PVC" PVC="default/longhorn-block-vol" PVCUID=0efd77fa-d7b8-4896-8b5e-6f44b95a0503 PVName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" volumeSpecName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
  3. Next, Kubelet scans the pod directory and volume directory on the host and finds the volume pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503. It reconstructs this volume from the on-disk data (link)
    I1205 22:39:04.375325    1085 csi_plugin.go:716] plugin.ConstructBlockVolumeSpec [podUID=e1edc8bb-b4b3-4c2f-83dc-096f78119be0, specVolName=pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503, path=/var/lib/kubelet/pods/e1edc8bb-b4b3-4c2f-83dc-096f78119be0/volumeDevices/kubernetes.io~csi/pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503]
    I1205 22:39:04.375335    1085 csi_util.go:78] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503/data/vol_data.json]
    I1205 22:39:04.377502    1085 csi_plugin.go:724] kubernetes.io/csi: plugin.ConstructBlockVolumeSpec extracted [map[string]string{"attachmentID":"csi-b98da4fde01f7048a0037a164da1bbd94aead353f073959b07e05e3ed9e9a683", "driverName":"driver.longhorn.io", "nodeName":"phan-block-vol", "specVolID":"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503", "volumeHandle":"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"}]
    
  4. Next, Kubelet waits for the volume to be reported as attached to the node (link), which means it is waiting for the volume to appear in the node.Status.VolumesAttached field (link)
    I1205 22:39:04.482285    1085 reconciler_common.go:244] "Starting operationExecutor.VerifyControllerAttachedVolume for volume \"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\") pod \"block-volume-test\" (UID: \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\") " pod="default/block-volume-test"
    
  5. The volume is marked as attached to the node. Because Kubelet skipped adding the volume to the actual state of the world (ASOW) during reconstruction (Kubelet skipped the volume here), it now processes the volume and marks it as uncertain:
    I1205 22:39:17.493293    1085 reconciler.go:113] "Volume is marked as mounted in uncertain state and added to the actual state" pod="" podName=e1edc8bb-b4b3-4c2f-83dc-096f78119be0 volumeName=kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503
    
  6. Since the volume is uncertain in the ASOW and it appears in the DSOW, Kubelet tries to mount the block volume.
    I1205 22:39:10.080664    1085 reconciler_common.go:217] "Starting operationExecutor.MountVolume for volume \"kube-api-access-tqwlj\" (UniqueName: \"kubernetes.io/projected/e880c824-0916-4293-a663-09fe4e4ed7a3-kube-api-access-tqwlj\") pod \"longhorn-ui-59f4f6dc7c-t7mfs\" (UID: \"e880c824-0916-4293-a663-09fe4e4ed7a3\") " pod="longhorn-system/longhorn-ui-59f4f6dc7c-t7mfs"
    
  7. The first operation in the mounting process is SetUpDevice(). This operation prepares the file/directory for the stagingPath and calls the Longhorn CSI plugin's NodeStageVolume(): link. However, at this moment the Longhorn CSI plugin has not yet re-registered with the newly restarted Kubelet. Therefore SetUpDevice() fails.
    E1205 22:39:17.508714    1085 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503 podName: nodeName:}" failed. No retries permitted until 2023-12-05 22:39:18.008686852 +0000 UTC m=+27.122034019 (durationBeforeRetry 500ms). Error: MapVolume.SetUpDevice failed for volume "pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503") pod "block-volume-test" (UID: "e1edc8bb-b4b3-4c2f-83dc-096f78119be0") : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers
    
  8. The problem is that Kubelet clears the deviceMountPath (which is also the globalMapPath) when blockMapper.SetUpDevice fails, because the volume device is currently in the DeviceMountUncertain state set in the step above: link
  9. Next, Kubelet gets the update from the API server that the pod is no longer needed because it is terminating:
    I1205 22:39:44.553021    1085 desired_state_of_world_populator.go:270] "Removing volume from desired state" pod="default/block-volume-test" podUID=e1edc8bb-b4b3-4c2f-83dc-096f78119be0 volumeName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
  10. Kubelet starts the unmount flow for the volume: link
    I1205 22:39:44.625585    1085 reconciler_common.go:162] "Starting operationExecutor.UnmountVolume for volume \"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\") pod \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\" (UID: \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\") "
    
  11. Part of the unmount flow is UnmapBlockVolume() (link). This function tries to unbind/unlink the symlink at globalMapPath (which is the deviceMountPath that was cleared in SetUpDevice() above). This call will always fail because globalMapPath is now empty (link)
    E1205 22:39:44.626078    1085 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503 podName:e1edc8bb-b4b3-4c2f-83dc-096f78119be0 nodeName:}" failed. No retries permitted until 2023-12-05 22:39:45.12602954 +0000 UTC m=+54.239376707 (durationBeforeRetry 500ms). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503") pod "e1edc8bb-b4b3-4c2f-83dc-096f78119be0" (UID: "e1edc8bb-b4b3-4c2f-83dc-096f78119be0") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: e1edc8bb-b4b3-4c2f-83dc-096f78119be0, bindMount: true: failed to unmap device from map path. mapPath is empty
    
  12. This error is repeated over and over (the on-disk paths involved can be inspected with the shell sketch after this list)
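
To follow the analysis on the node itself, the on-disk artifacts referenced above can be inspected directly. A sketch using the PV name and pod UID from the logs (paths are the kubelet defaults on this setup):

PV=pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503
POD=e1edc8bb-b4b3-4c2f-83dc-096f78119be0

# vol_data.json loaded by plugin.ConstructBlockVolumeSpec (step 3)
cat /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/$PV/data/vol_data.json

# per-volume directory on the global/staging side
ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/$PV/

# per-pod symlink that UnmapBlockVolume tries to remove (steps 10-11)
ls -l /var/lib/kubelet/pods/$POD/volumeDevices/kubernetes.io~csi/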

Proposal

This issue looks like a race bug in Kubelet to me. Between steps 8 and 9, if the Longhorn CSI plugin re-registers quickly enough, Kubelet successfully reruns SetUpDevice() and sets a non-empty value for the deviceMountPath (which is also the globalMapPath of the volume), and the subsequent UnmapBlockVolume() call succeeds.

To avoid this race condition, I agree with @ZhangShuaiyi that we should not mark the device as unmounted (and thus not clean up the deviceMountPath) if the volume was previously reconstructed: PR link

Changed the priority to 1, because this is related to an unresolved issue upstream.

Hi @PhanLe1010 @ZhangShuaiyi, I tried the same reproduce steps on Kubernetes v1.26.0. It did show the error messages several times, but the pod got deleted successfully in the end. Can you help to verify that? Thanks!

@ZhangShuaiyi Do you mean this PR kubernetes/kubernetes#121443 ?

That is actually very promising. I am currently busy on a different task, but I will do some more reading in the next few days to understand the whole flow.

@PhanLe1010 Yes. Looking forward to your feedback.

In a Harvester recreate I did NOT hit @PhanLe1010’s log of:

E1019 00:42:39.834538     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName: nodeName:}" failed. No retries permitted until 2023-10-19 00:43:11.834505392 +0000 UTC m=+75.317052505 (durationBeforeRetry 32s). Error: MapVolume.SetUpDevice failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "block-volume-test" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : rpc error: code = InvalidArgument desc = volume pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 hasn't been attached yet

I DID hit multiple logs similar to @ZhangShuaiyi's:

E1024 19:47:21.910461    2517 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d podName: nodeName:}" failed. No retries permitted until 2023-10-24 19:47:25.910437894 +0000 UTC m=+32.546967674 (durationBeforeRetry 4s). Error: MapVolume.SetUpDevice failed for volume "pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d") pod "virt-launcher-worker1-xd2p4" (UID: "54666d5c-508e-4d2e-b9e1-cb9103982ce6") : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers

This was followed by many of the expected:

I1024 19:47:25.540050    2517 reconciler.go:211] "operationExecutor.UnmountVolume started for volume \"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\") pod \"54666d5c-508e-4d2e-b9e1-cb9103982ce6\" (UID: \"54666d5c-508e-4d2e-b9e1-cb9103982ce6\") "
E1024 19:47:25.540244    2517 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d podName:54666d5c-508e-4d2e-b9e1-cb9103982ce6 nodeName:}" failed. No retries permitted until 2023-10-24 19:47:26.040211653 +0000 UTC m=+32.676741431 (durationBeforeRetry 500ms). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d") pod "54666d5c-508e-4d2e-b9e1-cb9103982ce6" (UID: "54666d5c-508e-4d2e-b9e1-cb9103982ce6") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 54666d5c-508e-4d2e-b9e1-cb9103982ce6, bindMount: true: failed to unmap device from map path. mapPath is empty

While investigating a similar scenario in https://github.com/longhorn/longhorn/issues/6961, I wondered why Longhorn never recovered, even though the pod consuming it on the rebooted node was (presumably --force) terminated. An orphaned volume attachment remained, and Longhorn was never instructed to detach the volume from the old node.

The attach/detach controller has a 6 minute timeout. If it thinks a volume should be detached but kubelet on a node is still reporting it as attached and/or in use (kubectl get node -oyaml), it will wait 6 minutes and then just force detach it (call ControllerUnpublishVolume). HOWEVER, this timeout is only effective as long as the node in question is unhealthy.

So, in the reboot scenario, if kubelet comes back quickly enough and hits this GitHub issue, the volume is never force detached and is stuck (it can never be attached to a different node, and, if it is a migratable volume, it gets into a weird migration situation).

I think if the node never comes back, the volume will be force detached and the associated volume attachment will go away. This at least allows the Longhorn volume to be used by another node.

Relevant code:

https://github.com/kubernetes/kubernetes/blob/a1a87a0a2bcd605820920c6b0e618a8ab7d117d4/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L214C1-L235C5

Confirming logs (with kube-controller-manager logs turned up to 5):

I1025 15:52:45.662316       1 reconciler.go:233] "Cannot detach volume because it is still mounted" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d VolumeSpec:0xc00602f998 NodeName:harvester-pnws6 PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:true DetachRequestedTime:2023-10-25 15:52:29.477725428 +0000 UTC m=+32.572903966}

Deleting the volume attachment manually unsticks the volume and allows it to be used elsewhere. However, the kubelet on the old node continues to report it as attached and in use and continues to log these errors.
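
For reference, a sketch of how to inspect and clear the stuck attachment described above (the node and VolumeAttachment names are placeholders):

# What the node still reports as attached and in use
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'

# Find the VolumeAttachment object for the stuck PV
kubectl get volumeattachments | grep pvc-476a4873-d166-4e6e-8bac-ccf01083e14d

# Deleting it manually unsticks the Longhorn volume, though the kubelet errors on the old node persist
kubectl delete volumeattachment <volumeattachment-name>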

@PhanLe1010 Yes, I am experiencing the same problem. I think there is no need to call actualStateOfWorld.MarkDeviceAsUnmounted in markDeviceErrorState if the volume was previously reconstructed. I opened a PR to do this.

@PhanLe1010
Does kubelet have the error Error: MapVolume.SetUpDevice failed for volume ... : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name ... not found in the list of registered CSI drivers? In my environment, csi.(*csiBlockMapper).SetUpDevice can succeed sometimes.