longhorn: [BUG] Kubelet cannot finish terminating a pod that uses a PVC with volumeMode: Block when restarting the node

Describe the bug (🐛 if you encounter this issue)

This bug is tracked upstream at https://github.com/kubernetes/kubernetes/issues/119608

This downstream ticket is opened to keep track of:

  • Reproduction steps
  • Investigation effort
  • Possible workarounds with a Longhorn-specific implementation
  • QA testing effort

To Reproduce

  1. Deploy Kubernetes v1.25.13+rke2r1
  2. Deploy Longhorn v1.4.3
  3. Deploy a block volume and a pod that uses the block volume
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: longhorn-block-vol
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Block
      storageClassName: longhorn
      resources:
        requests:
          storage: 2Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: block-volume-test
      namespace: default
    spec:
      containers:
        - name: block-volume-test
          image: nginx:stable-alpine
          imagePullPolicy: IfNotPresent
          volumeDevices:
            - devicePath: /dev/longhorn/testblk
              name: block-vol
          ports:
            - containerPort: 80
      volumes:
        - name: block-vol
          persistentVolumeClaim:
            claimName: longhorn-block-vol
    
  4. Find the worker node where the pod is running and SSH into that node (steps 4-9 are condensed into a shell sketch after this list)
  5. Find the PID of the kubelet process with ps aux | grep kubelet
  6. Freeze the kubelet process with kill -STOP <kubelet-pid>. This step is needed so that kubelet cannot tear down the volume mount in the next step
  7. Run kubectl delete pod block-volume-test. The pod will be stuck in Terminating because kubelet is currently frozen
  8. Reboot this worker node
  9. When the worker node is back, follow the kubelet log with tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log. Observe this error repeatedly:
    W1019 01:11:18.316567     967 volume_path_handler_linux.go:62] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device path: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7"
    W1019 01:11:18.316582     967 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/19e41dfe-8cee-40c2-a39b-37c68b01c9a7/volumeDevices/kubernetes.io~csi/pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5
    E1019 01:11:18.316662     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName:19e41dfe-8cee-40c2-a39b-37c68b01c9a7 nodeName:}" failed. No retries permitted until 2023-10-19 01:13:20.316609446 +0000 UTC m=+1883.799156551 (durationBeforeRetry 2m2s). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "19e41dfe-8cee-40c2-a39b-37c68b01c9a7" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 19e41dfe-8cee-40c2-a39b-37c68b01c9a7, bindMount: true: failed to unmap device from map path. mapPath is empty
    
  10. Observe that pod block-volume-test can never finish terminating
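
For convenience, here is a rough shell sketch of steps 4-9 above (the kubelet log path is the RKE2 default; pgrep is used as an alternative to ps | grep, and --wait=false just returns immediately while the pod stays stuck):

# Step 5 (on the worker node): find the kubelet PID
KUBELET_PID=$(pgrep -o -x kubelet)

# Step 6: freeze kubelet so it cannot tear down the volume mount
kill -STOP "$KUBELET_PID"

# Step 7 (from a machine with kubectl access): the pod gets stuck in Terminating
kubectl delete pod block-volume-test --wait=false

# Step 8: reboot the worker node
reboot

# Step 9: after the node is back, watch kubelet for the repeating UnmapBlockVolume error
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log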

Expected behavior

Pod block-volume-test is able to finish terminating

Environment

  • Longhorn version: v1.4.3
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.13+rke2r1

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 3
  • Comments: 29 (24 by maintainers)

Most upvoted comments

Upstream has created (not yet merged) PRs to backport the fix to >= v1.26 (https://github.com/kubernetes/kubernetes/pull/122398)! Harvester can upgrade to Kubernetes >= v1.26 in its next release to get the fix.

The upstream fix is merged. Moving this one to ready-for-testing. Note that QA can only test once a Kubernetes version containing the fix is released.

Currently there is only one backport PR, into Kubernetes v1.29; we are asking for backports into v1.26, v1.27, and v1.28: https://github.com/kubernetes/kubernetes/pull/122211#issuecomment-1858372893

Verified passed on Longhorn master (longhorn-manager a42748) and K3s distro v1.29.0+k3s1

Following the test steps, I can reproduce the pod-stuck-in-terminating symptom on v1.28.2+k3s1. On v1.29.0+k3s1 with Longhorn master, the pod terminates successfully after the node reboot completes.

Progress on the upstream PR seems slow.

It looks like upstream has proposed a new PR, https://github.com/kubernetes/kubernetes/pull/122211, and this one is moving fast. Hopefully it will be backported to v1.26, or at least v1.27.

Additional context

To get familiar with the attach/detach (AD) flow in Kubernetes, these are the steps I took:

How to build and deploy customized kubernetes component with RKE2:

  1. Clone https://github.com/rancher/image-build-kubernetes
  2. Modify https://github.com/rancher/image-build-kubernetes/blob/47efab06987f6d0a51424dcf5e56ba7dc86d5b23/Dockerfile#L27 to clone your fork of kubernetes
  3. Build the image with TAG=v1.26.9-rke2r1-build20231017 ORG=phanle1010 make, where v1.26.9-rke2r1 is the version you are patching, build20231017 is an arbitrary string to keep track of the build time, and phanle1010 is your account on Docker Hub or a private registry
  4. Build the hardened-kubernetes image and push it to your Docker Hub account or another private registry
  5. Clone https://github.com/rancher/rke2
  6. Modify https://github.com/rancher/rke2/blob/master/Dockerfile#L116 to point at your hardened-kubernetes image
  7. Build rke2 by running make
  8. Push the resulting rke2-runtime image to your Docker Hub account or another private registry
  9. Run the resulting rke2 binary (a condensed shell sketch of this build flow follows the list)
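
A condensed shell sketch of the build flow above (image names, tags, and the phanle1010 org are illustrative placeholders carried over from the steps; the Dockerfile line numbers referenced in steps 2 and 6 may have moved since):

# Steps 1-4: build and push a patched hardened-kubernetes image
git clone https://github.com/rancher/image-build-kubernetes && cd image-build-kubernetes
# edit the Dockerfile so it clones your Kubernetes fork instead of upstream
TAG=v1.26.9-rke2r1-build20231017 ORG=phanle1010 make
docker push phanle1010/hardened-kubernetes:v1.26.9-rke2r1-build20231017   # illustrative image name

# Steps 5-8: build RKE2 against the patched image
cd .. && git clone https://github.com/rancher/rke2 && cd rke2
# edit the Dockerfile to reference the hardened-kubernetes image pushed above
make
docker push phanle1010/rke2-runtime:v1.26.9-rke2r1-build20231017          # illustrative image name

# Step 9: copy the resulting rke2 binary to the test node and run it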

It seems the Kubernetes sig-storage team proposes a refactoring of the volume manager's volume reconstruction behind the feature gate NewVolumeManagerReconstruction, which could solve this kind of issue naturally.

Feature gate                      Default  Stage  Since  Until
NewVolumeManagerReconstruction    false    Beta   1.27   1.27
NewVolumeManagerReconstruction    true     Beta   1.28

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3756-volume-reconstruction
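
If the gate is not yet default-on (for example on v1.27), it can be enabled through the kubelet feature-gates flag. A sketch assuming the standard RKE2 config.yaml mechanism (file path and service name are the RKE2 defaults):

# append (or merge) a kubelet-arg entry into /etc/rancher/rke2/config.yaml
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
kubelet-arg:
  - "feature-gates=NewVolumeManagerReconstruction=true"
EOF

# restart the node agent so kubelet picks up the flag
systemctl restart rke2-agent   # rke2-server on server nodes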

Does RKE2 ship any downstream customizations? If so, maybe we can patch RKE2 and have Harvester pick up the patch?

Looks like rke2 ships a downstream patched version. Notice the suffix r1 as in v1.25.9+rke2r1. Sometimes the suffix is increased to r2. For example, https://github.com/rancher/rke2/releases/tag/v1.26.10%2Brke2r2

Log Analysis:

Context:

  • Kubelet v1.26.10 with log level turned up to level 5
  • Pod block-volume-test with UID e1edc8bb-b4b3-4c2f-83dc-096f78119be0 is stuck in the Terminating state
  • Node restart

Log and code analysis:

  1. Kubelet wakes up
    I1205 22:38:51.835728    1085 server.go:1186] "Started kubelet"
    
  2. The volume is added to the desired state of the world (DSOW) in the first run of the desiredStateOfWorldPopulator (link and link) because kubelet sees that there are traces of the pod on the node
    I1205 22:39:02.237739    1085 desired_state_of_world_populator.go:494] "Found bound PV for PVC" PVC="default/longhorn-block-vol" PVCUID=0efd77fa-d7b8-4896-8b5e-6f44b95a0503 PVName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
    I1205 22:39:03.837036    1085 desired_state_of_world_populator.go:505] "Extracted volumeSpec from bound PV and PVC" PVC="default/longhorn-block-vol" PVCUID=0efd77fa-d7b8-4896-8b5e-6f44b95a0503 PVName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" volumeSpecName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
  3. Next, Kubelet scans the pod directory and volume directory on the host and finds the volume pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503. It reconstructs this volume from the on-disk data (link)
    I1205 22:39:04.375325    1085 csi_plugin.go:716] plugin.ConstructBlockVolumeSpec [podUID=e1edc8bb-b4b3-4c2f-83dc-096f78119be0, specVolName=pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503, path=/var/lib/kubelet/pods/e1edc8bb-b4b3-4c2f-83dc-096f78119be0/volumeDevices/kubernetes.io~csi/pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503]
    I1205 22:39:04.375335    1085 csi_util.go:78] kubernetes.io/csi: loading volume data file [/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503/data/vol_data.json]
    I1205 22:39:04.377502    1085 csi_plugin.go:724] kubernetes.io/csi: plugin.ConstructBlockVolumeSpec extracted [map[string]string{"attachmentID":"csi-b98da4fde01f7048a0037a164da1bbd94aead353f073959b07e05e3ed9e9a683", "driverName":"driver.longhorn.io", "nodeName":"phan-block-vol", "specVolID":"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503", "volumeHandle":"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"}]
    
  4. Next, Kubelet waits for the volume to be reported as attached to the node (link), which means it is waiting for the volume to appear in the node.Status.VolumesAttached field (link)
    I1205 22:39:04.482285    1085 reconciler_common.go:244] "Starting operationExecutor.VerifyControllerAttachedVolume for volume \"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\") pod \"block-volume-test\" (UID: \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\") " pod="default/block-volume-test"
    
  5. The volume is marked as attached to the node. Because Kubelet skipped adding the volume to the actual state of the world (ASOW) during reconstruction (Kubelet skipped the volume here), it now processes the volume and marks it as uncertain:
    I1205 22:39:17.493293    1085 reconciler.go:113] "Volume is marked as mounted in uncertain state and added to the actual state" pod="" podName=e1edc8bb-b4b3-4c2f-83dc-096f78119be0 volumeName=kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503
    
  6. Since the volume is uncertain in the ASOW and it appears in the DSOW, Kubelet tries to mount the block volume.
    I1205 22:39:10.080664    1085 reconciler_common.go:217] "Starting operationExecutor.MountVolume for volume \"kube-api-access-tqwlj\" (UniqueName: \"kubernetes.io/projected/e880c824-0916-4293-a663-09fe4e4ed7a3-kube-api-access-tqwlj\") pod \"longhorn-ui-59f4f6dc7c-t7mfs\" (UID: \"e880c824-0916-4293-a663-09fe4e4ed7a3\") " pod="longhorn-system/longhorn-ui-59f4f6dc7c-t7mfs"
    
  7. The first operation in the mounting process is SetUpDevice(). This operation prepares the file/directory for the stagingPath and calls the Longhorn CSI plugin's NodeStageVolume(): link. However, at this moment the Longhorn CSI plugin has not yet re-registered with the newly restarted Kubelet. Therefore SetUpDevice() fails.
    E1205 22:39:17.508714    1085 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503 podName: nodeName:}" failed. No retries permitted until 2023-12-05 22:39:18.008686852 +0000 UTC m=+27.122034019 (durationBeforeRetry 500ms). Error: MapVolume.SetUpDevice failed for volume "pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503") pod "block-volume-test" (UID: "e1edc8bb-b4b3-4c2f-83dc-096f78119be0") : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers
    
  8. The problem is that Kubelet clears the deviceMountPath (which is also the globalMapPath) when blockMapper.SetUpDevice fails, because the volume device is currently in the DeviceMountUncertain state set in the step above: link
  9. Next, Kubelet gets the update from the API server that the pod is no longer needed because it is terminating:
    I1205 22:39:44.553021    1085 desired_state_of_world_populator.go:270] "Removing volume from desired state" pod="default/block-volume-test" podUID=e1edc8bb-b4b3-4c2f-83dc-096f78119be0 volumeName="pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503"
    
  10. Kubelet starts the unmount flow for the volume: link
    I1205 22:39:44.625585    1085 reconciler_common.go:162] "Starting operationExecutor.UnmountVolume for volume \"pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503\") pod \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\" (UID: \"e1edc8bb-b4b3-4c2f-83dc-096f78119be0\") "
    
  11. Part of the unmount flow is UnmapBlockVolume() (link). This function tries to unbind/unlink the symlink at globalMapPath (which is the deviceMountPath that was cleared in SetUpDevice() above). This call will always fail because globalMapPath is now empty (link)
    E1205 22:39:44.626078    1085 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503 podName:e1edc8bb-b4b3-4c2f-83dc-096f78119be0 nodeName:}" failed. No retries permitted until 2023-12-05 22:39:45.12602954 +0000 UTC m=+54.239376707 (durationBeforeRetry 500ms). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503") pod "e1edc8bb-b4b3-4c2f-83dc-096f78119be0" (UID: "e1edc8bb-b4b3-4c2f-83dc-096f78119be0") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: e1edc8bb-b4b3-4c2f-83dc-096f78119be0, bindMount: true: failed to unmap device from map path. mapPath is empty
    
  12. This error is repeated over and over (the on-disk paths involved can be inspected with the shell sketch after this list)
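
To follow the analysis on the node itself, the on-disk artifacts referenced above can be inspected directly. A sketch using the PV name and pod UID from the logs (paths are the kubelet defaults on this setup):

PV=pvc-0efd77fa-d7b8-4896-8b5e-6f44b95a0503
POD=e1edc8bb-b4b3-4c2f-83dc-096f78119be0

# vol_data.json loaded by plugin.ConstructBlockVolumeSpec (step 3)
cat /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/$PV/data/vol_data.json

# per-volume directory on the global/staging side
ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/$PV/

# per-pod symlink that UnmapBlockVolume tries to remove (steps 10-11)
ls -l /var/lib/kubelet/pods/$POD/volumeDevices/kubernetes.io~csi/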

Proposal

This issue looks like a race bug in Kubelet to me. Between steps 8 and 9, if the Longhorn CSI plugin re-registers quickly enough, Kubelet successfully reruns SetUpDevice() and sets a non-empty value for the deviceMountPath (which is also the globalMapPath of the volume), and the subsequent UnmapBlockVolume() call succeeds.

To avoid this race condition, I agree with @ZhangShuaiyi that we should not mark the device as unmounted (and thus not clean up the deviceMountPath) if the volume was previously reconstructed: PR link

Changed the priority to 1, because this is related to an unresolved issue upstream.

Hi @PhanLe1010 @ZhangShuaiyi, I tried the same reproduce steps on Kubernetes v1.26.0. It did show the error messages several times, but the pod got deleted successfully in the end. Can you help to verify that? Thanks!

@ZhangShuaiyi Do you mean this PR kubernetes/kubernetes#121443 ?

That is actually very promising. I am currently busy on a different task, but I will do some more reading in the next few days to understand the whole flow.

@PhanLe1010 Yes. Looking forward to your feedback.

In a Harvester recreate I did NOT hit @PhanLe1010’s log of:

E1019 00:42:39.834538     967 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 podName: nodeName:}" failed. No retries permitted until 2023-10-19 00:43:11.834505392 +0000 UTC m=+75.317052505 (durationBeforeRetry 32s). Error: MapVolume.SetUpDevice failed for volume "pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5") pod "block-volume-test" (UID: "19e41dfe-8cee-40c2-a39b-37c68b01c9a7") : rpc error: code = InvalidArgument desc = volume pvc-37461bdd-7133-4111-8fb7-c4ddf8dacca5 hasn't been attached yet

I DID hit multiple logs similar to @ZhangShuaiyi's:

E1024 19:47:21.910461    2517 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d podName: nodeName:}" failed. No retries permitted until 2023-10-24 19:47:25.910437894 +0000 UTC m=+32.546967674 (durationBeforeRetry 4s). Error: MapVolume.SetUpDevice failed for volume "pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d") pod "virt-launcher-worker1-xd2p4" (UID: "54666d5c-508e-4d2e-b9e1-cb9103982ce6") : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers

This was followed by many of the expected:

I1024 19:47:25.540050    2517 reconciler.go:211] "operationExecutor.UnmountVolume started for volume \"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\" (UniqueName: \"kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\") pod \"54666d5c-508e-4d2e-b9e1-cb9103982ce6\" (UID: \"54666d5c-508e-4d2e-b9e1-cb9103982ce6\") "
E1024 19:47:25.540244    2517 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d podName:54666d5c-508e-4d2e-b9e1-cb9103982ce6 nodeName:}" failed. No retries permitted until 2023-10-24 19:47:26.040211653 +0000 UTC m=+32.676741431 (durationBeforeRetry 500ms). Error: UnmapVolume.UnmapBlockVolume failed for volume "pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d") pod "54666d5c-508e-4d2e-b9e1-cb9103982ce6" (UID: "54666d5c-508e-4d2e-b9e1-cb9103982ce6") : blkUtil.DetachFileDevice failed. globalUnmapPath:, podUID: 54666d5c-508e-4d2e-b9e1-cb9103982ce6, bindMount: true: failed to unmap device from map path. mapPath is empty

While investigating a similar scenario in https://github.com/longhorn/longhorn/issues/6961, I wondered why Longhorn never recovered, even though the pod consuming it on the rebooted node was (presumably --force) terminated. An orphaned volume attachment remained, and Longhorn was never instructed to detach the volume from the old node.

The attach/detach controller has a 6 minute timeout. If it thinks a volume should be detached but kubelet on a node is still reporting it as attached and/or in use (kubectl get node -oyaml), it will wait 6 minutes and then just force detach it (call ControllerUnpublishVolume). HOWEVER, this timeout is only effective as long as the node in question is unhealthy.

So, in the reboot scenario, if kubelet comes back quickly enough and hits this GitHub issue, the volume is never force detached and is stuck (it can never be attached to a different node, and, if it is a migratable volume, it gets into a weird migration situation).

I think if the node never comes back, the volume will be force detached and the associated volume attachment will go away. This at least allows the Longhorn volume to be used by another node.

Relevant code:

https://github.com/kubernetes/kubernetes/blob/a1a87a0a2bcd605820920c6b0e618a8ab7d117d4/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L214C1-L235C5

Confirming logs (with kube-controller-manager logs turned up to 5):

I1025 15:52:45.662316       1 reconciler.go:233] "Cannot detach volume because it is still mounted" volume={AttachedVolume:{VolumeName:kubernetes.io/csi/driver.longhorn.io^pvc-476a4873-d166-4e6e-8bac-ccf01083e14d VolumeSpec:0xc00602f998 NodeName:harvester-pnws6 PluginIsAttachable:true DevicePath: DeviceMountPath: PluginName: SELinuxMountContext:} MountedByNode:true DetachRequestedTime:2023-10-25 15:52:29.477725428 +0000 UTC m=+32.572903966}

Deleting the volume attachment manually unsticks the volume and allows it to be used elsewhere. However, the kubelet on the old node continues to report it as attached and in use and continues to log these errors.
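
For reference, a sketch of how to inspect and clear the stuck attachment described above (the node and VolumeAttachment names are placeholders):

# What the node still reports as attached and in use
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'

# Find the VolumeAttachment object for the stuck PV
kubectl get volumeattachments | grep pvc-476a4873-d166-4e6e-8bac-ccf01083e14d

# Deleting it manually unsticks the Longhorn volume, though the kubelet errors on the old node persist
kubectl delete volumeattachment <volumeattachment-name>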

@PhanLe1010 Yes, I am experiencing the same problem. I think there is no need to call actualStateOfWorld.MarkDeviceAsUnmounted in markDeviceErrorState if the volume was previously reconstructed. I opened a PR to do this.

@PhanLe1010
Does kubelet have the error Error: MapVolume.SetUpDevice failed for volume ... : kubernetes.io/csi: blockMapper.SetUpDevice failed to get CSI client: driver name ... not found in the list of registered CSI drivers? In my environment, csi.(*csiBlockMapper).SetUpDevice can succeed sometimes.