longhorn: [BUG] Block device volume failed to unmount when it is detached unexpectedly

Describe the bug

A Longhorn (LH) block device volume fails to unmount when it is detached unexpectedly (e.g., during a Kubernetes upgrade, a Docker restart, or a network disconnect), and therefore the pod gets stuck in the Terminating state.

To Reproduce

Steps to reproduce the behavior:

  1. Create a single-pod deployment and mount a PV/PVC with block device mode (`volumeMode: Block`).
  2. Mimic an unexpected-detach scenario, e.g., kill the Longhorn instance manager on the node the pod is currently running on.
  3. The pod gets terminated but is stuck in the Terminating state:
virt-launcher-lh-test-7gww9                          0/1     Terminating   0          3h28m
  4. Check the Longhorn manager and the other Longhorn pods; there are no obvious errors.
  5. Check the kubelet logs (e.g., pod startTime: "2022-03-23T08:11:04Z", volume remountRequestedAt: "2022-03-23T09:37:10Z"): the volume is busy and fails to unmount.
E0323 09:36:59.035738    2734 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a751-4ee4-ae9b-0ba3
4885be77 podName:dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4 nodeName:}" failed. No retries permitted until 2022-03-23 09:37:03.03570295 +0000 UTC m=+534617.23041224
8 (durationBeforeRetry 4s). Error: UnmapVolume.UnmapBlockVolume failed for volume "disk-0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a75
1-4ee4-ae9b-0ba34885be77") pod "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4" (UID: "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4") : blkUtil.DetachFileDevice failed. globalU
nmapPath:/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev, podUID: dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4,
bindMount: true: failed to unmount linkPath /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01d
d-4a7b-afeb-57c8c1aed9e4: unmount failed: exit status 32
Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9
e4
Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4: ta
rget is busy.
E0323 09:37:00.155861    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
E0323 09:37:02.157018    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
I0323 09:37:03.058180    2734 reconciler.go:194] "operationExecutor.UnmountVolume started for volume \"disk-0\" (UniqueName: \"kubernetes.io/csi/driver.longho
rn.io^pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77\") pod \"dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4\" (UID: \"dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4\") "
W0323 09:37:03.058472    2734 volume_path_handler_linux.go:61] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device p
ath: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4"
W0323 09:37:03.058494    2734 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/dd9cedeb-01
dd-4a7b-afeb-57c8c1aed9e4/volumeDevices/kubernetes.io~csi/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77
E0323 09:37:03.074826    2734 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a751-4ee4-ae9b-0ba3
4885be77 podName:dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4 nodeName:}" failed. No retries permitted until 2022-03-23 09:37:11.074796266 +0000 UTC m=+534625.2695055
51 (durationBeforeRetry 8s). Error: UnmapVolume.UnmapBlockVolume failed for volume "disk-0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a7
51-4ee4-ae9b-0ba34885be77") pod "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4" (UID: "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4") : blkUtil.DetachFileDevice failed. global
UnmapPath:/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev, podUID: dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4,
 bindMount: true: failed to unmount linkPath /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01
dd-4a7b-afeb-57c8c1aed9e4: unmount failed: exit status 32
Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9
e4
Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4: ta
rget is busy.
E0323 09:37:04.159256    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
I0323 09:37:05.291550    2734 scope.go:110] "RemoveContainer" containerID="355e61c379199d6e41e7cb5361edaf6a683e39f299b4aa5536b9131a13788c9f"
E0323 09:37:06.160342    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
E0323 09:37:08.154310    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
E0323 09:37:10.160535    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
I0323 09:37:11.150165    2734 reconciler.go:194] "operationExecutor.UnmountVolume started for volume \"disk-0\" (UniqueName: \"kubernetes.io/csi/driver.longho
rn.io^pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77\") pod \"dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4\" (UID: \"dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4\") "
W0323 09:37:11.150620    2734 volume_path_handler_linux.go:61] couldn't find loopback device which takes file descriptor lock. Skip detaching device. device p
ath: "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4"
W0323 09:37:11.150651    2734 volume_path_handler.go:217] Warning: Unmap skipped because symlink does not exist on the path: /var/lib/kubelet/pods/dd9cedeb-01
dd-4a7b-afeb-57c8c1aed9e4/volumeDevices/kubernetes.io~csi/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77
E0323 09:37:11.167357    2734 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a751-4ee4-ae9b-0ba3
4885be77 podName:dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4 nodeName:}" failed. No retries permitted until 2022-03-23 09:37:27.167332992 +0000 UTC m=+534641.3620422
75 (durationBeforeRetry 16s). Error: UnmapVolume.UnmapBlockVolume failed for volume "disk-0" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-b5a67cc8-a
751-4ee4-ae9b-0ba34885be77") pod "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4" (UID: "dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4") : blkUtil.DetachFileDevice failed. globa
lUnmapPath:/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev, podUID: dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4
, bindMount: true: failed to unmount linkPath /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-0
1dd-4a7b-afeb-57c8c1aed9e4: unmount failed: exit status 32
Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9
e4
Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/dd9cedeb-01dd-4a7b-afeb-57c8c1aed9e4: ta
rget is busy.
E0323 09:37:12.157239    2734 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"4dbd88a6-c14b-41b5-
a65b-233c4e0b8f87\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=2
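For reference, step 1 of the reproduction can be set up with a manifest along these lines. This is a hypothetical sketch, not taken from the report: the names (`lh-block-pvc`, `lh-block-test`), the `longhorn` storage class, the size, and the device path are all assumptions to adjust for your cluster.

```shell
# Hypothetical repro setup: a PVC with volumeMode: Block and a pod that
# attaches it as a raw block device (volumeDevices/devicePath instead of
# volumeMounts/mountPath). All names and sizes here are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lh-block-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block            # raw block device, not a filesystem
  storageClassName: longhorn   # assumed Longhorn storage class name
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: lh-block-test
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeDevices:           # block mode uses devicePath, not mountPath
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: lh-block-pvc
EOF
```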

Expected behavior

The volume should unmount successfully and the pod should terminate cleanly.

Log or Support bundle

support.zip

Environment

  • Longhorn version: v1.2.4
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.22.7+rke2r1
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version: Harvester v1.0.1

Additional context

umount: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-b5a67cc8-a751-4ee4-ae9b-0ba34885be77/dev/ab369da5-c126-4600-a0b0-2311088296ea: target is busy.

A manual `umount -l` (lazy unmount) on the path works.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Tested the upstream PR https://github.com/kubernetes/kubernetes/pull/109083 by building and deploying a new kubelet using rke2. The problem no longer exists.

The upstream fix for this one has been merged: https://github.com/kubernetes/kubernetes/issues/109132. (Although upstream ended up merging a different PR than the one we proposed.)

We can move this one to testing.

cc @bk201

The workflow of CSI block volume mounting: https://github.com/kubernetes/kubernetes/blob/b56e432f2191419647a6a13b9f5867801850f969/pkg/volume/csi/csi_block.go#L32-L42

When mapping the device to the global/podVolumeMap path, Kubernetes uses a function to create a loop device so that the block device is kept open and is protected from being silently removed.


Unmounting a block volume is the reverse of mounting it, so as the first step Kubernetes handles the loop device via the function Phan mentioned above.
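Conceptually, that first step boils down to scanning the loop devices for one whose backing file matches the volume's device path. The sketch below is a simplified shell stand-in for that lookup, not kubelet's actual code; the target path is the one from this report and will not exist on your machine, so the scan finds nothing.

```shell
# Simplified stand-in for the loop-device lookup: compare the volume's
# device path against every /sys/block/loop*/loop/backing_file entry.
# (Path taken from this report; hypothetical on any other machine.)
target='/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-417a59df-2c2b-44a1-9350-07b108bf96ec/dev/3447f8b4-b54a-49df-b677-ffab62fc68be'
found=""
for f in /sys/block/loop*/loop/backing_file; do
  [ -r "$f" ] || continue
  # Exact string comparison: this is what misses a backing file that the
  # kernel has rewritten as "<path> (deleted)".
  if [ "$(cat "$f")" = "$target" ]; then
    found="${f%/loop/backing_file}"
    break
  fi
done
echo "${found:-no matching loop device}"
```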

But that function fails at the very beginning, which leads to this bug. Here is the flow: get the loopback devices -> look for a match among the loop devices -> fail to find a match:

root@shuo-k8s-worker3:~# losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE                                                                                                                                            DIO LOG-SEC
/dev/loop1         0      0         1  1 /var/lib/snapd/snaps/lxd_21835.snap                                                                                                                    0     512
/dev/loop8         0      0         0  0 / (deleted)                                                                                                                                            0     512
/dev/loop6         0      0         0  0 / (deleted)                                                                                                                                            0     512
/dev/loop4         0      0         1  1 /var/lib/snapd/snaps/snapd_14978.snap                                                                                                                  0     512
/dev/loop2         0      0         1  1 /var/lib/snapd/snaps/core20_1376.snap                                                                                                                  0     512
/dev/loop0         0      0         1  1 /var/lib/snapd/snaps/core20_1361.snap                                                                                                                  0     512
/dev/loop9         0      0         0  0 /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-417a59df-2c2b-44a1-9350-07b108bf96ec/dev/3447f8b4-b54a-49df-b677-ffab62fc68be (deleted)   0     512
/dev/loop5         0      0         1  1 /var/lib/snapd/snaps/snapd_15177.snap                                                                                                                  0     512
/dev/loop3         0      0         1  1 /var/lib/snapd/snaps/lxd_22526.snap                                                                                                                    0     512

root@shuo-k8s-worker3:~# cat /sys/block/loop9/loop/backing_file
/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-417a59df-2c2b-44a1-9350-07b108bf96ec/dev/3447f8b4-b54a-49df-b677-ffab62fc68be (deleted)

As you can see, the record carries a ` (deleted)` suffix, so an exact match against the original backing-file path fails.
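The failure mode can be reproduced with a plain string comparison; the snippet below is a simplified illustration (not kubelet's actual code) using the path from the `losetup` output above. Once the underlying file is removed, the kernel reports the backing file with a ` (deleted)` suffix, so an exact comparison no longer matches, while a match that tolerates the suffix still succeeds.

```shell
# Why the lookup fails: the kernel appends " (deleted)" to the backing file
# of a loop device once the file is removed, breaking exact-path matching.
expected='/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-417a59df-2c2b-44a1-9350-07b108bf96ec/dev/3447f8b4-b54a-49df-b677-ffab62fc68be'
backing_file="$expected (deleted)"   # what /sys/block/loop9/loop/backing_file reports

# Exact match, as in the buggy lookup:
if [ "$backing_file" = "$expected" ]; then
  echo "exact: found loop device"
else
  echo "exact: no match"             # this branch is taken
fi

# Tolerant match that also accepts the kernel's " (deleted)" suffix:
case "$backing_file" in
  "$expected"|"$expected (deleted)")
    echo "tolerant: found loop device" ;;
  *)
    echo "tolerant: no match" ;;
esac
```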

Closing this ticket: validation passed, and the test skeleton is in review now.

Hi @PhanLe1010, I see this ticket has the require/auto-e2e-test label. Could you help add a test skeleton for this ticket? Then we can implement the test case following the skeleton.

Reopening this ticket for tracking, thank you.

Verified passing on longhorn master (longhorn 115edc) with the test steps.

I can reproduce the block volume unmount failure on k3s v1.26.4+k3s1 when the volume is detached unexpectedly. On another cluster with k3s v1.27.1+k3s1 and the same steps, the pod terminates/restarts successfully and the data stays intact.