longhorn: [BUG] Memory leak in CSI plugin caused by stuck umount processes if the RWX volume is already gone

Describe the bug (πŸ› if you encounter this issue)

The memory usage of the CSI plugins has been growing steadily for the past few weeks after upgrading to version 1.4.0 (see the attached screenshot).

To Reproduce

No specific steps. It has been like this since upgrading to Longhorn 1.4.0.

Expected behavior

The CSI plugin memory usage should stay relatively small without growing steadily.

Log or Support bundle

supportbundle_413f9f8b-328c-47d2-a25d-4f7c903bf06f_2023-01-16T08-27-12Z.zip

Environment

  • Longhorn version: 1.4.0
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.4+k3s1
  • Number of management node in the cluster: 1
  • Number of worker node in the cluster: 4
  • Node config
    • OS type and version: Raspbian (5.15.76-v8) and Ubuntu server (22.04.1)
    • CPU per node: 2-4
    • Memory per node: 4-8G
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 1G
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 15

Workaround

Force-unmount the stuck mount point with umount -f.
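
A hedged sketch of the workaround on an affected node; POD_UID and PVC_NAME are placeholders you fill in from the stuck umount's command line:

# Find the stuck umount processes (state D) and note the path they reference.
ps aux --sort -rss | grep '[u]mount'
# Force-unmount that path.
umount -f /var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~csi/$PVC_NAME/mount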

Additional context

Initially discussed as part of another issue https://github.com/longhorn/longhorn/issues/5195#issuecomment-1383662401

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 35 (22 by maintainers)

Most upvoted comments

I am able to reproduce the issue by:

  1. Deploy Longhorn 1.4.0 and install the prerequisite NFS client
  2. Create a deployment that uses an RWX volume. Wait for the deployment to become running
  3. Delete the volume using the Longhorn UI
  4. Scale down the deployment (see the command sketch after these steps)
  5. Observe that:
    1. The pod is stuck in Terminating.
    2. The Longhorn CSI plugin repeatedly logs:
      [longhorn-csi-plugin-pqmz9 longhorn-csi-plugin] time="2023-01-19T22:06:33Z" level=debug msg="trying to unmount potential mount point /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount"
      
    3. Many umount commands are stuck in longhorn-csi-plugin:
      longhorn-csi-plugin-pqmz9:/ # ps aux --sort -rss | grep umount 
      root       31502  0.0  0.0  16260  1268 ?        D    22:06   0:00 umount /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount
      root       32954  0.0  0.0  16260  1240 ?        D    22:08   0:00 umount /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount
      root       28483  0.0  0.0  16260  1208 ?        D    22:02   0:00 umount /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount
      root       29991  0.0  0.0  16260  1204 ?        D    22:04   0:00 umount /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount
      root       26874  0.0  0.0  16260  1180 ?        D    22:00   0:00 umount /var/lib/kubelet/pods/be80e2cc-b074-4223-b1c9-9d49c3ff416f/volumes/kubernetes.io~csi/pvc-be9dcbac-38fc-4754-8c12-4fb3f2572aa6/mount
      

cc @derekbit
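
A possible kubectl sketch of steps 4-5, assuming a deployment named rwx-test in the default namespace, Longhorn installed in longhorn-system, and the default app=longhorn-csi-plugin label on the CSI plugin pods:

# Step 4: scale the workload down after the volume has been deleted from the UI.
kubectl scale deployment rwx-test --replicas=0
# Step 5: the pod stays in Terminating, and the CSI plugin keeps retrying the unmount.
kubectl get pods -w
kubectl -n longhorn-system logs -l app=longhorn-csi-plugin -c longhorn-csi-plugin --tail=50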

Besides figuring out whether it can be reproduced, we need to use force umount as a general fallback when the graceful umount is not working. WDYT, or any concerns? @derekbit @joshimoo @PhanLe1010

Yeah, we should follow this pattern.

mount-utils.UnmountWithForce attempts a graceful unmount first; if that fails, it falls back to a force unmount.
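
A rough shell-level illustration of that graceful-then-force pattern (POD_UID and PVC_NAME are placeholders; the actual fix would use the Go mount-utils API inside the CSI plugin, not a shell script):

# Replace POD_UID and PVC_NAME with the values from the stuck mount path.
MNT=/var/lib/kubelet/pods/$POD_UID/volumes/kubernetes.io~csi/$PVC_NAME/mount
umount "$MNT" &              # graceful attempt in the background
pid=$!
sleep 30                     # give the graceful unmount time to finish
if kill -0 "$pid" 2>/dev/null; then
  umount -f "$MNT"           # still stuck after the timeout: force it
fi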

Some notes below:

@lazar-boradzhiev Got it. Can you try to unmount it with umount -f?

The main culprit seems to be the k3s server. BTW, a weird thing is that tons of umount processes (~4000) are stuck on this node.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         649 30.2 18.6 3162676 1494028 ?     Ssl  Jan03 5529:41 /usr/local/bin/k3s server
root     4183554  0.2  4.7 1369312 378676 ?      Ssl  Jan05  40:40 longhorn-manager -d csi --nodeid=amaterasu --endpoint=unix:///csi/csi.sock --drivername=driver.longhorn.io --manager-url=http://longhorn-backend:9500/v1
root     3964912  0.2  2.8 549476 226720 ?       Ssl  Jan05  38:51 python3 -m homeassistant --config /config
root     3945584  1.9  1.5 828328 125200 ?       Ssl  Jan05 318:13 longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:v1.4.0 --instance-manager-image longhornio/longhorn-instance-manager:v1.4.0 --share-manager-image longhornio/longhorn-share-manager:v1.4.0 --backing-image-manager-image longhornio/backing-image-manager:v1.4.0 --support-bundle-manager-image longhornio/support-bundle-kit:v0.0.17 --manager-image longhornio/longhorn-manager:v1.4.0 --service-account longhorn-service-account
root         784  5.0  1.3 873752 109688 ?       Sl   Jan03 916:12 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
2000     3945786  0.2  1.3 825128 108348 ?       Ssl  Jan05  33:34 longhorn-manager admission-webhook --service-account longhorn-service-account
1000        5309  0.3  1.3 878068 104168 ?       Ssl  Jan03  56:13 /app/cmd/cainjector/cainjector --v=2 --leader-election-namespace=kube-system
...
root     1238506  0.0  0.0   3992  1000 ?        D    08:55   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root     1311671  0.0  0.0   3992  1000 ?        D    09:47   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      184201  0.0  0.0   3992   996 ?        D    Jan15   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      201152  0.0  0.0   3992   996 ?        D    Jan15   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      212443  0.0  0.0   3992   996 ?        D    Jan15   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      330725  0.0  0.0   3992   996 ?        D    Jan15   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      353236  0.0  0.0   3992   996 ?        D    Jan15   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      578426  0.0  0.0   3992   996 ?        D    01:03   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
root      617795  0.0  0.0   3992   996 ?        D    01:31   0:00 umount /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
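
A hedged way to count stuck umount processes (uninterruptible D state) like those listed above:

# Print only processes whose state is D and whose command is umount, then count them.
ps -eo state,comm | awk '$1 == "D" && $2 == "umount"' | wc -l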

@lazar-boradzhiev We would probably need your cooperation when investigating this issue. Thanks in advance.

@roger-ryao Did you see CPU usage on the Longhorn nodes go high when you were validating the issue and the unmount processes were stuck?

We have a fix for the stuck-unmount issue. It happens because RWX volumes are mounted in hard mode starting from v1.4.0, so if the share manager is somehow deleted before the volume is detached, the graceful unmount won't complete. Until the release, you can try the workaround.

On Sat, Jan 21, 2023 at 3:11 AM Rob Landers wrote:

Also running into this πŸ˜‚

β€” Reply to this email directly, view it on GitHub https://github.com/longhorn/longhorn/issues/5296#issuecomment-1398821510, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7SNAI7BZ6MTBDXJVGQGVTWTLPO5ANCNFSM6AAAAAAT4RU6CY . You are receiving this because you were mentioned.Message ID: @.***>

Besides figuring out whether it can be reproduced, we need to use force umount as a general fallback when the graceful umount is not working. WDYT, or any concerns? @derekbit @joshimoo @PhanLe1010

@derekbit Thanks for the help! I initially suspected that the problem might be caused by improperly migrating the k3s node to a newer version as described in another issue #5195 (comment). Basically, I didn’t scale down and detach all the workloads before the migration.

I see. I cannot reproduce it on my side. @PhanLe1010 Have you succeeded in reproducing it?

For resilience, we can force-unmount the mountpoint in the CSI plugin, as @joshimoo mentioned in https://github.com/longhorn/longhorn/issues/5296#issuecomment-1386347221.

It executed successfully! πŸš€

Logs from container:

time="2023-01-18T08:46:21Z" level=info msg="no need for unmount not a mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"
time="2023-01-18T08:46:21Z" level=info msg="cleaned up mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"
W0118 08:46:21.784060  554181 mount_helper_common.go:34] Warning: Unmount skipped because path does not exist: /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
time="2023-01-18T08:46:21Z" level=info msg="NodeUnpublishVolume: volume pvc-c2c40895-326b-4198-ac70-27f92b6ea62d unmounted from path /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"
time="2023-01-18T08:46:21Z" level=info msg="NodeUnpublishVolume: rsp: {}"

Memory usage is back to a healthy level (see the attached screenshot).

Do you remember whether pvc-c2c40895-326b-4198-ac70-27f92b6ea62d is a ReadWriteMany or a ReadWriteOnce volume? @lazar-boradzhiev

@PhanLe1010 The volume is a ReadWriteMany volume. I saw longhorn-csi-plugin trying to clean up the mountpoint repeatedly; however, the volume and the share-manager are already gone.

2023-01-16T08:27:01.793205260Z time="2023-01-16T08:27:01Z" level=info msg="trying to cleanup mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"
2023-01-16T08:27:01.793287017Z time="2023-01-16T08:27:01Z" level=debug msg="trying to unmount potential mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"

@lazar-boradzhiev Can you check whether /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount is still mounted on your Longhorn node? Can you try to force-unmount the mountpoint with umount -f?
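
A hedged sketch of that check on the node, using the path from the logs above:

# Check whether the target is still a mount point (findmnt prints nothing if it is not):
findmnt /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount
# Force-unmount it:
umount -f /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount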

Tons of the log lines below appear in the longhorn-csi-plugin:

2023-01-16T08:26:58.785279117Z time="2023-01-16T08:26:58Z" level=info msg="NodeUnpublishVolume: req: {\"target_path\":\"/var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount\",\"volume_id\":\"pvc-c2c40895-326b-4198-ac70-27f92b6ea62d\"}"
2023-01-16T08:27:01.793205260Z time="2023-01-16T08:27:01Z" level=info msg="trying to cleanup mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"
2023-01-16T08:27:01.793287017Z time="2023-01-16T08:27:01Z" level=debug msg="trying to unmount potential mount point /var/lib/kubelet/pods/e1cf9e16-b3eb-455c-ae64-887200d6f267/volumes/kubernetes.io~csi/pvc-c2c40895-326b-4198-ac70-27f92b6ea62d/mount"

The weird thing is that no error is reported from the unmount, and the unmount interface Kubernetes provides does not include a timeout mechanism.

Besides, Kubernetes still considers this volume to be in use, but the related Longhorn volume object is actually gone:

- apiVersion: v1
  kind: Node
  name: amaterasu
  resourceVersion: "43439969"
  uid: 704f5e1f-4707-40d0-95b0-3961a8e0bd4c
  status:
    volumesAttached:
    - devicePath: "null"
      name: kubernetes.io/csi/driver.longhorn.io^pvc-c2c40895-326b-4198-ac70-27f92b6ea62d

We can probably get something from the kubelet log.
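
A hedged way to confirm the mismatch shown in the YAML above (node name, PVC name, and the longhorn-system namespace are taken from this issue; adjust as needed):

# Kubernetes side: the node still reports the CSI volume as attached.
kubectl get node amaterasu -o jsonpath='{.status.volumesAttached}'
# Longhorn side: this should return NotFound, since the volume object is already gone.
kubectl -n longhorn-system get volumes.longhorn.io pvc-c2c40895-326b-4198-ac70-27f92b6ea62d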

@lazar-boradzhiev Can you provide the dmesg -T output of each Longhorn node and the kubelet logs for further investigation? Thank you.

  • dmesg_amaterasu.txt - master
  • dmesg_hermes.txt - storage
  • dmesg_horreum.txt - storage

@derekbit The kubelet logs are quite big: around 250 MB in total for the last week, which is more than GitHub allows. I could spend some time later to filter them down a bit or upload them elsewhere.

@lazar-boradzhiev Can you provide the dmesg -T output of each Longhorn node and the kubelet logs for further investigation? Thank you.

@lazar-boradzhiev Can you log into the longhorn-csi-plugin container of the longhorn-csi-plugin pod and check which process is using the most memory with ps aux --sort -rss?
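
For example, a hedged sketch using the pod name seen earlier in this issue (replace the pod name with your own, and adjust the namespace if Longhorn is not installed in longhorn-system):

# Show processes in the CSI plugin container, sorted by resident memory.
kubectl -n longhorn-system exec longhorn-csi-plugin-pqmz9 -c longhorn-csi-plugin -- ps aux --sort -rss
# Optionally, count the stuck umount processes inside the container.
kubectl -n longhorn-system exec longhorn-csi-plugin-pqmz9 -c longhorn-csi-plugin -- sh -c "ps aux | grep '[u]mount' | wc -l"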