longhorn: [BUG] instance manager seems to screw up worker nodes after running for 2-3 weeks

Describe the bug

After running Longhorn for a while, some worker nodes end up in this situation:


top - 05:37:39 up 4 days,  3:04,  0 users,  load average: 38.95, 50.38, 52.34
Tasks: 650 total,  13 running, 490 sleeping,   0 stopped, 147 zombie
%Cpu(s):  6.4 us, 41.1 sy,  0.0 ni, 52.2 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  72262.8 total,   9565.5 free,  21468.9 used,  41228.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  52941.8 avail Mem 

  PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND                                                                                                             
24355 root      20   0  712848  10216  65.6   0.0   7:29.09 R runc:[2:INIT]                                                                                                       
 3683 root      20   0  711788   7492  65.0   0.0   2:09.15 R runc:[2:INIT]                                                                                                       
14654 root      20   0  713124  13292  59.2   0.0   9:48.52 R runc:[2:INIT]                                                                                                       
24214 root      20   0  713380  13296  47.8   0.0   5:59.54 R runc:[2:INIT]                                                                                                       
 1829 root      20   0  713008  13248  44.6   0.0   2:22.86 R runc:[2:INIT]                                                                                                       
18110 root      20   0 1089956 387796  25.5   0.5  56:09.36 S longhorn-manage                                                                                                     
14208 root      20   0 1307824 569144  21.7   0.8   2404:49 S k3s-agent                                                                                                           
 8130 root      20   0   16.0g   3.9g  20.4   5.5   1533:34 S qemu-system-x86                                                                                                     
 8107 root      20   0  712088   7904   5.7   0.0   0:01.52 D runc                                                                                                                
 6953 root      20   0  711832   8700   5.1   0.0   0:02.07 D runc                                                                                                                
 8546 root      20   0  712088   5972   4.5   0.0   0:00.58 D runc                                                                                                                
18496 root      20   0 2186812   1.4g   1.9   2.0 120:13.96 S longhorn-manage                                                                                                     
 5772 root      20   0  953456 291676   1.9   0.4 140:57.24 S longhorn-manage                                                                                                     
 8651 root      20   0  711832   5900   1.9   0.0   0:00.53 D runc                                                                                                                
 6863 root      20   0  712688   9164   1.3   0.0   0:02.79 R runc                                                                                                                
 5347 root      20   0  713008  10628   1.3   0.0   0:06.86 R runc                                                                                                                
 8865 rancher   20   0    2756   1928   1.3   0.0   0:00.05 R top                                                                                                                 
14243 root      20   0 1004196 274540   1.3   0.4 259:41.84 S containerd    

This is the top output of the node. I originally reported this at https://github.com/k3s-io/k3s/issues/2980, but now I suspect it is caused by Longhorn, because if I kill the longhorn-manager processes on the node, the problem disappears.

To Reproduce

Not sure how to reproduce it; probably just let Longhorn run under high load for some days. However, once it happens, it stays. Force-killing the longhorn-manager process helps (see the sketch below).
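
For reference, this is roughly how I recover an affected node. This is just a sketch assuming the default longhorn-system namespace; the pod name and node are placeholders, adjust to your setup:

# find the longhorn-manager pod running on the affected node
kubectl -n longhorn-system get pods -o wide | grep longhorn-manager

# force-delete it so the DaemonSet recreates it
kubectl -n longhorn-system delete pod <longhorn-manager-pod> --force --grace-period=0

# or, directly on the node, kill the process itself
pkill -9 -f longhorn-manager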

Expected behavior

Longhorn should keep running stably over time.

Log


Environment:

  • Longhorn version:
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 27 (5 by maintainers)

Most upvoted comments

@cclhsu I found a similar issue with an rshared mount in this other cluster, so I think we are the ones to blame. Thanks for your time! I will close this for now.
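
For anyone hitting the same thing, this is roughly how I checked the mount propagation on the nodes. A sketch only; /var/lib/longhorn is the default Longhorn data path, adjust to your setup:

# check the propagation mode of the root filesystem and the Longhorn data path
findmnt -o TARGET,PROPAGATION /
findmnt -o TARGET,PROPAGATION /var/lib/longhorn

# if a mount reports "private" instead of "shared", restore shared propagation
mount --make-rshared /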

This is one k8s host containing 1 running engine and 1 running replica:

root     16423   678  0 Mar05 ?        00:01:47  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/5e85cd873e586a2b210c7d67e926bccaf59fb6fc3
root     16449 16423  0 Mar05 ?        00:00:02  |   \_ /tini -- engine-manager --debug daemon --listen 0.0.0.0:8500
root     16478 16449  0 Mar05 ?        00:00:35  |       \_ longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root     16481 16478  0 Mar05 ?        00:00:14  |           \_ tgtd -f
root     16482 16478  0 Mar05 ?        00:00:00  |           \_ tee /var/log/tgtd.log
root     15683 16478  0 04:45 ?        00:00:00  |           \_ /engine-binaries/longhornio-longhorn-engine-master/longhorn controller vol --frontend tgt-blockdev --replica tcp://10.42.1.151
root     16517   678  0 Mar05 ?        00:01:47  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/35bc8be473cae7f922c105d74a1803cc496852021
root     16541 16517  0 Mar05 ?        00:00:03  |   \_ /tini -- longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root     16565 16541  0 Mar05 ?        00:00:30  |       \_ longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root     15570 16565  0 04:45 ?        00:00:00  |           \_ /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn replica /host/var/lib/longhorn/replicas/vol-
root     15575 15570  0 04:45 ?        00:00:00  |               \_ /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn sync-agent --listen 0.0.0.0:10002 --repl
root     28401   678  0 Mar05 ?        00:00:03  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8779fa3d91947f707cb372a7f625602a0b43a3b81
root     28465 28401  0 Mar05 ?        00:00:00  |   \_ /pause
root     28605   678  0 Mar05 ?        00:00:03  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/3cec1bac177e728d5bc8d762747bd6f7f506a2628
root     28639 28605  0 Mar05 ?        00:08:31      \_ longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:master --instance-manager-image longhornio/longhorn-instance-mana

Typically, the CPU is consumed by the engine or replica processes during data reads/writes. The processes look like this: (replica)

root     15570 16565  0 04:45 ?        00:00:00  |           \_ /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn replica /host/var/lib/longhorn/replicas/vol-
root     15575 15570  0 04:45 ?        00:00:00  |               \_ /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master/longhorn sync-agent --listen 0.0.0.0:10002 --repl

(engine)

root     15683 16478  0 04:45 ?        00:00:00  |           \_ /engine-binaries/longhornio-longhorn-engine-master/longhorn controller vol --frontend tgt-blockdev --replica tcp://10.42.1.151

It seems that the runc processes in your cluster are not engine/replica processes. Not sure if they are caused by the liveness probe.
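
One way to check that (a sketch; the pod name and PID are placeholders) is to look at the liveness probe configured on the instance-manager pods and at what spawned one of the stuck runc processes on the node:

# show the liveness probe configured on an instance-manager pod
kubectl -n longhorn-system get pod <instance-manager-pod> \
  -o jsonpath='{.spec.containers[*].livenessProbe}'

# on the node: what started this runc, and which cgroup/container does it belong to?
ps -o ppid=,cmd= -p <runc-pid>
cat /proc/<runc-pid>/cgroup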