longhorn: [BUG] instance manager seems to screw up worker nodes after running for 2-3 weeks
Describe the bug
After running Longhorn for a while, some worker nodes end up in this state:
top - 05:37:39 up 4 days, 3:04, 0 users, load average: 38.95, 50.38, 52.34
Tasks: 650 total, 13 running, 490 sleeping, 0 stopped, 147 zombie
%Cpu(s): 6.4 us, 41.1 sy, 0.0 ni, 52.2 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 72262.8 total, 9565.5 free, 21468.9 used, 41228.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 52941.8 avail Mem
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
24355 root 20 0 712848 10216 65.6 0.0 7:29.09 R runc:[2:INIT]
3683 root 20 0 711788 7492 65.0 0.0 2:09.15 R runc:[2:INIT]
14654 root 20 0 713124 13292 59.2 0.0 9:48.52 R runc:[2:INIT]
24214 root 20 0 713380 13296 47.8 0.0 5:59.54 R runc:[2:INIT]
1829 root 20 0 713008 13248 44.6 0.0 2:22.86 R runc:[2:INIT]
18110 root 20 0 1089956 387796 25.5 0.5 56:09.36 S longhorn-manage
14208 root 20 0 1307824 569144 21.7 0.8 2404:49 S k3s-agent
8130 root 20 0 16.0g 3.9g 20.4 5.5 1533:34 S qemu-system-x86
8107 root 20 0 712088 7904 5.7 0.0 0:01.52 D runc
6953 root 20 0 711832 8700 5.1 0.0 0:02.07 D runc
8546 root 20 0 712088 5972 4.5 0.0 0:00.58 D runc
18496 root 20 0 2186812 1.4g 1.9 2.0 120:13.96 S longhorn-manage
5772 root 20 0 953456 291676 1.9 0.4 140:57.24 S longhorn-manage
8651 root 20 0 711832 5900 1.9 0.0 0:00.53 D runc
6863 root 20 0 712688 9164 1.3 0.0 0:02.79 R runc
5347 root 20 0 713008 10628 1.3 0.0 0:06.86 R runc
8865 rancher 20 0 2756 1928 1.3 0.0 0:00.05 R top
14243 root 20 0 1004196 274540 1.3 0.4 259:41.84 S containerd
This is the top output of the affected node. I originally reported this at https://github.com/k3s-io/k3s/issues/2980, but I now suspect it is caused by Longhorn, because if I kill the longhorn-manager processes on the node, the problem disappears.
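The symptom can be quantified before killing anything. Below is a minimal diagnostic sketch, assuming a Linux host with /proc and Python 3 available; it counts zombie processes and lists runc-named processes such as the runc:[2:INIT] entries in the top output above. Nothing in it is Longhorn-specific.

```python
# Minimal sketch (assumes Linux /proc and Python 3): count zombie processes and
# list long-lived "runc:[2:INIT]"-style processes like those in the top output.
import os

def proc_stat(pid):
    """Return (comm, state) for a PID from /proc/<pid>/stat, or None if it vanished."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # comm is wrapped in parentheses and may contain spaces, so split around it.
    lpar, rpar = data.index("("), data.rindex(")")
    comm = data[lpar + 1:rpar]
    state = data[rpar + 2:].split()[0]
    return comm, state

zombies, runc_procs = 0, []
for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    info = proc_stat(entry)
    if info is None:
        continue
    comm, state = info
    if state == "Z":
        zombies += 1
    if comm.startswith("runc"):
        runc_procs.append((entry, comm, state))

print(f"zombie processes: {zombies}")
for pid, comm, state in runc_procs:
    print(f"runc-like process pid={pid} comm={comm!r} state={state}")
```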
To Reproduce
Not sure how to reproduce; probably let Longhorn run under high load for a few days. Once it happens, it stays. Force-killing longhorn-manager helps.
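As a workaround sketch (not an official Longhorn procedure), the force-kill can be scripted from a machine with kubectl access. It assumes the default longhorn-system namespace and the app=longhorn-manager pod label; verify both in your cluster first. The DaemonSet recreates the pod after the force delete.

```python
# Hedged workaround sketch: force-delete the longhorn-manager pod on one node so
# the DaemonSet recreates it. Assumes kubectl access, the "longhorn-system"
# namespace, and the "app=longhorn-manager" label -- check these in your cluster.
import subprocess
import sys

def restart_longhorn_manager(node_name: str, namespace: str = "longhorn-system") -> None:
    # Find the manager pod scheduled on the given node.
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", "app=longhorn-manager",
         "--field-selector", f"spec.nodeName={node_name}",
         "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not pods:
        sys.exit(f"no longhorn-manager pod found on node {node_name}")
    # Force-delete it; the DaemonSet controller will start a fresh pod.
    subprocess.run(
        ["kubectl", "delete", "-n", namespace, "--grace-period=0", "--force", *pods],
        check=True,
    )

if __name__ == "__main__":
    restart_longhorn_manager(sys.argv[1])
```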
Expected behavior
Longhorn should keep running stably over time.
Environment:
- Longhorn version:
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of management nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
About this issue
- State: closed
- Created 3 years ago
- Comments: 27 (5 by maintainers)
@cclhsu I found a similar issue with an rshared mount in another cluster, so I think this is on our side. Thanks for your time! I will close this for now.
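For anyone hitting the same thing, mount propagation can be checked directly. Here is a small sketch, assuming a Linux host, that parses /proc/self/mountinfo and prints every mount point whose optional fields mark it as shared (i.e. rshared propagation), which is what the comment above points to.

```python
# Hedged diagnostic sketch (assumes Linux with /proc mounted): list mount points
# whose propagation is "shared", by parsing the optional fields of mountinfo.
def shared_mounts(path="/proc/self/mountinfo"):
    shared = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            sep = fields.index("-")          # separator before fstype/source
            mount_point = fields[4]
            optional = fields[6:sep]         # e.g. ["shared:105", "master:1"]
            if any(opt.startswith("shared:") for opt in optional):
                shared.append(mount_point)
    return shared

if __name__ == "__main__":
    for mp in shared_mounts():
        print(mp)
```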
This is one k8s host containing 1 running engine and 1 running replica:
Typically, the CPU is consumed by the engine or replica processes during data r/w. The processes look like:
(replica)
(engine)
It seems that the runc processes in your cluster are not engine/replica processes. Not sure if it's caused by the liveness probe.
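One way to test the liveness-probe theory is to map the runc processes back to their containers. A rough sketch, assuming a Linux host where the cgroup path embeds the container ID (as typical containerd/kubepods setups do):

```python
# Hedged follow-up sketch: for every runc-named process, print its cgroup line so
# it can be matched to a container -- e.g. to check whether the runc:[2:INIT]
# processes are exec'd liveness probes rather than engine/replica processes.
import os

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        with open(f"/proc/{entry}/comm") as f:
            comm = f.read().strip()
        if not comm.startswith("runc"):
            continue
        with open(f"/proc/{entry}/cgroup") as f:
            cgroup = f.read().strip().replace("\n", " | ")
        print(f"pid={entry} comm={comm!r} cgroup={cgroup}")
    except OSError:
        continue  # process exited while we were reading it
```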