longhorn: [BUG] High CPU usage by instance manager
Describe the bug
The Longhorn instance manager spikes CPU usage on a random worker node so heavily (load average above 6700) that the node becomes unreachable.
To Reproduce
Not sure how to reproduce, as I am not sure what caused it. It is not happening in my PROD and DEV clusters, only on the Staging cluster. I assume my problems started when upgrading from v1.2.4 to v1.3.0, but that upgrade was done on all clusters.
Expected behavior
CPU usage should not spike on a random node and take it offline.
Log or Support bundle
#3636 https://drive.google.com/file/d/13miZCNOcBQO9z2phHhE1OuRvgHkQ9fjv/view?usp=sharing
Environment
- Longhorn version: v1.3.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm (ArgoCD)
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: Ubuntu 22.04.1 LTS
- CPU per node: 1x AMD Epyc 7702P (worker node)
- Memory per node: 512GB (worker node)
- Disk type (e.g. SSD/NVMe): Kioxia NVMe (6x 960GB in RAID10 using mdadm, worker node)
- Network bandwidth between the nodes: 25Gbps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 132 so far
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
h1014 Ready <none> 86d v1.23.6+k3s1 102.xxx.xxx.19 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
h1015 Ready <none> 86d v1.23.6+k3s1 102.xxx.xxx.20 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
h1016 Ready <none> 86d v1.23.6+k3s1 102.xxx.xxx.21 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
v1025 Ready control-plane,etcd,master 87d v1.23.6+k3s1 102.xxx.xxx.22 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
v1026 Ready control-plane,etcd,master 86d v1.23.6+k3s1 102.xxx.xxx.23 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
v1027 Ready control-plane,etcd,master 86d v1.23.6+k3s1 102.xxx.xxx.24 <none> Ubuntu 22.04.1 LTS 5.15.0-46-generic containerd://1.5.11-k3s2
Additional context
If the CPU spike happens, I am unable to SSH to the server. I left an SSH session open before the spike happened and was able to capture the following (it took forever though):
ps aux | grep longhorn | sort -nrk 3,3 | head -n 5
root 9593 152 0.0 1204624 267536 ? Ssl Aug27 3364:48 longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:v1.3.1 --instance-manager-image longhornio/longhorn-instance-manager:v1_20220808 --share-manager-image longhornio/longhorn-share-manager:v1_20220808 --backing-image-manager-image longhornio/backing-image-manager:v3_20220808 --manager-image longhornio/longhorn-manager:v1.3.1 --service-account longhorn-service-account
2000 2838944 1.8 0.0 912908 171460 ? Ssl Aug28 18:50 longhorn-manager admission-webhook --service-account longhorn-service-account
root 2392614 1.2 0.0 4492424 39088 ? Sl Aug28 12:27 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.1/longhorn sync-agent --listen 0.0.0.0:10662 --replica 0.0.0.0:10660 --listen-port-range 10663-10672
root 20829 0.8 0.0 17398108 58584 ? Sl Aug27 19:13 longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root 3064100 0.4 0.0 4343936 35012 ? Sl Aug28 3:41 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.1/longhorn sync-agent --listen 0.0.0.0:10437 --replica 0.0.0.0:10435 --listen-port-range 10438-10447
uptime
08:18:02 up 1 day, 13:17, 1 user, load average: 6790.60, 6838.29, 6811.07
kill -9 9593
ps aux | sort -nrk 3,3 | head -n 5
root 2648 39.8 1.1 6987480 6001412 ? Ssl Aug27 891:04 /usr/local/bin/k3s agent
root 2776 22.7 0.2 2062824 1260512 ? Sl Aug27 508:30 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
10000 2850902 16.6 0.0 825404 111168 ? Ssl Aug28 170:57 /moco-controller
vaimoro+ 950618 13.6 0.5 21552916 2818648 ? Ssl Aug28 130:39 /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.time=14d --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.route-prefix=/ --web.config.file=/etc/prometheus/web_config/web-config.yaml
www-data 1770670 9.3 0.0 203580 127724 ? S 08:18 0:00 php /var/www/magento2/bin/magento cron:run
I cannot run top or htop, as I have not had the patience to wait for them to open…
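Since top and htop are unusable during a spike, one option is to leave a lightweight sampler running in the background beforehand so the data survives even when an interactive shell hangs. A minimal sketch, assuming a session can be started before the spike and using the hypothetical output path /var/tmp/cpuspike.log:
# Append load average and the top CPU consumers every 30 seconds.
# /var/tmp/cpuspike.log is a hypothetical path; adjust as needed.
while true; do
  {
    date
    cat /proc/loadavg
    ps aux --sort=-%cpu | head -n 5
  } >> /var/tmp/cpuspike.log
  sleep 30
done
Running it under nohup (e.g. nohup bash sampler.sh &, where sampler.sh is the loop above saved to a file) keeps the sampler alive even if the SSH session drops.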
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 32 (10 by maintainers)
Issue can be closed. Moved to rook-ceph.
Any news please?
So far it seems that disabling swap on the worker nodes has made our installation even more stable.
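For reference, a minimal sketch of how swap is commonly disabled persistently on Ubuntu nodes, assuming swap is configured via /etc/fstab rather than a systemd swap unit:
# Turn off swap immediately.
sudo swapoff -a
# Comment out swap entries so it stays off after a reboot (keeps a .bak backup).
sudo sed -i.bak '/\sswap\s/s/^/#/' /etc/fstab
Upstream kubelet expects swap to be off by default, so this also matches the usual Kubernetes node setup.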
I do not want to talk too soon, but it seems our issue was Loki and Prometheus. We removed these from the clusters and since then they have been stable as a rock (one cluster running Longhorn and the other rook-ceph).
After a quick look at the log, I just saw lots of replica rebuilding/replenishment and snapshot purges around 2022-09-01T05:39:59. The system was also busy at 2022-08-31T18:01, and the 2022-08-31T18:01 events were caused by a network disconnection. Is it possible that the rebuildings were caused by the network outage or the engine crash and led to the high CPU usage?
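A quick way to correlate this from the cluster side is to look at the Longhorn custom resources and events around those timestamps. A minimal sketch, assuming the default longhorn-system namespace:
# Replica and engine CRs; a rebuild typically shows up in their status.
kubectl -n longhorn-system get replicas.longhorn.io
kubectl -n longhorn-system get engines.longhorn.io
# Recent events sorted by time often show rebuild, snapshot purge and crash activity.
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | tail -n 50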
@LarsBingBong, first off, our nodes were crashing all together. Do you experience the same, or do you just have high CPU usage? What does kubectl describe node show for limits and requests? We also had high CPU for the instance manager, but we changed guaranteedEngineManagerCPU and guaranteedReplicaManagerCPU to 6, where the default is 12 (percent). It has to be noted that our worker nodes have 32 vCPUs.
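For context, here is a sketch of how those two settings might be adjusted when Longhorn is installed via Helm. The defaultSettings keys and the release/repo names below are assumptions, so verify them against the chart version actually in use:
# Lower the guaranteed CPU percentage reserved for engine/replica instance managers.
helm upgrade longhorn longhorn/longhorn -n longhorn-system \
  --set defaultSettings.guaranteedEngineManagerCPU=6 \
  --set defaultSettings.guaranteedReplicaManagerCPU=6
# The live values can also be inspected on the Longhorn Setting CRs:
kubectl -n longhorn-system get settings.longhorn.io | grep -i guaranteed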
Sure, of course.
Here’s a top screenshot, and a screenshot from k9s showing that it’s the instance-manager with an "e" (so the Longhorn Engine) consuming A LOT of CPU.
Debugging: crictl ps to see the container ID of the container using CPU. We know it’s the engine, so: crictl exec -ti a19f55e89fbd1 /bin/bash, where a19f55e89fbd1 is the container ID of the Longhorn Engine instance. SCSI target daemon: tail /var/log/tgtd.log
We see that it’s the PVC named pvc-596ff411-6d32-4602-87ab-8b409cc78ca9 that’s “targeted” - nothing more of real interest in there. The PVC is backing a Kafka cluster. The other PVCs supporting the other Kafka nodes in the Kafka cluster are not having issues / giving issues for the Longhorn Engine Manager.
What we’re really trying to pinpoint is what the Engine Manager is consuming CPU on.
Here’s the Longhorn support bundle: longhorn-support-bundle_70f8d0a6-2bd8-4b44-9530-06e770a3c5dc_2023-01-04T15-20-03Z.zip
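The debugging flow described above, collected into commands; the container ID shown is from this particular node, so substitute the ID reported by crictl ps on the affected node:
# Rough filter for the instance-manager containers; pick the engine one.
crictl ps | grep -i manager
# Per-container CPU/memory usage as seen by the container runtime.
crictl stats
# Shell into the engine container and check the SCSI target daemon log.
crictl exec -ti a19f55e89fbd1 /bin/bash
tail /var/log/tgtd.log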
Overall system info:
- ext4 filesystem
- VMWare vSAN
- at a minimum SSD type
Thank you
Hi @LarsBingBong,
Maybe you should explain your setup (so we have more information)? We have since moved all the nodes to VMs on the same hardware, and we are also using v1.3.2 and have not had any issues since.
@wbarnard81 cool - no surprise though. Don’t have swap enabled on Kubernetes nodes.
Just to update: I switched to rook-ceph and I am still experiencing this issue, obviously minus the Longhorn spiking part.
Also check: Michaelpalacce https://github.com/longhorn/longhorn/issues/3396
I am now checking the health checks, but I am also suspecting Prometheus, since my issue started again when I added Ceph metrics to Prometheus.
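One way to check whether a newly added scrape job (such as the Ceph metrics) is the heavy one is to query Prometheus' own scrape metrics; the host and port below are placeholders:
# Top scrape targets by sample count (heavier targets cost more to ingest).
curl -s 'http://PROMETHEUS_HOST:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, scrape_samples_scraped)'
# Targets that take the longest to scrape.
curl -s 'http://PROMETHEUS_HOST:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, scrape_duration_seconds)'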