longhorn: [BUG] High CPU usage by instance manager

Describe the bug

The Longhorn instance manager occasionally spikes CPU usage on a random worker node, pushing the load average into the thousands and effectively taking the node offline (even SSH becomes unresponsive). This started after upgrading from v1.2.4 to v1.3.x.

To Reproduce

I'm not sure how to reproduce it, as I don't know what caused it. It's not happening in my PROD and DEV clusters, only in the Staging cluster. I assume my problems started when upgrading from v1.2.4 to v1.3.0, but that upgrade was done on all clusters.

Expected behavior

CPU usage should not spike on a random node and take it offline.

Log or Support bundle

#3636 https://drive.google.com/file/d/13miZCNOcBQO9z2phHhE1OuRvgHkQ9fjv/view?usp=sharing

Environment

  • Longhorn version: v1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm (ArgoCD)
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 22.04.1 LTS
    • CPU per node: 1x AMD Epyc 7702P (worker node)
    • Memory per node: 512GB (worker node)
    • Disk type (e.g. SSD/NVMe): Kioxia NVMe (6x 960GB in RAID10 using mdadm, worker node)
    • Network bandwidth between the nodes: 25Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 132 so far
NAME    STATUS   ROLES                       AGE   VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
h1014   Ready    <none>                      86d   v1.23.6+k3s1   102.xxx.xxx.19   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2
h1015   Ready    <none>                      86d   v1.23.6+k3s1   102.xxx.xxx.20   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2
h1016   Ready    <none>                      86d   v1.23.6+k3s1   102.xxx.xxx.21   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2
v1025   Ready    control-plane,etcd,master   87d   v1.23.6+k3s1   102.xxx.xxx.22   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2
v1026   Ready    control-plane,etcd,master   86d   v1.23.6+k3s1   102.xxx.xxx.23   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2
v1027   Ready    control-plane,etcd,master   86d   v1.23.6+k3s1   102.xxx.xxx.24   <none>        Ubuntu 22.04.1 LTS   5.15.0-46-generic   containerd://1.5.11-k3s2

Additional context

When the CPU spike happens, I am unable to SSH to the server. I left an SSH session open before the spike happened and was able to capture the following (it took forever though):

ps aux | grep longhorn | sort -nrk 3,3 | head -n 5

root        9593  152  0.0 1204624 267536 ?      Ssl  Aug27 3364:48 longhorn-manager -d daemon --engine-image longhornio/longhorn-engine:v1.3.1 --instance-manager-image longhornio/longhorn-instance-manager:v1_20220808 --share-manager-image longhornio/longhorn-share-manager:v1_20220808 --backing-image-manager-image longhornio/backing-image-manager:v3_20220808 --manager-image longhornio/longhorn-manager:v1.3.1 --service-account longhorn-service-account
2000     2838944  1.8  0.0 912908 171460 ?       Ssl  Aug28  18:50 longhorn-manager admission-webhook --service-account longhorn-service-account
root     2392614  1.2  0.0 4492424 39088 ?       Sl   Aug28  12:27 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.1/longhorn sync-agent --listen 0.0.0.0:10662 --replica 0.0.0.0:10660 --listen-port-range 10663-10672
root       20829  0.8  0.0 17398108 58584 ?      Sl   Aug27  19:13 longhorn-instance-manager --debug daemon --listen 0.0.0.0:8500
root     3064100  0.4  0.0 4343936 35012 ?       Sl   Aug28   3:41 /host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.3.1/longhorn sync-agent --listen 0.0.0.0:10437 --replica 0.0.0.0:10435 --listen-port-range 10438-10447

uptime

 08:18:02 up 1 day, 13:17,  1 user,  load average: 6790.60, 6838.29, 6811.07
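(For scale: the Epyc 7702P has 64 cores / 128 threads, so a load average of ~6800 works out to roughly 50 runnable tasks per hardware thread, which matches SSH and top being effectively unusable.)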

kill -9 9593

ps aux | sort -nrk 3,3 | head -n 5

root        2648 39.8  1.1 6987480 6001412 ?     Ssl  Aug27 891:04 /usr/local/bin/k3s agent
root        2776 22.7  0.2 2062824 1260512 ?     Sl   Aug27 508:30 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
10000    2850902 16.6  0.0 825404 111168 ?       Ssl  Aug28 170:57 /moco-controller
vaimoro+  950618 13.6  0.5 21552916 2818648 ?    Ssl  Aug28 130:39 /bin/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.time=14d --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.route-prefix=/ --web.config.file=/etc/prometheus/web_config/web-config.yaml
www-data 1770670  9.3  0.0 203580 127724 ?       S    08:18   0:00 php /var/www/magento2/bin/magento cron:run

I cannot run top or htop, as I have not had the patience to wait for them to open…

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 32 (10 by maintainers)

Most upvoted comments

Issue can be closed. Moved to rook-ceph.

Any news please?

So far it seems that disabling swap on the worker nodes has made our installation even more stable.
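A minimal sketch of one way to turn swap off, assuming the swap device is declared in /etc/fstab:

sudo swapoff -a                              # turn off all swap immediately
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab   # comment out the swap entry so it stays off after a reboot
swapon --show                                # should print nothing once swap is disabled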

I do not want to talk too soon, but it seems our issue was Loki and Prometheus. We removed these from the clusters and since then they have been stable as a rock. (One cluster is running Longhorn and the other rook-ceph.)

After a quick look at the log, I saw lots of replica rebuilding/replenishment and snapshot purges around 2022-09-01T05:39:59. The system was also busy at 2022-08-31T18:01.

  • Some messages in longhorn-manager show that the replica ERR and rebuilding around 2022-08-31T18:01 were caused by a network disconnection.
  • But around 2022-09-01T05:39:59, the log shows the instance-manager-e pods somehow crashed.

Is it possible that the rebuilds were caused by the network outage or the engine crash, and led to the high CPU usage?

@LarsBingBong, first off, our nodes were crashing altogether. Do you experience the same, or do you just have high CPU usage? What does kubectl describe node show for limits and requests?
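(For example, something like this shows the per-node request/limit totals; replace <node-name> with one of your workers:)

kubectl describe node <node-name> | grep -A 10 "Allocated resources"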

We also had high CPU for the instance manager, but we changed guaranteedEngineManagerCPU and guaranteedReplicaManagerCPU to 6, where the default is 12 (percent). It should be noted that our worker nodes have 32 vCPUs.
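For reference, a sketch of how these can be set through Helm, assuming the chart exposes them under defaultSettings and the usual longhorn release/namespace names (adjust for your install; they can also be changed from the Longhorn UI settings page):

helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set defaultSettings.guaranteedEngineManagerCPU=6 \
  --set defaultSettings.guaranteedReplicaManagerCPU=6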

Sure of course.

Here’s a top screenshot:

image

and

image

a screenshot from k9s showing that it's the instance-manager with an 'e', i.e. the Longhorn Engine, consuming A LOT of CPU.
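(For anyone without k9s handy, the same thing can be spotted with plain kubectl, assuming the default longhorn-system namespace; kubectl top needs metrics-server:)

kubectl -n longhorn-system get pods -o wide | grep instance-manager-e
kubectl -n longhorn-system top pods | grep instance-manager-e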


Debugging

  1. SSH to the worker where the instance manager is consuming A LOT of CPU
  2. Run crictl ps to find the container ID of the container using the CPU. We know it's the engine, so:
  3. crictl exec -ti a19f55e89fbd1 /bin/bash, where a19f55e89fbd1 is the container ID of the Longhorn Engine instance
  4. Look at the SCSI target daemon log: tail /var/log/tgtd.log
tgtd: bs_longhorn_request(105) fail to read at 4662026240 for 4096
tgtd: bs_longhorn_request(150) io error 0xeaaa10 28 -14 4096 4662026240, Success
lh_client_close_conn: Closing connection
lh_client_close_conn: Connection close complete
tgtd: device_mgmt(246) sz:109 params:path=/var/run/longhorn-pvc-596ff411-6d32-4602-87ab-8b409cc78ca9.sock,bstype=longhorn,bsopts=size=16106127360
tgtd: bs_thread_open(409) 16
lh_client_close_conn: Closing connection
lh_client_close_conn: Connection close complete
tgtd: device_mgmt(246) sz:109 params:path=/var/run/longhorn-pvc-596ff411-6d32-4602-87ab-8b409cc78ca9.sock,bstype=longhorn,bsopts=size=16106127360
tgtd: bs_thread_open(409) 16

We see that it's the PVC named pvc-596ff411-6d32-4602-87ab-8b409cc78ca9 that's "targeted" - nothing else of real interest in there. The PVC is backing a Kafka cluster. The other PVCs supporting the other Kafka nodes in that cluster are not causing issues for the Longhorn Engine Manager.
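(Side note: that name is also the PV / Longhorn volume name, so the owning namespace and PVC can be traced back with plain kubectl, assuming the usual dynamic-provisioning naming where the Longhorn volume matches the PV name:)

kubectl get pv pvc-596ff411-6d32-4602-87ab-8b409cc78ca9 -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'
kubectl -n longhorn-system get volumes.longhorn.io pvc-596ff411-6d32-4602-87ab-8b409cc78ca9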

What we're really trying to pinpoint is what the Engine Manager is spending all that CPU on.
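One way to narrow that down from the node itself is to look at per-thread CPU inside the hot process (pidstat needs the sysstat package; <pid> is the PID of the busy longhorn-instance-manager or tgtd process from ps):

top -H -p <pid>          # per-thread CPU view of the hot process
pidstat -t -p <pid> 1 5  # same idea, sampled once a second for 5 seconds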


Here’s the Longhorn support bundle:

longhorn-support-bundle_70f8d0a6-2bd8-4b44-9530-06e770a3c5dc_2023-01-04T15-20-03Z.zip


Overall system info

  • Longhorn v1.3.2
  • K3s v1.24.6+k3s1
  • OS: Ubuntu 20.04.4
    • Kernel: 5.13.0-37-generic
  • CPUs per worker: 8
  • Memory: 16GB
  • Installation method: Helm
  • Disk type
    • LVM2 stripe 2
    • Raw block devices attached to VMs that are then formatted with an ext4 filesystem
    • The underlying infra is VMware vSAN with at least SSD-class disks
  • Volume count: 8

Thank you

Hi @LarsBingBong,

Maybe you should explain your setup (so we have more information)? We have since moved all the nodes to VMs on the same hardware; we are also using v1.3.2 and have not had any issues since.

@wbarnard81 cool - no surprise though: swap shouldn't be enabled on Kubernetes nodes.

Just to update: I switched to rook-ceph and am still experiencing this issue, obviously not the Longhorn spiking part.

Also check: Michaelpalacce https://github.com/longhorn/longhorn/issues/3396

I am now checking the health checks, but I am also suspecting Prometheus, since my issue started again when I added Ceph metrics to Prometheus.