longhorn: [BUG] High CPU usage by instance manager

Describe the bug

This is a continuation of an earlier bug report. I’m seeing 100% CPU usage by a tgtd process.

To Reproduce

  1. Install longhorn via helm
  2. Provision volumes with the longhorn storage class
  3. Volumes will be successfully bound and created; however, when a pod tries to mount the volume, the mount will fail
  4. The node’s instance manager will then expand to use 100% of the CPU

Expected behavior

Low CPU usage from the instance-manager-e-xxxx pod (far lower than consuming 4 CPUs)

Log or Support bundle

Here is a list of all of my Longhorn pods and their resource usage (note instance-manager-e-167eff13):

NAME                                        CPU(cores)   MEMORY(bytes)
csi-attacher-6454556647-6xdl4               0m           15Mi
csi-attacher-6454556647-z52zh               0m           9Mi
csi-provisioner-869bdc4b79-l6t6w            0m           9Mi
csi-provisioner-869bdc4b79-nwrnm            0m           10Mi
csi-provisioner-869bdc4b79-znn4g            2m           14Mi
csi-resizer-6d8cf5f99f-gsx4p                0m           8Mi
csi-resizer-6d8cf5f99f-qm5h7                0m           15Mi
csi-resizer-6d8cf5f99f-zngh7                0m           12Mi
csi-snapshotter-588457fcdf-6pnbx            2m           11Mi
csi-snapshotter-588457fcdf-kd9mt            0m           13Mi
csi-snapshotter-588457fcdf-nr5zf            0m           9Mi
engine-image-ei-fa2dfbf0-dh46h              5m           9Mi
engine-image-ei-fa2dfbf0-f2c54              4m           12Mi
engine-image-ei-fa2dfbf0-w5db5              1m           6Mi
instance-manager-e-167eff13                 3433m        71Mi
instance-manager-e-50d209a9                 3m           17Mi
instance-manager-e-54542bca                 9m           22Mi
instance-manager-r-26157459                 3m           44Mi
instance-manager-r-b8b0cead                 89m          37Mi
instance-manager-r-e0818bc8                 32m          58Mi
longhorn-csi-plugin-cvm9s                   0m           15Mi
longhorn-csi-plugin-rbcvn                   2m           19Mi
longhorn-csi-plugin-t5cx8                   0m           22Mi
longhorn-driver-deployer-7dddcdd5bb-zxdzl   0m           8Mi
longhorn-manager-6mmxp                      40m          69Mi
longhorn-manager-7lcrs                      8m           87Mi
longhorn-manager-qmdhj                      59m          75Mi
longhorn-ui-7648d6cd69-zqkql                0m           2Mi

When viewing the logs of the instance manager, I’m seeing a lot of the following two snippets:

[longhorn-instance-manager] time="2022-02-17T17:05:37Z" level=info msg="wait for process pvc-d410a8a2-c885-4e95-a122-1d84c1ed5427-e-f398b362 to shutdown"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-fde0f0df-9391-430f-af5c-6023139337c0-e-71e498e1 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-fde0f0df-9391-430f-af5c-6023139337c0-e-71e498e1 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"

and I’m also seeing a lot of the following tgtd errors:

tgtd: abort_cmd(1325) found 10000068 6
tgtd: iscsi_tcp_show(552) Transport endpoint is not connected
tgtd: conn_close_admin(253) close 1f4 0
tgtd: tgt_event_modify(271) Cannot find event 515
tgtd: iscsi_event_modify(575) tgt_event_modify failed

The main command taking up all of the resources is the following tgtd process:

tgtd -f

Environment

  • Longhorn version: 1.2.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K8s v1.23.1
    • Number of management nodes in the cluster: 1 (marked as schedulable)
    • Number of worker nodes in the cluster: 2
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 4 CPUs on management, 2 CPUs on workers
    • Memory per node: 2 GB
    • Disk type(e.g. SSD/NVMe): HDD
    • Network bandwidth between the nodes: all are VMs on the same host
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Proxmox
  • Number of Longhorn volumes in the cluster: 4

Additional Information

This issue can be temporarily resolved by restarting the node; however, the issue will always return on one of the nodes.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 26 (9 by maintainers)

Most upvoted comments

By attaching gdb, it looks like one thread in tgtd is stuck calling read_full from liblonghorn. I’m seeing SIGPIPE too, which makes sense.

Do you mean the iowait caused the high CPU usage?

No, I recompiled liblonghorn with debugging symbols and believe I figured out exactly what’s going on.

  1. The longhorn engine has a signal handler that is executed when the engine (the longhorn controller command) is signaled to exit gracefully, i.e. SIGTERM or SIGINT. https://github.com/longhorn/longhorn-engine/blob/051ebde20b02cdeb11ff49588f5bf8e7d527c7eb/app/cmd/controller.go#L105 establishes the handler, and https://github.com/longhorn/longhorn-engine/blob/051ebde20b02cdeb11ff49588f5bf8e7d527c7eb/pkg/controller/control.go#L787 executes the tgtadm command to remove the volume from tgtd. If the frontend isn’t shut down gracefully, tgtd will continue to try to read from and write to the Unix domain socket between tgtd and the longhorn engine.

  2. The read at https://github.com/rancher/liblonghorn/blob/9eaf16c13dccc7f3bdb111d4e21662ae753bdccf/src/longhorn_rpc_protocol.c#L13 always returns 0 when the socket is closed on the other end. liblonghorn creates a thread that reads from the Unix domain socket continuously: https://github.com/rancher/liblonghorn/blob/9eaf16c13dccc7f3bdb111d4e21662ae753bdccf/src/longhorn_rpc_client.c#L310. It relies on the read syscall either blocking or returning a number > 0. When the socket is closed on the other end, read returns 0 and the loop in read_full becomes an infinite loop utilizing 100% of the CPU, because readed will never reach len (a small standalone demo of this read/write behavior follows the log snippet below).

  3. tgtd is also using liblonghorn to write when it is in this 100% CPU state. The write_full function in liblonghorn calls the write system call, which fails by returning -1 and setting errno to EPIPE. tgtd logs this problem with a message like this:

send_msg: fail to write magic version
tgtd: bs_longhorn_request(105) fail to read at 5368643584 for 4096
tgtd: bs_longhorn_request(149) io error 0x5555555dc0f0 28 -22 4096 5368643584, Broken pipe

That message was in one of the instance-manager-e logs and I could recreate it.
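
For reference, here is a minimal, self-contained C demo of the socket behavior described in points 2 and 3 (illustrative only, not liblonghorn or tgt source): once the peer end of a Unix stream socket is closed, read() keeps returning 0 and write() fails with -1/EPIPE, which is exactly the combination that makes the read_full loop spin while tgtd logs broken-pipe errors.

/* sock_demo.c - build with: cc -o sock_demo sock_demo.c */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fds[2];
        char buf[16];
        ssize_t ret;

        /* Ignore SIGPIPE so write() reports EPIPE instead of killing the process. */
        signal(SIGPIPE, SIG_IGN);

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
                perror("socketpair");
                return 1;
        }

        /* Simulate the engine side going away (e.g. being kill -9ed or OOM-killed). */
        close(fds[1]);

        /* read() now returns 0 (EOF); a loop that only handles ret > 0 and ret < 0 spins forever. */
        ret = read(fds[0], buf, sizeof(buf));
        printf("read returned %zd\n", ret);

        /* write() fails with EPIPE, matching the "Broken pipe" errors in the tgtd logs. */
        ret = write(fds[0], "x", 1);
        printf("write returned %zd (errno: %s)\n", ret, strerror(errno));

        return 0;
}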

We need to handle read returning 0 in read_full in a more appropriate manner: detect the disconnected socket and stop reading from and writing to it.
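
A minimal sketch of that change, assuming a read_full(fd, buf, len) helper along the lines of the one in longhorn_rpc_protocol.c (names simplified, not the actual liblonghorn patch): treat read() returning 0 as a disconnected peer and return an error instead of retrying forever.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly len bytes or fail. Hypothetical variant of liblonghorn's read_full. */
ssize_t read_full(int fd, void *buf, size_t len)
{
        size_t readed = 0;   /* bytes received so far */
        ssize_t ret;

        while (readed < len) {
                ret = read(fd, (char *)buf + readed, len - readed);
                if (ret == 0) {
                        /* Peer closed the socket. Previously this case fell through,
                         * so the loop spun at 100% CPU; report it so the caller can
                         * stop using the connection. */
                        return -ECONNRESET;
                }
                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        return -errno;
                }
                readed += ret;
        }
        return (ssize_t)readed;
}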

Having this issue also, following.

By attaching gdb, it looks like one thread in tgtd is stuck calling read_full from liblonghorn. I’m seeing SIGPIPE too, which makes sense.

@HonakerM Can you look at dmesg and see if any processes on the nodes with tgtd using 100% of the CPU are killed because of an Out of Memory issue? Looking at the support bundle logs, it looks like tgtd may be reading and writing to a partially closed socket. The longhorn-engine controller is the other side of the tgtd socket. I was able to induce the 100% CPU usage behavior by kill -9ing the longhorn-engine controller process, which is similar to the kernel killing it due to running out of memory.

I am using longhorn 1.3.1 and I am still experiencing this issue. It does not look like there is a fix yet, except for restarting the worker node?

How can I help resolve this issue? I am experiencing it in production…