longhorn: [BUG] High CPU usage by instance manager
Describe the bug
This is a continuation of this bug report: I’m seeing 100% CPU usage by a tgtd process.
To Reproduce
- Install Longhorn via Helm
- Provision volumes with the longhorn storage class
- Volumes will be successfully bound and created; however, when a pod tries to mount a volume, the mount will fail
- The node’s instance manager will then grow to consume 100% of the CPU
Expected behavior
Low CPU usage from the instance-manager-e-xxxx pod (far lower than the 4 CPUs it is currently consuming)
Log or Support bundle
Here is a list of all of my Longhorn pods and their resource usage (note instance-manager-e-167eff13):
NAME CPU(cores) MEMORY(bytes)
csi-attacher-6454556647-6xdl4 0m 15Mi
csi-attacher-6454556647-z52zh 0m 9Mi
csi-provisioner-869bdc4b79-l6t6w 0m 9Mi
csi-provisioner-869bdc4b79-nwrnm 0m 10Mi
csi-provisioner-869bdc4b79-znn4g 2m 14Mi
csi-resizer-6d8cf5f99f-gsx4p 0m 8Mi
csi-resizer-6d8cf5f99f-qm5h7 0m 15Mi
csi-resizer-6d8cf5f99f-zngh7 0m 12Mi
csi-snapshotter-588457fcdf-6pnbx 2m 11Mi
csi-snapshotter-588457fcdf-kd9mt 0m 13Mi
csi-snapshotter-588457fcdf-nr5zf 0m 9Mi
engine-image-ei-fa2dfbf0-dh46h 5m 9Mi
engine-image-ei-fa2dfbf0-f2c54 4m 12Mi
engine-image-ei-fa2dfbf0-w5db5 1m 6Mi
instance-manager-e-167eff13 3433m 71Mi
instance-manager-e-50d209a9 3m 17Mi
instance-manager-e-54542bca 9m 22Mi
instance-manager-r-26157459 3m 44Mi
instance-manager-r-b8b0cead 89m 37Mi
instance-manager-r-e0818bc8 32m 58Mi
longhorn-csi-plugin-cvm9s 0m 15Mi
longhorn-csi-plugin-rbcvn 2m 19Mi
longhorn-csi-plugin-t5cx8 0m 22Mi
longhorn-driver-deployer-7dddcdd5bb-zxdzl 0m 8Mi
longhorn-manager-6mmxp 40m 69Mi
longhorn-manager-7lcrs 8m 87Mi
longhorn-manager-qmdhj 59m 75Mi
longhorn-ui-7648d6cd69-zqkql 0m 2Mi
When viewing the logs of the instance manager, I’m seeing a lot of the following two snippets:
[longhorn-instance-manager] time="2022-02-17T17:05:37Z" level=info msg="wait for process pvc-d410a8a2-c885-4e95-a122-1d84c1ed5427-e-f398b362 to shutdown"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-fde0f0df-9391-430f-af5c-6023139337c0-e-71e498e1 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:51Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-fde0f0df-9391-430f-af5c-6023139337c0-e-71e498e1 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
[longhorn-instance-manager] time="2022-02-17T16:42:52Z" level=debug msg="Process Manager: wait for process pvc-21fddd09-3d84-4cfb-8bc4-94300d22a2c5-e-d2e573b4 to shutdown before unregistering process"
I’m also seeing a lot of the following tgtd errors:
tgtd: abort_cmd(1325) found 10000068 6
tgtd: iscsi_tcp_show(552) Transport endpoint is not connected
tgtd: conn_close_admin(253) close 1f4 0
tgtd: tgt_event_modify(271) Cannot find event 515
tgtd: iscsi_event_modify(575) tgt_event_modify failed
The main command taking up all of the resources is the following tgtd process:
tgtd -f
Environment
- Longhorn version: 1.2.3
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K8s v1.23.1
- Number of management nodes in the cluster: 1 (marked as schedulable)
- Number of worker nodes in the cluster: 2
- Node config
- OS type and version: Ubuntu 20.04
- CPU per node: 4 CPUs on the management node, 2 CPUs on the workers
- Memory per node: 2 GB
- Disk type (e.g. SSD/NVMe): HDD
- Network bandwidth between the nodes: all are VMs on the same host
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Proxmox
- Number of Longhorn volumes in the cluster: 4
Additional Information
This issue can be temporarily resolved by restarting the node; however, it always returns on one of the nodes.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 4
- Comments: 26 (9 by maintainers)
No, I recompiled liblonghorn with debugging symbols and believe I figured out exactly what’s going on.
The longhorn engine has a signal handler that is executed when the engine (the longhorn controller command) is signaled to exit gracefully, i.e. on SIGTERM or SIGINT. https://github.com/longhorn/longhorn-engine/blob/051ebde20b02cdeb11ff49588f5bf8e7d527c7eb/app/cmd/controller.go#L105 establishes the handler, and https://github.com/longhorn/longhorn-engine/blob/051ebde20b02cdeb11ff49588f5bf8e7d527c7eb/pkg/controller/control.go#L787 executes the tgtadm command to remove the volume from tgtd. If the frontend isn’t shut down gracefully, tgt will continue to try to read and write on the Unix domain socket between tgtd and the longhorn engine.
https://github.com/rancher/liblonghorn/blob/9eaf16c13dccc7f3bdb111d4e21662ae753bdccf/src/longhorn_rpc_protocol.c#L13 always returns 0 when the socket is closed on the other end. liblonghorn creates a thread that reads from the Unix domain socket continuously: https://github.com/rancher/liblonghorn/blob/9eaf16c13dccc7f3bdb111d4e21662ae753bdccf/src/longhorn_rpc_client.c#L310 It relies on the read syscall either blocking or returning a number greater than 0. When the socket is closed on the other end, read returns 0, and the loop in read_full becomes an infinite loop consuming 100% of the CPU, because readed will never reach len (see the sketch below).
tgtd is also using liblonghorn to write while it is in this 100% CPU state. The write_full function in liblonghorn calls the write system call, which fails by returning -1 and setting errno to EPIPE. tgtd logs this problem with a message like this:
That message was in one of the instance-manager-e logs and I could recreate it.
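For illustration only, here is a minimal, self-contained C sketch of the busy-loop pattern described above; it is not the actual liblonghorn source, and the socketpair/close setup is just an assumption used to simulate the engine side of the socket going away (as it would after a kill -9 or an OOM kill):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch of the problematic pattern: keep reading until `len` bytes have
 * arrived, treating only ret < 0 as an error. If the peer closes the
 * socket, read() returns 0 forever, `readed` stops growing, and this loop
 * busy-spins at 100% CPU. */
ssize_t read_full_buggy(int fd, void *buf, ssize_t len)
{
        ssize_t readed = 0;   /* bytes accumulated so far */
        ssize_t ret;

        while (readed < len) {
                ret = read(fd, (char *)buf + readed, len - readed);
                if (ret < 0) {
                        return ret;   /* real I/O error */
                }
                /* ret == 0 (EOF) is silently ignored: readed never
                 * advances, so the loop never terminates. */
                readed += ret;
        }
        return readed;
}

int main(void)
{
        int fds[2];
        char buf[16];

        /* Emulate the tgtd <-> engine Unix socket with a socketpair,
         * then close one end to simulate the engine process dying. */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
                perror("socketpair");
                return 1;
        }
        close(fds[1]);

        /* This call never returns: it spins on read() == 0. */
        read_full_buggy(fds[0], buf, sizeof(buf));
        return 0;
}
```

Running this pins one CPU core, which matches the behavior seen in the tgtd thread stuck in read_full.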
We need to handle read returning 0 in read_full in a more appropriate manner. We need to detect the disconnected socket and stop reading from and writing to it (a sketch of that handling follows below).
Having this issue also, following.
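A minimal sketch of the handling proposed above, assuming the same read_full-style helper as in the previous sketch; this is illustrative only and not the actual fix adopted in liblonghorn:

```c
#include <errno.h>
#include <unistd.h>

/* Sketch of the suggested handling: a return value of 0 from read() means
 * the peer closed the socket, so report it as an error instead of looping.
 * The caller can then tear down its read/write threads rather than spin. */
ssize_t read_full_fixed(int fd, void *buf, ssize_t len)
{
        ssize_t readed = 0;
        ssize_t ret;

        while (readed < len) {
                ret = read(fd, (char *)buf + readed, len - readed);
                if (ret < 0) {
                        if (errno == EINTR) {
                                continue;        /* retry on signal */
                        }
                        return ret;              /* real I/O error */
                }
                if (ret == 0) {
                        /* Peer closed the connection: fail fast so the
                         * caller stops reading and writing to the socket. */
                        errno = ECONNRESET;
                        return -1;
                }
                readed += ret;
        }
        return readed;
}
```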
After attaching gdb, it looks like one thread in tgtd is stuck calling read_full from liblonghorn. I’m seeing SIGPIPE too, which makes sense.
@HonakerM Can you look at dmesg and see if any processes on the nodes with tgtd using 100% of the CPU were killed because of an out-of-memory issue? Looking at the support bundle logs, it looks like tgtd may be reading from and writing to a partially closed socket. The longhorn-engine controller is the other side of the tgtd socket. I was able to induce the 100% CPU usage behavior by kill -9ing the longhorn-engine controller process, which is similar to the kernel killing it when the node runs out of memory.
I am using Longhorn 1.3.1 and I am still experiencing this issue. It does not look like there is a fix yet, except for restarting the worker node?
How can I help to resolve this issue, as I am experiencing it in production…
@HonakerM You can refer to https://longhorn.io/kb/troubleshooting-create-support-bundle-with-curl/.