longhorn: [BUG] High CPU usage on one node.
Describe the bug (🐛 if you encounter this issue)
We have a cluster with 3 worker nodes running 3 environments plus Loki, Prometheus, Grafana, and Minio. One of the nodes shows high CPU usage from the instance manager: it consumes over 1 CPU, while on the other nodes it consumes less than 100m.
We also found the following error in the instance-manager pod logs:
[longhorn-instance-manager] time="2023-08-23T02:22:11Z" level=error msg="Failed to receive next item in process watch" error="rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"
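For anyone who wants to check for the same error, a minimal sketch (assuming the default longhorn-system namespace; replace the pod name with your own instance-manager pod):

# list the instance-manager pods and the nodes they run on
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# search one pod's log for the failing keepalive ping
kubectl -n longhorn-system logs <instance-manager-pod-name> | grep "keepalive ping failed"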
To Reproduce
N/A
Expected behavior
All nodes behave in the same manner and consume a reasonable amount of CPU.
Support bundle for troubleshooting
supportbundle_a1cc04ad-a7b1-44c1-a239-b8a956ed43fc_2023-08-23T02-28-23Z.zip
Environment
- Longhorn version: v1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubernetes 1.28
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
- OS type and version: CentOS 7
- Kernel version: 3.10.0-1160.95.1.el7.x86_64
- CPU per node: 3
- Memory per node: 16 GB
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure VM
- Number of Longhorn volumes in the cluster: 18
Additional context
Some additional Info:
kubectl top po
instance-manager-5b4f8e9293673b61843ec90d86572317   62m     466Mi
instance-manager-cd6e4595e2282572b9b2580deb2b9024   1163m   460Mi
instance-manager-de8c80e633236a07311ae611a2e7adc8   87m     866Mi
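For context, those numbers come from kubectl top. A small sketch of how to reproduce the view and map each instance-manager pod to its node (assuming metrics-server is available and Longhorn runs in the default longhorn-system namespace):

# CPU/memory per instance-manager pod
kubectl -n longhorn-system top pods | grep instance-manager

# which node each instance-manager pod is scheduled on
kubectl -n longhorn-system get pods -o wide | grep instance-manager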
This is our Grafana monitor showing CPU usage (note that 90% of this CPU usage comes from one node).
Workaround
Before the release of v1.5.2, please refer to the steps and the temporary longhorn-instance-manager image with the fix in https://github.com/longhorn/longhorn/issues/6645#issuecomment-1756807554.
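If you are not sure which release you are running before applying the workaround, a quick sketch for reading the deployed images (assuming the default longhorn-system namespace and the standard longhorn-manager DaemonSet name):

# image of the longhorn-manager DaemonSet
kubectl -n longhorn-system get ds longhorn-manager -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# images of the running instance-manager pods
kubectl -n longhorn-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' | grep instance-manager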
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 1
- Comments: 38 (17 by maintainers)
@derekbit One week has passed and the fix seems to work.
@m3nax Same issue, but the root cause of the broken keepalive between gRPC server and client is unknown from the support bundle. I’m working on the error handling, and it should be included in the upcoming v1.5.2. Before that, you can apply the workaround described in https://github.com/longhorn/longhorn/issues/6578#issuecomment-1718816096.
The error returned, so any help you can give would be greatly appreciated.
@m3nax Thank you for the update. It makes us more confident in the fix.
@derekbit Hello. We installed the modified version of the instance manager. We’ll monitor whether the problem comes back.
@fmq @m3nax I’ve fixed the issue. Can you help us test whether the fix works in your environment? If yes, I can build a custom image for you.
@derekbit Sent
@m3nax A support bundle is appreciated. I didn’t see any clue from @fmq’s support bundle for now. Thank you.
Reopening the ticket. Let’s keep the issue open and improve the error handling. Thanks @fmq for raising the issue.
So I did as proposed and the error disappeared. I would have loved to understand the cause of the error in order to be able to prevent it.
Thanks for your help