longhorn: [BUG] High CPU usage on one node.
Describe the bug (🐛 if you encounter this issue)
We have a cluster with 3 worker nodes running 3 environments plus Loki, Prometheus, Grafana, and Minio. One of the nodes shows high CPU usage from the instance manager: it consumes over 1 CPU, while on the other nodes it consumes less than 100m.
We also found the following error in the instance-manager pod logs:
[longhorn-instance-manager] time="2023-08-23T02:22:11Z" level=error msg="Failed to receive next item in process watch" error="rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"
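For anyone who wants to check for the same error, a minimal sketch (assuming the default longhorn-system namespace; replace the pod name with your own instance-manager pod):

# list the instance-manager pods and the nodes they run on
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# search one pod's log for the failing keepalive ping
kubectl -n longhorn-system logs <instance-manager-pod-name> | grep "keepalive ping failed"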
To Reproduce
N/A
Expected behavior
All nodes behave in the same manner and consume a reasonable amount of CPU.
Support bundle for troubleshooting
supportbundle_a1cc04ad-a7b1-44c1-a239-b8a956ed43fc_2023-08-23T02-28-23Z.zip
Environment
- Longhorn version: v1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Kubernetes 1.28
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
- OS type and version: CentOS 7
- Kernel version: 3.10.0-1160.95.1.el7.x86_64
- CPU per node: 3
- Memory per node: 16 GB
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure VM
- Number of Longhorn volumes in the cluster: 18
Additional context
Some additional Info:
kubectl top po
instance-manager-5b4f8e9293673b61843ec90d86572317   62m     466Mi
instance-manager-cd6e4595e2282572b9b2580deb2b9024   1163m   460Mi
instance-manager-de8c80e633236a07311ae611a2e7adc8   87m     866Mi
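For context, those numbers come from kubectl top. A small sketch of how to reproduce the view and map each instance-manager pod to its node (assuming metrics-server is available and Longhorn runs in the default longhorn-system namespace):

# CPU/memory per instance-manager pod
kubectl -n longhorn-system top pods | grep instance-manager

# which node each instance-manager pod is scheduled on
kubectl -n longhorn-system get pods -o wide | grep instance-manager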
This is our Grafana monitor showing CPU usage (note that 90% of this CPU usage comes from one node).
Workaround
Before the release of v1.5.2, please refer to the steps and the temporary longhorn-instance-manager image with the fix in https://github.com/longhorn/longhorn/issues/6645#issuecomment-1756807554.
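If you are not sure which release you are running before applying the workaround, a quick sketch for reading the deployed images (assuming the default longhorn-system namespace and the standard longhorn-manager DaemonSet name):

# image of the longhorn-manager DaemonSet
kubectl -n longhorn-system get ds longhorn-manager -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# images of the running instance-manager pods
kubectl -n longhorn-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' | grep instance-manager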
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 1
- Comments: 38 (17 by maintainers)
@derekbit One week has passed and the fix seems to work.
@m3nax Same issue, but the root cause of the broken keepalive between gRPC server and client is unknown from the support bundle. I’m working on the error handling, and it should be included in the upcoming v1.5.2. Before that, you can apply the workaround described in https://github.com/longhorn/longhorn/issues/6578#issuecomment-1718816096.
The error returned, so any help you can give would be greatly appreciated.
@m3nax Thank you for the update. It makes us more confident in the fix.
@derekbit Hello. We installed the modified version of the instance manager. We’ll monitor whether the problem comes back.
@fmq @m3nax I’ve fixed the issue. Can you help us test whether the fix works in your environment? If yes, I can build a custom image for you.
@derekbit Sent
@m3nax A support bundle is appreciated. I didn’t see any clue from @fmq’s support bundle for now. Thank you.
Reopening the ticket. Let’s keep the issue open and improve the error handling. Thanks @fmq for raising the issue.
So I did as proposed and the error disappeared. I would have loved to understand the cause of the error in order to be able to prevent it.
Thanks for your help