longhorn: [BUG] Volume operations take a long time during automatic engine upgrade in a big cluster
Describe the bug
Volume operations (creating a new PVC, attach, detach) take a long time during automatic engine upgrades in a big cluster (400+ volumes). This seems to be a scale issue, as we couldn't reproduce it on much smaller staging clusters. It doesn't seem to be related to the available resources, as there is no OOM, throttling, or iowait. The issue temporarily resolves after rolling the longhorn-manager DaemonSet in the cluster (see the sketch below). It even persists after stopping the engine upgrade; once the upgrade is complete, everything seems to work fine.
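A minimal sketch of the temporary workaround mentioned above (rolling the longhorn-manager DaemonSet), assuming the default longhorn-system namespace:

```sh
# Roll the longhorn-manager DaemonSet (assumes Longhorn is installed
# in the default longhorn-system namespace; adjust if installed elsewhere).
kubectl -n longhorn-system rollout restart daemonset/longhorn-manager

# Wait until all longhorn-manager pods have been recreated.
kubectl -n longhorn-system rollout status daemonset/longhorn-manager
```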
Recollect the event chain:
- At about 9pm UTC yesterday we upgraded the chart to 1.1.1 without issues and started the automatic upgrade with concurrency set to 1
- Within an hour or so it started exhibiting the issues we had seen earlier on another cluster: new volumes were stuck due to timeouts
- The upgrade slowed down too
- Throughout the night we tried values of 1 and 2 for concurrency
- Eventually, by today's noon we had about 80 volumes left to upgrade, the count had almost stopped going down, and all attach/detach operations in the system were very slow
- Soon we found out that upgrades were completely stuck on one node
- After trying several things (restarts of some Longhorn deployments) we stopped the auto upgrade, cordoned the node, and restarted the instance manager on it (see the sketch after this list)
- This caused forced re-scheduling of all the pods with attached volumes to different nodes
- In several hours (as attach/detach was still taking too much time) all but one pod were successfully running on other nodes and all stuck volumes were upgraded
- The node was still showing issues communicating with the instance manager even after this, so we just drained and terminated it
- Then we decided to upgrade the remaining 50 or so volumes manually in batches
- This was mostly fast (5-6 minutes per volume), except for one node once again (a different one this time); it's the one that's currently stuck with 13 volumes
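For reference, a rough sketch of the manual interventions described in the list above; the node and pod names are placeholders, not taken from the cluster:

```sh
# Cordon the stuck node so no new pods are scheduled onto it.
kubectl cordon <stuck-node>

# Restart the instance manager on that node by deleting its pod;
# Longhorn recreates it, and in this case that forced pods with
# attached volumes to be rescheduled onto other nodes.
kubectl -n longhorn-system delete pod <instance-manager-pod-on-that-node>

# Later, drain the node before terminating it.
kubectl drain <stuck-node> --ignore-daemonsets
```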
Expected behavior
The automatic engine upgrade should finish and should not slow down volume operations.
Log
A support bundle has been sent to longhorn-support-bundle@rancher.com
Environment:
- Longhorn version: 1.1.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS 1.17
- Number of management nodes in the cluster: unknown
- Number of worker nodes in the cluster: 25
- Node config
- OS type and version: EKS-optimized Amazon Linux 2
- CPU per node: varies
- Memory per node: varies
- Disk type (e.g. SSD/NVMe): NVMe
- Network bandwidth between the nodes: varies
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster: 400+
Additional context
- Created on behalf of @excieve
- More information is on slack: https://cloud-native.slack.com/archives/CNVPEL9U3/p1621918309019000
About this issue
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 20 (10 by maintainers)
@excieve When this fix (v1.1.2) is out and you upgrade to v1.1.2, there will be 2 different `instance-manager` versions running side by side in your cluster (1 pod for the old version and 1 new pod for the new version). The engine/replica processes will continue to live inside the old `instance-manager` pods. Now, if for some reason the volumes are detached (because you scale down the workload or roll it out), Longhorn will cleanly stop the engine/replica processes in the old pods and start them on the new `instance-manager` pods when the volumes are reattached. After all volumes' engine/replica processes move to the new `instance-manager` pods, Longhorn will delete the old `instance-manager` pods. Then, you can start upgrading the engine image to v1.1.2 as usual.
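For reference, a rough sketch of how this live-upgrade flow can be observed from the CLI; the workload name and namespace below are placeholders, not from this issue:

```sh
# List instance-manager pods; after upgrading to v1.1.2, pods for both
# the old and the new instance-manager version run side by side until
# all engine/replica processes have moved over.
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Trigger a detach/reattach by rolling the workload that uses the
# volumes (placeholder namespace and Deployment name).
kubectl -n <namespace> rollout restart deployment/<workload>
```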
I have tested Joshua's PR with the following steps:
- Set `Concurrent Automatic Engine Upgrade Per Node Limit` to 3
Thanks @joshimoo for fixing the bug!
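As a rough sketch, that limit can be changed in the Longhorn UI (Settings) or by patching the corresponding Longhorn Setting CR; the setting name below is an assumption (concurrent-automatic-engine-upgrade-per-node-limit):

```sh
# Patch the (assumed) setting name to allow 3 concurrent automatic
# engine upgrades per node.
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit \
  --type merge -p '{"value": "3"}'
```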
Sure thing, but we’re not clear on the timeline yet since our current mix of v1.1.1 and v1.1.0 is fairly stable at the moment.