longhorn: [BUG] Volume operations take a long time during automatic engine upgrade in a big cluster

Describe the bug
Volume operations (creating a new PVC, attaching, detaching) take a long time during the automatic engine upgrade in a big cluster (400+ volumes). This seems to be scale-related, as we couldn’t reproduce it on much smaller staging clusters. It doesn’t appear to be tied to available resources: there is no OOM, throttling, or I/O wait. The issue temporarily resolves after rolling the longhorn-manager DaemonSet (see the command sketch below), but it persists even after stopping the engine upgrade. Once the upgrade is complete, everything works fine again.
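
For reference, the temporary workaround mentioned above (rolling the longhorn-manager DaemonSet) is a standard rollout restart; a minimal sketch, assuming the default longhorn-system namespace:

```bash
# Restart the longhorn-manager DaemonSet and wait for the rollout to complete
kubectl -n longhorn-system rollout restart daemonset/longhorn-manager
kubectl -n longhorn-system rollout status daemonset/longhorn-manager
```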

Recollect the event chain:

  • At about 9pm UTC yesterday we upgraded the chart to 1.1.1 without issues and started the automatic upgrade with concurrency set to 1
  • Within an hour or so it started exhibiting the issues we had seen earlier on another cluster: new volumes were stuck due to timeouts
  • The upgrade slowed down too
  • Throughout the night we tried concurrency values of 1 and 2
  • By noon today we had about 80 volumes left to upgrade, the count had almost stopped going down, and all attach/detach operations in the system were very slow
  • Soon we found out that upgrades were completely stuck on one node
  • After trying several things (restarts of some LH deployments) we stopped the auto upgrade, cordoned the node, and restarted the instance manager on it (see the kubectl sketch after this list)
  • This caused forced re-scheduling of all the pods with attached volumes to different nodes
  • Within several hours (attach/detach was still taking too much time) all but one pod were successfully running on other nodes and all stuck volumes were upgraded
  • The node was still showing issues communicating with the instance manager even after this, so we just drained and terminated it
  • Then we decided to upgrade the remaining 50 or so volumes manually in batches
  • This was mostly fast (5-6 minutes per volume), except for one node once again (a different one this time); it’s the one that’s currently stuck with 13 volumes
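
For anyone hitting a similar state, the manual recovery described above roughly maps to the following kubectl operations; a hedged sketch with placeholder node and pod names (the actual instance-manager pod name differs per cluster):

```bash
# Cordon the stuck node so nothing new is scheduled onto it (node name is a placeholder)
kubectl cordon ip-10-0-1-23.ec2.internal

# Find and restart the instance manager on that node by deleting its pod;
# Longhorn recreates it, and pods with attached volumes get rescheduled
kubectl -n longhorn-system get pods -o wide | grep instance-manager | grep ip-10-0-1-23
kubectl -n longhorn-system delete pod instance-manager-e-xxxxxxxx  # placeholder pod name

# As a last resort, drain the node before terminating it
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets
```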

Expected behavior
The automatic engine upgrade should finish and should not slow down volume operations.

Log
Support bundle was sent to longhorn-support-bundle@rancher.com

Environment:

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS 1.17
    • Number of management nodes in the cluster: unknown
    • Number of worker nodes in the cluster: 25
  • Node config
    • OS type and version: EKS-optimized Amazon Linux 2
    • CPU per node: varies
    • Memory per node: varies
    • Disk type (e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: varies
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 400+

Additional context

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 20 (10 by maintainers)

Most upvoted comments

@excieve When this fix is out and you upgrade to v1.1.2, there will be 2 different instance-manager versions running side by side in your cluster (1 pod for the old version and 1 pod for the new version). The engine/replica processes will continue to live inside the old instance-manager pods. Now, if for some reason the volumes are detached (because you scale down the workload or roll out the workload), Longhorn will cleanly stop the engine/replica processes in the old pods and start them on the new instance-manager pods when the volumes are reattached. After all volumes’ engine/replica processes move to the new instance-manager pods, Longhorn will delete the old instance-manager pods.

Then, you can start upgrading engine image to v1.1.2 as usual.
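
For visibility into this transition, a minimal kubectl sketch (it assumes the default longhorn-system namespace; the field path spec.engineImage is recalled from the v1.1.x Volume CRD and is worth verifying with kubectl explain on your cluster):

```bash
# During the transition both instance-manager versions run side by side
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Check which engine image each volume currently references
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,ENGINE_IMAGE:.spec.engineImage,STATE:.status.state
```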

I have tested Joshua’s PR with the following steps:

  1. Create a cluster of 3 worker nodes
  2. Create 90 volumes. Attach 30 volumes to each worker node
  3. Deploy v1.1.1 engine image
  4. Select all 90 volumes and try to upgrade engine at the same time
  5. While the upgrade is happening, try to detach and attach some volumes. Create a few new volumes
  6. Verify that the upgrades finish within 30 min
  7. Set Concurrent Automatic Engine Upgrade Per Node Limit to 3 (see the kubectl sketch after this list)
  8. Wait for the automatic engine upgrade to start
  9. While the upgrade is happening, try to detach and attach some volumes. Create a few new volumes
  10. Verify that the upgrades finish within 30 min
  11. Repeat the whole test a few times.
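
Step 7 can also be done from kubectl rather than the Longhorn UI; a rough sketch, assuming the default longhorn-system namespace and the v1.1.x setting name concurrent-automatic-engine-upgrade-per-node-limit (from memory, the Setting CRD keeps its value in a top-level "value" field):

```bash
# Step 7: raise the per-node limit so the automatic engine upgrade kicks in
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit \
  --type=merge -p '{"value":"3"}'

# Watch upgrade progress on the engine image and volume objects
kubectl -n longhorn-system get engineimages.longhorn.io
kubectl -n longhorn-system get volumes.longhorn.io -o wide
```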

Thanks @joshimoo for fixing the bug!

@excieve Could you please let us know the result when you upgrade to v1.1.2?

Sure thing, but we’re not clear on the timeline yet since our current mix of v1.1.1 and v1.1.0 is fairly stable at the moment.