longhorn: [BUG] Volume operations take a long time during automatic engine upgrade in a big cluster

Describe the bug
Volume operations (creating a new PVC, attaching, detaching) take a long time during the automatic engine upgrade in a big cluster (400+ volumes). This seems to be scale-related, as we couldn’t reproduce it on much smaller staging clusters. It doesn’t appear to be tied to available resources: there is no OOM, throttling, or I/O wait. The issue temporarily resolves after rolling the longhorn-manager DaemonSet (see the command sketch below), but it persists even after stopping the engine upgrade. Once the upgrade is complete, everything works fine again.
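
For reference, the temporary workaround mentioned above (rolling the longhorn-manager DaemonSet) is a standard rollout restart; a minimal sketch, assuming the default longhorn-system namespace:

```bash
# Restart the longhorn-manager DaemonSet and wait for the rollout to complete
kubectl -n longhorn-system rollout restart daemonset/longhorn-manager
kubectl -n longhorn-system rollout status daemonset/longhorn-manager
```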

Recollect the event chain:

  • At about 9pm UTC yesterday we upgraded the chart to 1.1.1 without issues and started the automatic upgrade with concurrency set to 1
  • Within an hour or so it started exhibiting the issues we had seen earlier on another cluster: new volumes were stuck due to timeouts
  • The upgrade slowed down too
  • Throughout the night we tried concurrency values of 1 and 2
  • By noon today we had about 80 volumes left to upgrade, the count had almost stopped going down, and all attach/detach operations in the system were very slow
  • Soon we found out that upgrades were completely stuck on one node
  • After trying several things (restarts of some LH deployments) we stopped the auto upgrade, cordoned the node, and restarted the instance manager on it (see the kubectl sketch after this list)
  • This caused forced re-scheduling of all the pods with attached volumes to different nodes
  • Within several hours (attach/detach was still taking too much time) all but one pod were successfully running on other nodes and all stuck volumes were upgraded
  • The node was still showing issues communicating with the instance manager even after this, so we just drained and terminated it
  • Then we decided to upgrade the remaining 50 or so volumes manually in batches
  • This was mostly fast (5-6 minutes per volume), except for one node once again (a different one this time); it’s the one that’s currently stuck with 13 volumes
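
For anyone hitting a similar state, the manual recovery described above roughly maps to the following kubectl operations; a hedged sketch with placeholder node and pod names (the actual instance-manager pod name differs per cluster):

```bash
# Cordon the stuck node so nothing new is scheduled onto it (node name is a placeholder)
kubectl cordon ip-10-0-1-23.ec2.internal

# Find and restart the instance manager on that node by deleting its pod;
# Longhorn recreates it, and pods with attached volumes get rescheduled
kubectl -n longhorn-system get pods -o wide | grep instance-manager | grep ip-10-0-1-23
kubectl -n longhorn-system delete pod instance-manager-e-xxxxxxxx  # placeholder pod name

# As a last resort, drain the node before terminating it
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets
```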

Expected behavior
The automatic engine upgrade should finish and should not slow down volume operations.

Log
Support bundle was sent to longhorn-support-bundle@rancher.com

Environment:

  • Longhorn version: 1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS 1.17
    • Number of management nodes in the cluster: unknown
    • Number of worker nodes in the cluster: 25
  • Node config
    • OS type and version: EKS-optimized Amazon Linux 2
    • CPU per node: varies
    • Memory per node: varies
    • Disk type (e.g. SSD/NVMe): NVMe
    • Network bandwidth between the nodes: varies
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 400+

Additional context

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 20 (10 by maintainers)

Most upvoted comments

@excieve When this fix is out and you upgrade to v1.1.2, there will be 2 different instance-manager versions running side by side in your cluster (1 pod for the old version and 1 pod for the new version). The engine/replica processes will continue to live inside the old instance-manager pods. Now, if for some reason the volumes are detached (because you scale down the workload or roll out the workload), Longhorn will cleanly stop the engine/replica processes in the old pods and start them on the new instance-manager pods when the volumes are reattached. After all volumes’ engine/replica processes move to the new instance-manager pods, Longhorn will delete the old instance-manager pods.

Then, you can start upgrading engine image to v1.1.2 as usual.
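
For visibility into this transition, a minimal kubectl sketch (it assumes the default longhorn-system namespace; the field path spec.engineImage is recalled from the v1.1.x Volume CRD and is worth verifying with kubectl explain on your cluster):

```bash
# During the transition both instance-manager versions run side by side
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Check which engine image each volume currently references
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,ENGINE_IMAGE:.spec.engineImage,STATE:.status.state
```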

I have tested Joshua’s PR with the following steps:

  1. Create a cluster of 3 worker nodes
  2. Create 90 volumes. Attach 30 volumes to each worker node
  3. Deploy v1.1.1 engine image
  4. Select all 90 volumes and try to upgrade engine at the same time
  5. While the upgrade is happening, try to detach and attach some volumes. Create a few new volumes
  6. Verify that the upgrades finish within 30 min
  7. Set Concurrent Automatic Engine Upgrade Per Node Limit to 3 (see the kubectl sketch after this list)
  8. Wait for the automatic engine upgrade to start
  9. While the upgrade is happening, try to detach and attach some volumes. Create a few new volumes
  10. Verify that the upgrades finish within 30 min
  11. Repeat the whole test a few times.
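
Step 7 can also be done from kubectl rather than the Longhorn UI; a rough sketch, assuming the default longhorn-system namespace and the v1.1.x setting name concurrent-automatic-engine-upgrade-per-node-limit (from memory, the Setting CRD keeps its value in a top-level "value" field):

```bash
# Step 7: raise the per-node limit so the automatic engine upgrade kicks in
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit \
  --type=merge -p '{"value":"3"}'

# Watch upgrade progress on the engine image and volume objects
kubectl -n longhorn-system get engineimages.longhorn.io
kubectl -n longhorn-system get volumes.longhorn.io -o wide
```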

Thanks @joshimoo for fixing the bug!

@excieve Could you please let us know the result when you upgrade to v1.1.2?

Sure thing, but we’re not clear on the timeline yet since our current mix of v1.1.1 and v1.1.0 is fairly stable at the moment.