longhorn: [BUG] Volume attachment takes a long time, which may be caused by the replica controller queue being flooded by backing image events

Question: Attach/detach works fine when there are only a few resources, but becomes very slow at larger scale with 40+ backup pods and 40+ restore pods; some pods include multiple volumes (e.g. 10+). Every pod has a volume created from a backing-image-based storage class, where the backing file is an OS image of roughly 800 MB or 6 GB. Overall scale: 250+ volumes, 40+ backups, 40+ restores. Environment:

  • Longhorn version: 1.2.2 (upgraded from 1.1.2)
  • Kubernetes version: 1.18+
  • Node config (4 nodes):
    • OS type and version: CentOS 7.6
    • CPU per node: 32
    • Memory per node: 128 GB
    • Disk type: Longhorn default disk, HDD 512 GB
    • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal

Additional context: I tried changing queue.AddRateLimited to queue.Add in the enqueue functions of the replica, engine, and volume controllers in longhorn-manager (example: https://github.com/longhorn/longhorn-manager/blob/b661af0c6315d0f910a3400b650b0e9fba2df577/controller/replica_controller.go#L377), and performance went back to normal. Why is queue.AddRateLimited used here? Isn't rate limiting only needed for retries after failures?
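For reference, here is a minimal, self-contained sketch (not Longhorn code; the queue name and items are made up) of how the two calls differ on a client-go rate-limiting workqueue. Add makes the item available to workers immediately, while AddRateLimited defers it by the rate limiter's current per-item backoff plus the overall per-queue rate limit; the usual controller pattern reserves the growing backoff for retrying items that failed to sync and calls Forget once they succeed.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// DefaultControllerRateLimiter combines a per-item exponential backoff
	// with a token-bucket limit shared by the whole queue.
	q := workqueue.NewNamedRateLimitingQueue(
		workqueue.DefaultControllerRateLimiter(), "demo")
	defer q.ShutDown()

	// Add: the item is available to workers right away.
	q.Add("replica-a")

	// AddRateLimited: the item becomes available only after the limiter's
	// current delay for it; repeated calls for the same item grow that delay.
	q.AddRateLimited("replica-b")

	for i := 0; i < 2; i++ {
		start := time.Now()
		item, _ := q.Get()
		fmt.Printf("got %v after %v\n", item, time.Since(start))
		q.Forget(item) // reset the per-item backoff after a successful sync
		q.Done(item)
	}
}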

I also added a log to record the rate-limit delay (only for the longhorn replica controller), rebuilt longhorn-manager with the vendored modules, and found that the delay per resource can be very long. Log position: https://github.com/kubernetes/client-go/blob/e627be7959e730f91a38c18674c29b0256c4bb2e/util/workqueue/delaying_queue.go#L243 (timing screenshot attached)
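For anyone who wants to reproduce that instrumentation without patching vendored client-go, a rough equivalent (an assumed sketch, not the exact change used here) is to compute the delay with the same rate limiter the queue was built with, log it, and then call AddAfter, which is what AddRateLimited does internally:

package main

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// addRateLimitedWithLog mimics RateLimitingInterface.AddRateLimited, which is
// effectively AddAfter(item, rateLimiter.When(item)), but logs long delays
// first. limiter must be the same RateLimiter instance the queue was built
// with; the one-second threshold is arbitrary.
func addRateLimitedWithLog(q workqueue.RateLimitingInterface, limiter workqueue.RateLimiter, item interface{}) {
	delay := limiter.When(item)
	if delay > time.Second {
		log.Printf("enqueue of %v delayed by %v", item, delay)
	}
	q.AddAfter(item, delay)
}

func main() {
	limiter := workqueue.DefaultControllerRateLimiter()
	q := workqueue.NewNamedRateLimitingQueue(limiter, "demo")
	defer q.ShutDown()
	addRateLimitedWithLog(q, limiter, "replica-a")
}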

Metrics (screenshot attached)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 22 (11 by maintainers)

Most upvoted comments

I just tested the patch by:

  1. Launch a 3-node cluster. The longhorn-manager image is shuowu/longhorn-manager:bi-scalability, which is based on the master branch and includes the patch plus some extra logging.
  2. Launch 10 backing images; 5 of them are used by replicas.
  3. Launch 35 pods with 265 volumes in total. Each volume has 3 replicas scheduled to different nodes, and each volume uses one backing image.

(Screenshots attached: Screen Shot 2021-12-02 at 4:36:44 PM and 4:36:28 PM)

After all the volumes are running, attaching a newly created volume that has a backing image set takes around 12 seconds.

[longhorn-manager-bxrj4] 61.222.202.238 - - [02/Dec/2021:08:12:39 +0000] "POST /v1/volumes/vol-test?action=attach HTTP/1.1" 200 2785 "https://137.184.3.241/k8s/clusters/c-spnmq/api/v1/namespaces/longhorn-system/services/http:longhorn-frontend:80/proxy/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
......
[longhorn-manager-f5hwn] time="2021-12-02T08:12:51Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol-test\", UID:\"5107211e-ec43-4ce0-b732-0e2e23c0c315\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"86710\", FieldPath:\"\"}): type: 'Normal' reason: 'Attached' volume vol-test has been attached to shuo-k8s-worker1"

As for the 1-hour enqueuing count, in the env mentioned above, the backing-image-related enqueuing count is around one-tenth of the total count:

time="2021-12-02T07:12:02Z" level=debug msg="Replica controller BI-related enqueuing count 86146, total enqueuing count 731114"
......
time="2021-12-02T08:12:07Z" level=debug msg="Replica controller BI-related enqueuing count 117346, total enqueuing count 953114"

[Updated] When I switch the longhorn-manager image to master (without the patch) in this env, creating or deleting a volume gets stuck. This confirms that the patch does fix the issue.

I just checked the backing image enqueuing logic in the replica controller. It does enqueue too many irrelevant replicas for each backing image change: every replica on every disk that holds a file for the backing image gets enqueued, whether or not the replica actually uses that backing image. Here is the patch. If possible, could you give it a quick try and see if it works? Otherwise I can create a PR and ask QA to help test it.

diff --git a/controller/replica_controller.go b/controller/replica_controller.go
index 21eeea91..30a95b00 100644
--- a/controller/replica_controller.go
+++ b/controller/replica_controller.go
@@ -778,14 +778,16 @@ func (rc *ReplicaController) enqueueBackingImageChange(obj interface{}) {
 		}
 	}

+	replicas, err := rc.ds.ListReplicasByNodeRO(rc.controllerID)
+	if err != nil {
+		utilruntime.HandleError(fmt.Errorf("failed to list replicas on node %v for backing image %v: %v", rc.controllerID, backingImage.Name, err))
+		return
+	}
 	for diskUUID := range backingImage.Status.DiskFileStatusMap {
-		replicas, err := rc.ds.ListReplicasByDiskUUID(diskUUID)
-		if err != nil {
-			utilruntime.HandleError(fmt.Errorf("failed to list replicas in disk %v for backing image %v: %v", diskUUID, backingImage.Name, err))
-			continue
-		}
 		for _, r := range replicas {
-			rc.enqueueReplica(r)
+			if r.Spec.DiskID == diskUUID && r.Spec.BackingImage == backingImage.Name {
+				rc.enqueueReplica(r)
+			}
 		}
 	}

@shuo-wu hdd emmmm…

replica count = volume count * 3, so the count is OK, but the result shown is from now, not from back then. (screenshots attached)

Each node has a 3.6 TB HDD. (screenshot attached)