longhorn: [BUG] Scalability issue of volumes in Longhorn
Describe the bug
According to the scalability benchmark for Longhorn v1.0.1 and the scalability benchmark for Longhorn v1.2.0, Longhorn v1.2.0 has worse scalability than Longhorn v1.0.1.
In Longhorn v1.0.1 the graph looks linear until 900 volumes. However, in v1.2.0 the graph looks linear only for the first 410 volumes and then shoots up.
Additional context
Support bundle and kubectl top output are provided in the comment https://github.com/longhorn/longhorn/issues/2986#issuecomment-929751211
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (11 by maintainers)
Reproducing steps:
Test plan:
The Longhorn installation part will be improved when the new design for https://github.com/longhorn/longhorn/issues/2582 is introduced. I will write down the design as a LEP once I start to work on it. In brief, before each manager pod becomes running, it needs to acquire leadership and walk through all upgrade paths. This behavior greatly slows down the installation.
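For context, a rough Go sketch of the acquire-leadership-then-upgrade pattern described above, built on client-go's leaderelection package. The lease name, namespace, and the runUpgradePath/startManager helpers are illustrative placeholders, not Longhorn's actual code:

```go
package manager

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection blocks until this manager pod holds the lease, then
// walks the upgrade path before starting to serve. Lease name, namespace,
// and the helpers below are placeholders, not Longhorn's actual identifiers.
func runWithLeaderElection(ctx context.Context, cfg *rest.Config, identity string) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "longhorn-manager-upgrade-lock", // hypothetical lease name
			Namespace: "longhorn-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Every manager pod serializes here: the upgrade walk runs
				// one pod at a time, which is what slows installation down.
				if err := runUpgradePath(ctx); err != nil {
					log.Printf("upgrade path failed: %v", err)
					os.Exit(1)
				}
				startManager(ctx)
			},
			OnStoppedLeading: func() {
				log.Printf("%s lost the lease", identity)
			},
		},
	})
	return nil
}

// Placeholders for the real upgrade migrations and manager startup.
func runUpgradePath(ctx context.Context) error { return nil }
func startManager(ctx context.Context)         { <-ctx.Done() }
```

Because every manager pod serializes on the same lease before it can start, the upgrade walk effectively runs once per pod in sequence, which matches the slow-installation behavior described above.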
Once both this and the issues mentioned by Joshua are fixed, we can revisit scalability.
The numbers are looking good. We are able to handle thousands of volumes now. After this issue, future improvements could include:
The estimates for the scalability issues are still up in the air and shouldn't be considered final (a rough guideline only). We have already identified some specific scalability issues (i.e. engine-binary invocation, installation/upgrade process) which we plan on addressing; afterwards we can do another evaluation pass to see if there are additional issues.
Validation - PASSED
Tested with the v1.3.0-rc2 image; longhorn-manager no longer slows down after 500 volumes.
Notes for QAs:
Test workload:
Node spec:
Validation Status Update: non_io
Testing master-head with the non_io test from scale_test.py on a 3+10 rke2 cluster with the following steps (a simplified sketch of the underlying create-and-wait idea follows the charts below):
Resource changes can be seen in the figures below:
Pod-Time chart:
Total workqueue chart:
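For readers without access to scale_test.py (the actual test is a Python script in longhorn-tests), the Go sketch below illustrates the basic shape of one non-IO scale step: create a batch of Longhorn-backed PVCs without attaching any workload and measure how long the whole batch takes to become Bound. The namespace, naming scheme, storage class, and batch logic are assumptions for illustration only:

```go
package scale

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createAndTimeBatch creates `count` Longhorn-backed PVCs without attaching
// any workload (the non-IO case) and returns how long the whole batch takes
// to become Bound. Namespace, storage class, and naming are illustrative.
func createAndTimeBatch(ctx context.Context, client kubernetes.Interface, ns string, start, count int) (time.Duration, error) {
	sc := "longhorn"
	begin := time.Now()
	for i := start; i < start+count; i++ {
		pvc := &corev1.PersistentVolumeClaim{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("scale-pvc-%d", i)},
			Spec: corev1.PersistentVolumeClaimSpec{
				AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				StorageClassName: &sc,
				// Older client-go versions use corev1.ResourceRequirements here.
				Resources: corev1.VolumeResourceRequirements{
					Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
				},
			},
		}
		if _, err := client.CoreV1().PersistentVolumeClaims(ns).Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
			return 0, err
		}
	}
	// Poll until every PVC in this batch is Bound, then report the elapsed time.
	for {
		if ctx.Err() != nil {
			return 0, ctx.Err()
		}
		bound := 0
		for i := start; i < start+count; i++ {
			name := fmt.Sprintf("scale-pvc-%d", i)
			pvc, err := client.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
			if err == nil && pvc.Status.Phase == corev1.ClaimBound {
				bound++
			}
		}
		if bound == count {
			return time.Since(begin), nil
		}
		time.Sleep(5 * time.Second)
	}
}
```

Plotting the per-batch durations against the cumulative volume count gives curves comparable in spirit to the Pod-Time chart above.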
Just leaving a note here: I previously discussed the workqueue issue with @PhanLe1010 and asked him to see if he can expose the metrics that come with the workqueue, which should allow us to identify the culprits and cases that lead to long queue times. I further asked him to start his testing with a long resync period, since the resync period currently hides issues and could also cause issues if the calls are not idempotent or if processing all the resources eventually takes longer than the resync period.
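As a rough illustration of both suggestions (not Longhorn's eventual implementation), the sketch below wires client-go's built-in workqueue metrics to Prometheus via workqueue.SetProvider and constructs the shared informers with a deliberately long resync period; the metric names, package layout, and the 24h resync value are assumptions:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/workqueue"
)

// Prometheus vectors backing the per-queue workqueue metrics.
// Metric names are illustrative, not what Longhorn ultimately exposes.
var (
	depth = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "controller_workqueue_depth", Help: "Current number of items in the queue."}, []string{"queue"})
	adds = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "controller_workqueue_adds_total", Help: "Total items added to the queue."}, []string{"queue"})
	queueLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "controller_workqueue_queue_duration_seconds", Help: "Time an item waits in the queue."}, []string{"queue"})
	workDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "controller_workqueue_work_duration_seconds", Help: "Time spent processing an item."}, []string{"queue"})
	unfinished = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "controller_workqueue_unfinished_work_seconds", Help: "Seconds of unfinished work in progress."}, []string{"queue"})
	longestRunning = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "controller_workqueue_longest_running_processor_seconds", Help: "Longest running processor in seconds."}, []string{"queue"})
	retries = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "controller_workqueue_retries_total", Help: "Total retries handled by the queue."}, []string{"queue"})
)

// provider adapts the vectors above to client-go's workqueue.MetricsProvider,
// so every named queue reports its own depth, latency, and processing time.
type provider struct{}

func (provider) NewDepthMetric(name string) workqueue.GaugeMetric  { return depth.WithLabelValues(name) }
func (provider) NewAddsMetric(name string) workqueue.CounterMetric { return adds.WithLabelValues(name) }
func (provider) NewLatencyMetric(name string) workqueue.HistogramMetric {
	return queueLatency.WithLabelValues(name)
}
func (provider) NewWorkDurationMetric(name string) workqueue.HistogramMetric {
	return workDuration.WithLabelValues(name)
}
func (provider) NewUnfinishedWorkSecondsMetric(name string) workqueue.SettableGaugeMetric {
	return unfinished.WithLabelValues(name)
}
func (provider) NewLongestRunningProcessorSecondsMetric(name string) workqueue.SettableGaugeMetric {
	return longestRunning.WithLabelValues(name)
}
func (provider) NewRetriesMetric(name string) workqueue.CounterMetric {
	return retries.WithLabelValues(name)
}

// Setup registers the provider (this must happen before any queue is created)
// and builds informers with a deliberately long resync period, so items seen
// in the queues come from real events rather than periodic resyncs.
func Setup(client kubernetes.Interface) (informers.SharedInformerFactory, workqueue.RateLimitingInterface) {
	prometheus.MustRegister(depth, adds, queueLatency, workDuration, unfinished, longestRunning, retries)
	workqueue.SetProvider(provider{})

	factory := informers.NewSharedInformerFactory(client, 24*time.Hour) // long resync for testing
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "longhorn-volume")
	return factory, queue
}
```

With a provider registered, every queue created via NewNamedRateLimitingQueue reports per-queue depth, add rate, queue latency, work duration, and retries, which is the data needed to spot the controllers responsible for long queue times.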
The binary invocations, which we are addressing in #3546, as well as some resource-monitoring refactorings we need to do to move the resource modifications and evaluation out of the controller workers (ref #2441), should reduce the time per resource evaluation loop in the controllers. At the moment there might be some slow operations that need to be done outside of the controller and only synced against the states; we did part of this for the new backup monitoring routines.
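A minimal sketch of that monitor pattern, in the spirit of what was done for the backup monitoring routines: the slow out-of-band call runs on its own goroutine and caches its result, and the controller worker only reads the cached state during its sync. All type and field names here are illustrative, not Longhorn's actual monitor code:

```go
package monitor

import (
	"context"
	"sync"
	"time"
)

// EngineStatus is a trimmed, hypothetical view of the engine state the
// controller needs for its sync decision.
type EngineStatus struct {
	Endpoint     string
	ReplicaModes map[string]string
}

// Monitor runs the slow out-of-band work (e.g. talking to the engine process)
// on its own goroutine and caches the result, so the controller worker only
// reads memory instead of performing the slow call inside its sync loop.
type Monitor struct {
	mu      sync.RWMutex
	latest  EngineStatus
	poll    func(ctx context.Context) (EngineStatus, error) // the slow call
	enqueue func()                                          // asks the controller to resync the owning resource
}

func New(poll func(ctx context.Context) (EngineStatus, error), enqueue func()) *Monitor {
	return &Monitor{poll: poll, enqueue: enqueue}
}

// Start polls on its own schedule until the context is cancelled.
func (m *Monitor) Start(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			status, err := m.poll(ctx)
			if err != nil {
				continue // keep the last known state; the controller is never blocked
			}
			m.mu.Lock()
			m.latest = status
			m.mu.Unlock()
			m.enqueue() // the worker then syncs the cached state against the CR status
		}
	}
}

// Status is the cheap call the controller worker makes during its sync.
func (m *Monitor) Status() EngineStatus {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.latest
}
```

The key property is that the controller's per-item work is bounded by a lock-protected read, so a slow engine call can no longer inflate the per-resource evaluation time in the workers.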
The biggest scalability issue that is currently present is the engine binary invocation, which leads to many socket openings/closings in a short time frame. #2778 https://github.com/longhorn/longhorn/issues/2818#issuecomment-887865452
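To make the socket-churn point concrete, here is a simplified, hypothetical contrast between invoking the engine binary per call and reusing one long-lived proxy connection (the direction of the work tracked in #3546). The proxy address, client type, and the exec-per-query shape are schematic stand-ins, not the actual Longhorn engine CLI or proxy API:

```go
package engineapi

import (
	"context"
	"os/exec"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Old pattern (schematic): every controller evaluation forks the engine
// binary; each invocation opens and then closes its own socket to the engine
// process, which is the churn described above. The arguments are left
// abstract on purpose rather than mimicking the real engine CLI.
func callEngineBinary(ctx context.Context, binary string, args ...string) ([]byte, error) {
	return exec.CommandContext(ctx, binary, args...).Output()
}

// New pattern (schematic): dial the instance-manager proxy once and reuse the
// connection for all per-volume calls, so routine evaluations stop opening
// sockets. The address and client shape are assumptions standing in for the
// real proxy work tracked in #3546.
type ProxyClient struct {
	addr string
	once sync.Once
	conn *grpc.ClientConn
	err  error
}

func NewProxyClient(addr string) *ProxyClient { return &ProxyClient{addr: addr} }

// Conn returns the single shared connection, dialing it on first use.
func (c *ProxyClient) Conn(ctx context.Context) (*grpc.ClientConn, error) {
	c.once.Do(func() {
		c.conn, c.err = grpc.DialContext(ctx, c.addr,
			grpc.WithTransportCredentials(insecure.NewCredentials()))
	})
	return c.conn, c.err
}
```

Reusing one connection means socket setup scales with the number of long-lived clients rather than with the number of per-volume evaluations.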