cluster-api: Slow cluster creation when working with hundreds of clusters

This is not really a bug, more of a performance issue. We hope to scale CAPI to thousands of workload clusters managed by a single management cluster. Before doing this with real hardware, we are trying to verify that the controllers can handle it. A single workload cluster with 1000 Machines was no issue at all, but when creating hundreds of single-node clusters, things get very slow.

What steps did you take and what happened:

The experiment setup, including all scripts, can be found here. In short, this is how it works:

  • Management cluster: A normal KinD cluster with 3 nodes
  • CAPI + CAPM3 are installed using clusterctl as normal
  • BareMetalOperator is deployed using static manifests, configured to run in test-mode.
  • The workload clusters’ API servers are faked by a kube-apiserver and etcd pod running in the management cluster (one kube-apiserver pod serves all the workload clusters)
  • For each cluster:
    • The cluster, KCP, BMH and relevant templates are created
    • Pre-generated CAs for etcd and k8s are added (this is for faking the workload cluster API server)
    • (optional) Pre-generated etcd client certificate is also added. This is for running in external etcd mode, which helps speed things up a bit.
    • The workload cluster (fake) node is added to the workload cluster API with the correct provider ID (see the sketch after this list)
    • The workload cluster static pods (fake) are added to the workload cluster API
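
To give an idea of the faking, this is roughly the kind of Node manifest that gets applied to the shared workload API server (the names, kubeconfig path, and providerID format below are illustrative; the real setup is in the scripts mentioned above):

$ cat <<EOF | kubectl --kubeconfig fake-workload.kubeconfig apply -f -
apiVersion: v1
kind: Node
metadata:
  name: test-50-controlplane-x6rnj
  labels:
    node-role.kubernetes.io/control-plane: ""
spec:
  # Must match the providerID reported by the Metal3Machine so that CAPI can
  # correlate this Node with its Machine
  providerID: metal3://test-50/worker-1/test-50-controlplane-x6rnj
EOF

The node status (Ready condition, kubelet version and so on) has to be faked separately, since status is a subresource.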

The simulation is not perfect, and perhaps this is impacting the performance. I have not been able to confirm or rule this out. What I have found is this:

  • Since all workload clusters share one API server, they can see each other’s nodes. This makes the control planes a bit “confused”, since they see nodes that do not have corresponding Machines.
  • The Kubeadm control plane provider tries to reach the (fake) static pods to check certificate expiration. To mitigate this, we tried setting the expiration annotation on the KubeadmConfig, but unfortunately this caused some KCPs to start rollouts. It is unclear what is causing this.
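
For reference, this is roughly how the annotation was set (the annotation name is the CAPI certificates-expiry machine annotation, as far as I understand it; the KubeadmConfig name and timestamp are placeholders):

$ kubectl -n test-50 annotate kubeadmconfig <kubeadmconfig-name> \
    machine.cluster.x-k8s.io/certificates-expiry="2033-01-01T00:00:00Z"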

Performance:

  • Scaling to 100 clusters takes ~15 minutes.
  • Scaling to 300 clusters takes ~135 minutes. At that point, adding a single cluster takes more than 8 minutes. In the experiment, we create 10 clusters in parallel.

The bottleneck seems to be the Kubeadm control plane provider. There is a long pause after the KCP is created before the Machines appear. To mitigate this, I tried sharding: running one kubeadm control plane controller per namespace and grouping the workload clusters into these namespaces (10 namespaces with 10 clusters each). The shards basically ate the CPU and everything became slow. Maybe it is still the way to go, just with more CPU or fewer shards?
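
For reference, each shard was roughly a copy of the default KCP controller Deployment restricted to a single namespace, along these lines (a sketch; the --namespace flag is the standard manager flag, shard and namespace names are made up):

# Copy the default KCP controller Deployment, one copy per shard
$ kubectl -n capi-kubeadm-control-plane-system get deployment capi-kubeadm-control-plane-controller-manager -o yaml > kcp-shard-1.yaml
# In the copy: give it a new metadata.name, add "--namespace=<shard namespace>" to the
# manager args, and make sure the copies do not fight over the same leader election lock, then:
$ kubectl apply -f kcp-shard-1.yaml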

What did you expect to happen:

I was hoping to be able to reach 1000 workload clusters in a “reasonable” time and that creating new clusters would not take several minutes.

Anything else you would like to add:

I just want to highlight again that the simulation is not perfect. If you have ideas for how to improve it, or ways to check if it is impacting the performance, I would be very happy to hear about it.

Environment:

  • Cluster-api version: v1.3.2
  • minikube/kind version: v0.17.0 (kind)
  • Kubernetes version (kubectl version): v1.25.3 (kind cluster), v1.26.1 (kubectl client)
  • OS (e.g. from /etc/os-release): Ubuntu 22.04

/kind bug


Most upvoted comments

Thanks! I’ll close this then. It will be easier to track specific issues that way. This has been a great discussion and exploration, thanks!

@lentzi90 those are great insights!

And we definitely need to join forces given that this work is relevant for the entire community cc @sbueringer @killianmuldoon

If you have any ideas for how to lower the memory usage of these API servers, or maybe other ways to fake them, I would love to hear about it!

I don’t think we can fit as much as we want on a single machine, and continuing down this path also introduces some other issues, like noise from the fake workload clusters that can affect the management cluster and the test results.

What we are doing in the kubemark provider is introducing the idea of “backing clusters”: one or more external clusters that provide the necessary computing power to run all the fake workload clusters (K8s scale tests use a similar approach).

By moving fake clusters to external clusters you can potentially scale indefinitely, and you can always fall back to testing everything on one machine when you are working with less power-hungry test targets or for smoke tests.

Then to conserve resources I set up a multi-tenant etcd to back all the API servers. (Learning a lot here! 😅 )
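
In case it is useful to others, the idea is simply to give each fake API server its own key prefix in the shared etcd, roughly like this (a sketch; the etcd endpoint and prefix are placeholders, and the usual certificates and other kube-apiserver flags are omitted):

# Each fake API server stores its data under its own prefix in the shared etcd
kube-apiserver \
  --etcd-servers=https://etcd.fake-system.svc:2379 \
  --etcd-prefix=/registry-test-50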

That’s a great idea, we should definitely embed this in https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 as soon as I get to implementing the control plane part, if you don’t get there before me 😜

/triage accepted Great research work! The next step is to translate this into actionable improvements in KCP or other controllers. Metrics and the work on logs will help in doing so, but this is an awesome start.

For me it is ok to close it, but if you want to keep it for more discussion that is also fine 🙂 I know there is ongoing work on adding scalability e2e tests; maybe that can be tracked here? I marked the PR as fixing this since it solved the main blocker for my use case, but there is definitely more to do.

This is extremely useful/helpful to me, thank you for the hard work!

cc @richardcase who might be interested in the discussion as well

Thanks @fabriziopandini , very timely. This is gold @lentzi90 , great work and super helpful.

Might be worth checking whether the work queue length metric shows suspicious behavior (e.g. it grows non-linearly at some point).
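
A rough way to check, assuming the standard controller-runtime metrics and that the KCP manager’s metrics port is reachable (the port and the exact value of the name label are guesses on my side):

$ kubectl -n capi-kubeadm-control-plane-system port-forward deployment/capi-kubeadm-control-plane-controller-manager 8080:8080 &
# A workqueue_depth series (e.g. name="kubeadmcontrolplane") that keeps growing instead of
# hovering near zero would suggest the reconcilers are not keeping up
$ curl -s http://localhost:8080/metrics | grep workqueue_depth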

Those are again great insights, @lentzi90, thanks for sharing. I agree that KCP CPU and memory consumption is something to be investigated. I’m not sure about the correlation between the number of clusters and QPS, because my assumption is that a cluster will “stop” being reconciled as soon as it is provisioned, so it should not clog the reconcile queue (with the exception of the resync event every 10 minutes). But this is where the work on metrics becomes relevant for finding bottlenecks and also for explaining why provisioning time is degrading.

If this is ok for you, it would be great to set up some time to discuss the possible next steps of this work, and possibly how to upstream it. We can also discuss this at KubeCon if you are planning to make it; otherwise, I will be happy to set up something using the CAPI project Zoom, so we can also record it and share it with the other members of the community.

Thanks for the comment @sbueringer !

My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation.

This is a good point. I should probably have tried that before sharding… I will try it now and see if it helps. 🙂
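
Concretely, I plan to try it roughly like this (the deployment and namespace names are the clusterctl defaults in my setup):

$ kubectl -n capi-kubeadm-control-plane-system patch deployment capi-kubeadm-control-plane-controller-manager \
    --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubeadmcontrolplane-concurrency=100"}]'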

I assume with appear you mean that the Machine objects are not even created at this point?

Exactly! Here I managed to capture what it looks like. When the KubeadmControlPlane is 63 seconds old, there is no Machine and the SA and proxy secrets have just been created. After this the Machine (and Metal3Machine) appears.

$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   63s   v1.25.3

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      65s
secret/test-50-ca                      kubernetes.io/tls         2      66s
secret/test-50-etcd                    kubernetes.io/tls         2      66s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      1s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      1s
secret/worker-1-bmc-secret             Opaque                    2      66s
$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   67s   v1.25.3

NAME                                     CLUSTER   NODENAME   PROVIDERID   PHASE   AGE   VERSION
machine.cluster.x-k8s.io/test-50-fg498   test-50                                   0s    v1.25.3

NAME                                                                       AGE   PROVIDERID   READY   CLUSTER   PHASE
metal3machine.infrastructure.cluster.x-k8s.io/test-50-controlplane-x6rnj   0s                         test-50

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      69s
secret/test-50-ca                      kubernetes.io/tls         2      70s
secret/test-50-etcd                    kubernetes.io/tls         2      70s
secret/test-50-kubeconfig              cluster.x-k8s.io/secret   1      3s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      5s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      5s
secret/worker-1-bmc-secret             Opaque                    2      70s

First of all, great work. Nice to see that folks are starting to test CAPI at scale! 😃

There is a long pause after the KCP is created before the Machines appear.

I assume with appear you mean that the Machine objects are not even created at this point?

It looks to me like you are running Cluster API in the default configuration. My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation. The default is 10, which means KCP can only reconcile 10 KCP objects at the same time. All others have to wait.

The next question would be what the 10 workers are actually doing. It might be that they are blocked in some way.

Thank you @killianmuldoon ! I’m following https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 with great interest! Should have thought to include it in the issue directly… Let me know if you have time to try it, and if there are any issues!

Most of these tests have been on a cloud VM with 32 GB of memory and 8 CPU cores. It didn’t look like resource contention (except when I did the sharding, which maxed out the CPU).

This is amazing work! There have been a lot of questions around scaling recently, so this is really useful, and definitely the best attempt so far at reproducible scale tests. I’m excited to see if I can get this running when I get time.

It would be interesting to be able to profile the CAPI controller code while this is running. BTW - what size of a machine was this running on, and was it resource contention or just slowness in the controller that caused the scaling issue?
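
If it helps: if I remember correctly, the CAPI managers have a --profiler-address flag that exposes the standard Go pprof endpoints, so once that flag is set on the KCP deployment something like this should work (ports and names assumed):

# Assuming --profiler-address=localhost:6060 has been added to the manager args
$ kubectl -n capi-kubeadm-control-plane-system port-forward deployment/capi-kubeadm-control-plane-controller-manager 6060:6060 &
$ go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'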

Here’s a couple of related issues for reference. https://github.com/kubernetes-sigs/cluster-api/issues/7308 https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63