crossplane: API Server (and clients) becomes unresponsive with too many CRDs

What problem are you facing?

As part of the ongoing Terrajet-based providers effort, we have observed that registering 100s of CRDs has noticeable performance impacts on the cluster. Observations from two sets of experiments are described here and here. As discussed in the results of the first experiment, the Kubernetes scalability thresholds document does not currently consider the number of CRDs per cluster as a dimension. However, sig-api-machinery folks suggest a maximum of 500 CRDs, and the reason for this suggested limit is not API call latency SLOs but rather, as we also identified in our experiments, the latency introduced in OpenAPI spec publishing. As the results of the second experiment demonstrate, the marginal cost of adding a new CRD increases as more CRDs exist in the cluster.

Although not yet officially considered a scalability dimension, it looks like we need to be careful about the number of CRDs we install in a cluster. With Terrajet-based providers we would like to be able to ship 100s of CRDs per provider package. Currently, for the initial releases of these providers, we are including only a small subset (less than 10% of the total count) of all the managed resources we can generate; we would like to be able to ship all supported resources in a provider package.

How could Crossplane help solve your problem?

As part of the v1.Provider, v1.ProviderRevision, v1.Configuration and v1.ConfigurationRevision specs, we could add a new configuration field that specifies the GVKs of the package objects to be installed onto the cluster. To stay backwards compatible, if this new field is not set, all objects defined in the package manifest get installed; if it has a non-zero value, only the selected objects get installed. To make the UX of installing packages with the new configuration field easier, we could also add new options to the install provider and install configuration commands of the Crossplane kubectl plugin. A rough sketch of what such a field could look like follows below.
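To make the idea a bit more concrete, here is a minimal sketch expressed as Go API types. The field and type names (installedObjects, ObjectSelector) and the exact shape are illustrative assumptions, not a settled design:

// Hypothetical sketch only: field and type names are assumptions for
// illustration, not an agreed-upon Crossplane API.
package v1

// ObjectSelector selects package objects by group/version/kind.
type ObjectSelector struct {
	// APIVersion of the objects to install, e.g. "compute.example.org/v1alpha1".
	APIVersion string `json:"apiVersion"`
	// Kind of the objects to install, e.g. "Instance".
	Kind string `json:"kind"`
}

// PackageSpec excerpt. If InstalledObjects is empty, the package revision
// controller installs every object in the package manifest, preserving the
// current behaviour; if it is non-empty, only the selected objects are
// installed.
type PackageSpec struct {
	// ... existing fields elided ...

	// InstalledObjects restricts which package objects are installed.
	// +optional
	InstalledObjects []ObjectSelector `json:"installedObjects,omitempty"`
}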

As described in the experiment results, when a large number of CRDs is registered in a relatively short period of time, the background task that prepares the OpenAPI spec published at the /openapi/v2 endpoint may cause a spike in the API server’s CPU utilization and saturate the CPU resources allocated to it. Similar to what the controller-runtime client does, we could also implement a throttling mechanism in the package revision controller to mitigate this. However, for the reasons discussed above and in the experiment results (especially the latency introduced in OpenAPI spec publishing), it looks like we will need a complementary mechanism like the one suggested above in addition to throttling in the package revision reconciler.
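As a rough illustration of the kind of throttling meant here (a sketch, not an actual implementation), the package revision reconciler could pace CRD creation with a client-go token bucket - the same primitive the default client-go rate limiter uses for its QPS/Burst settings. The package name and rate values below are arbitrary assumptions:

package crdinstall

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// applyCRDsThrottled creates CRDs at a bounded rate so that the API server's
// OpenAPI aggregation work is spread out instead of arriving in one burst.
// The QPS/burst values are illustrative, not tuned recommendations.
func applyCRDsThrottled(ctx context.Context, cfg *rest.Config, crds []*apiextensionsv1.CustomResourceDefinition) error {
	client, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// Allow roughly 2 CRD creations per second with a burst of 5.
	limiter := flowcontrol.NewTokenBucketRateLimiter(2.0, 5)
	for _, crd := range crds {
		limiter.Accept() // blocks until a token is available
		if _, err := client.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{}); err != nil {
			return fmt.Errorf("creating CRD %s: %w", crd.Name, err)
		}
	}
	return nil
}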

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 53 (49 by maintainers)

Most upvoted comments

All the patch release PRs are merged. Now it’s guaranteed that the fix will be included in the following releases of Kubernetes:

  • 1.23 (December 7th)
  • 1.22.4 (November 17th)
  • 1.21.7 (November 17th)
  • 1.20.13 (November 17th)

I think we can close this issue once the kubectl PR https://github.com/kubernetes/kubernetes/pull/106016 is merged as well.

Executive Summary

  • Rate limiting CRD installs doesn’t reliably alleviate the problems we’re seeing.
  • Upstream fixes are in progress for the problems we’re seeing and may be available as patch releases in ~1 month.
  • I support breaking Terrajet into smaller providers rather than adding filtering support.

Sadly I’m becoming convinced that at the moment the only way to reliably avoid the API server and kubectl performance issues we’re seeing - apart from waiting for the upstream fixes - is to avoid creating ‘too many’ CRDs. What constitutes too many seems to vary with the resources available to the control plane, so there’s no reliable number we can cap it at, but ~300 is my conservative estimate.

Rate Limiting Experiments

GKE continues to be unable to scale to 2,000 CRDs (regardless of how we rate limit) without becoming unresponsive. API discovery begins to suffer from ~20 second long client-side rate limiting with as few as 200 CRDs in the cluster. The most successful strategy I’ve found so far for GKE is batches of 50 CRDs spread 30 seconds apart.
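For reference, the batching strategy above amounts to something like the following sketch, assuming the CRD objects are already decoded in memory; the package name is hypothetical and the batch size/pause are just the values from the experiment:

package crdinstall

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// applyInBatches creates CRDs in fixed-size batches with a pause between
// batches, mirroring the "50 CRDs every 30 seconds" strategy described above.
func applyInBatches(ctx context.Context, client apiextensionsclient.Interface, crds []*apiextensionsv1.CustomResourceDefinition, batchSize int, pause time.Duration) error {
	for i, crd := range crds {
		if _, err := client.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{}); err != nil {
			return err
		}
		// Pause after every full batch to let the API server catch up on
		// OpenAPI aggregation before the next burst of registrations.
		if (i+1)%batchSize == 0 {
			time.Sleep(pause)
		}
	}
	return nil
}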

EKS, on the other hand, will happily accept 2,000 CRDs applied all at once with no rate limiting on our side and no immediately discernible performance degradation, but will then exhibit client-side rate limiting of up to six minutes before some kubectl commands complete.

In my experience kind does seem to work (I got it up to 4,000 CRDs), but it uses a huge amount of resources (2 cores, GBs of memory), which is not a great experience for someone wanting to try out Crossplane for the first time on their laptop or workstation.

I’ve noticed that CPU consumption seems to drop off about an hour after a huge number of CRDs are applied; presumably this is how long the API server takes to recompute its OpenAPI schema over and over again. Memory consumption doesn’t seem to drop unless the API server process restarts. The API server has a special clause to load many CRDs more efficiently at startup as compared to those same CRDs being added at runtime.

Impending Upstream Fixes

Both of the key upstream issues we’re facing (OpenAPI processing and kubectl discovery rate limiting) have PRs open (https://github.com/kubernetes/kube-openapi/pull/251, https://github.com/kubernetes/kubernetes/pull/105520). I would expect these fixes to be merged imminently but there are no guarantees. If they’re candidates for patch releases it’s possible fixes will be available within a month per the patch release schedule. I’ve reached out for clarification around whether the folks working on the issues expect them to be backported and available as patch releases, or whether they’ll need to wait until the next minor release.

If the upstream fixes do indeed become available as patch releases I personally would feel comfortable requiring that Crossplane users be on a supported version of Kubernetes at the latest patch release in order to support large providers. Asking users to update to the latest minor version of Kubernetes seems like a taller order and not something many enterprises would be able to do easily.

Options for Reducing Installed CRDs

Personally I still don’t buy that there’s any real value in reducing the number of CRDs (used or not) in the system except to work around these performance issues, but it seems like said performance issues alone may force our hand.

“have we considered just breaking these large providers into smaller, group-based providers? It would require no code changes, just some different flags in the package build process to only include some CRDs and maybe set some flag on the entrypoint of the controller image.”

Of the two reduction avenues I’m aware of (smaller providers vs filtering what CRDs are enabled for a provider) I support the approach @hasheddan proposed above. It sounds like we’d need to work through a few things technically to make it work though, as @muvaf mentioned. For example:

  • Do we keep one ProviderConfig per provider (now group scoped) or try to share across providers?
  • What do we do with cross resource references, which would now mostly be to other providers (i.e. API groups)?

I feel the benefits of smaller providers (vs filtering) are:

  • It requires no changes to upstream Crossplane, decoupling it from the Crossplane review and release cycle.
  • Given the goal is to reduce the ratio of installed to actively used CRDs it seems (IMO) much more intuitive to “install the providers for the functionality you need”. The alternative feels more to me like requiring users to carefully and actively avoid installing too many CRDs.
  • It’s not something we need to commit to; for example we could return to larger providers (if so desired) or simply not worry about providers growing larger once the upstream issues were fixed.

Of course, neither smaller providers nor filtering providers will actually fix the scalability issues - they’ll just reduce the likelihood that folks will run into them so ultimately we are going to need to continue working with upstream Kubernetes to ensure the API server can meet our needs.

We (@muvaf, @hasheddan and me) attended today’s sig-api-machinery meeting held at 11:00am PST to discuss whether it’s possible to cherry-pick the lazy marshaling PR to active release branches.

The general consensus is that the lazy marshaling PR is close to being merged once some remaining issues with it are addressed; it looks very likely that the PR will be merged soon.

Regarding the cherry-picks to active release branches, the sig-api-machinery folks are positive about them, but noted that once the PR is merged, the risks associated with backporting still need to be evaluated. They also mentioned that given the small size of the PR, and given that it only changes when the aggregated OpenAPI spec is computed (lazily, when a request arrives), the risks should not be high.

Here are some initial results from two sets of experiments performed on kind clusters running on a GCP e2-standard-4 VM.

The first experiment does not include the lazy marshaling PR under consideration and is included here as a reference. Similar to our previous results reported here and in the comments above, we observe the expected period of high CPU/memory utilization following the provisioning of the 658 CRDs from provider-tf-azure with no client-side throttling. The kind node image used for this experiment is ulucinar/node-amd64:no-lazy-marshaling (built from https://github.com/kubernetes/kubernetes/commit/8fd95902da533e521970edbf10108ff46739d5b9):

[Figure: API server CPU/memory metrics, no lazy marshaling]

The second experiment has the lazy marshaling PR incorporated and is run on the same VM. The kind node image built for this experiment is ulucinar/node-amd64:lazy-marshaling (built from https://github.com/kubernetes/kube-openapi/pull/251/commits/8c165785553e390aaeee03baff9cff0c0e67e5fa on top of the commit referred to above). We can clearly observe improvements in both CPU and memory usage here:

[Figure: API server CPU/memory metrics, with lazy marshaling]

The fix I posted in the comment below will make it into kubectl 1.24, so that’ll be a slight improvement for kubectl. Any other client is still affected - basically anything using client-go. Check out this blog post for an in-depth explanation of the issue: https://jonnylangefeld.com/blog/the-kubernetes-discovery-cache-blessing-and-curse.

Unfortunately a real solution can only be achieved through a server-side change as well, but if we want to keep building controllers we’ll have to fix this issue.

This is a real blocker for us. We’re on 1.22.4, so applying the ~650 CRDs of provider-jet-azure is quite fast with the lazy marshaling change. On the client side, however, it’s not just kubectl that becomes nearly unusable due to the throttling; other API clients such as management dashboards (we’re using Rancher) also get extremely slow.

Can we maybe revisit the idea of reducing installed CRDs?

Also wanted to ask more generally, have we considered just breaking these large providers into smaller, group-based providers?

@hasheddan I think if we see that pooling the creations doesn’t solve the problem we’ll come back to selective installation and consider both options. There are some caveats to both approaches - for example, we need to handle different Providers installing the same ProviderConfig CRD. And in both cases, I believe we’ll need throttling if a user decides that they actually need 1,000 CRDs, whether filtered or installed as different providers under one Configuration.

Opened an upstream issue to hopefully initiate a discussion here: https://github.com/kubernetes/kubernetes/issues/105932

Hi @apelisse, I’m also investigating these issues and their implications. Would love to share ideas, gain a deeper understanding and contribute to the implementations. Thank you for your support!

A simple GET request via a raw API call wasn’t the issue before, though, right? That was always fast. Only when you run a plain kubectl get does the discovery cache get refreshed, and that is what takes long. Here is a comparison of the two:

time kubectl get pods -n default
kubectl get pods -n default  0.50s user 1.19s system 11% cpu 14.174 total

time kubectl get --raw  "/api/v1/namespaces/default/pods"
kubectl get --raw "/api/v1/namespaces/default/pods"  0.06s user 0.03s system 23% cpu 0.398 total

In the other post I describe how kubectl get makes 170 requests to the API server if the discovery cache doesn’t exist and 4 requests if it does (time kubectl get --raw "/api/v1/namespaces/default/pods" is only 1 request).
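For anyone who wants to reproduce the discovery fan-out from code rather than kubectl, here is a small client-go sketch that does what kubectl does on a cold cache - enumerate every served group/version before touching the actual resource. With hundreds of CRD-backed groups, these calls are what trip client-go’s default rate limits (QPS 5, burst 10 for a plain rest.Config); raising QPS/Burst only hides the problem on the client side, and the values below are illustrative assumptions:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// Raising these only works around the throttling on the client side;
	// the numbers are arbitrary, not a recommendation.
	cfg.QPS = 50
	cfg.Burst = 100

	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Enumerate every served group/version, as kubectl does when its
	// discovery cache is cold. With ~650 CRDs this is hundreds of requests.
	groups, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d API groups and %d group/version resource lists\n",
		len(groups), len(resources))
}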

Just left a comment in the upstream PR to signal the direction we would like to proceed with: https://github.com/kubernetes/kubernetes/pull/106181#issuecomment-963527818

Summarizing my tests from today:

I tested on an EKS cluster for the first time, and it seems pretty resilient to the issue. You need to get up to around 3,000 CRDs created consecutively before it starts to show performance issues, and it will happily accept 2,000 CRDs all applied at once. I’m guessing their control planes are fairly powerful. They are also definitely running multiple API server replicas - in some cases I saw new replicas coming and going during my tests, presumably in response to the increased load.

Unfortunately I also repeated my tests on GKE several times, and could not reproduce the success I had yesterday. Despite managing to get up to ~1,200 CRDs without issue yesterday, today GKE clusters - even regional ones - consistently exhibit various kinds of errors when connecting to the API server using kubectl after around 500 CRDs are created. I’ve tried three times today:

  • Once I attempted to create 2,000 CRDs without any rate limiting.
  • Twice I attempted to create 2,000 CRDs in batches of 100 with 60 second pauses between batches.

In all cases I saw all kinds of crazy errors, from etcd leaders changing to the API server reporting that there was no such kind as CustomResourceDefinition, to HTTP 401 Unauthorized despite my credentials working on subsequent requests. In some cases the clusters went into the ‘repairing’ state while I was seeing the errors, in others they did not (as far as I noticed, at least).

In each test I used a different cluster, but one created using the same Composition as the cluster I tested on yesterday. The only thing I can think of that was different between the cluster I tested on yesterday and the ones I created today was that yesterday’s cluster was running a few more pods (e.g. Crossplane), and was older, having been online for 2-3 weeks.

I also happened to (accidentally) test creating 4,000 CRDs on a kind cluster running on my e2-custom-4-16640 GCP development VM. The CRDs were created successfully, but discovery began to take upward of 10 minutes - i.e. 10-minute pauses before kubectl commands would run.

This all means that unfortunately I’m not feeling confident about there being any way to accommodate installing several very large providers simultaneously - e.g. Terrajet-generated AWS, Azure, and GCP providers would together be around 2,000 CRDs.

I’m repeating the batch_size=1 experiment and, in parallel, conducting a new experiment in which we do not check the Established condition but instead just sleep 1 second before each create request.
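For reference, “checking the Established condition” between creates amounts to something like the following sketch against the apiextensions client; the package name, poll interval, and timeout are arbitrary assumptions:

package crdinstall

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitEstablished polls the CRD until the API server reports the Established
// condition as True, i.e. the new endpoint is ready to serve requests.
func waitEstablished(ctx context.Context, client apiextensionsclient.Interface, name string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		crd, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, c := range crd.Status.Conditions {
			if c.Type == apiextensionsv1.Established && c.Status == apiextensionsv1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}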