crossplane: API Server (and clients) becomes unresponsive with too many CRDs
What problem are you facing?
As part of the ongoing Terrajet-based providers effort, we have observed that registering 100s of CRDs has some performance impact on the cluster. Observations from two sets of experiments are described here and here. As discussed in the results of the first experiment, the Kubernetes scalability thresholds document does not currently consider the #-of-CRDs (per cluster) as a dimension. However, sig-api-machinery folks suggest a maximum limit of 500 CRDs, and the reason for this suggested limit is not API call latency SLOs but rather, as we also identified in our experiments, the latency in OpenAPI spec publishing. As the results of the second experiment demonstrate, the marginal cost of adding a new CRD increases as more and more CRDs exist in the cluster.
Although not yet officially considered a scalability dimension, it looks like we need to be careful about the #-of-CRDs we install in a cluster. With Terrajet-based providers we would like to be able to ship 100s of CRDs per provider package. Currently, for the initial releases of these providers, we are including only a small subset (less than 10% of the total count) of all the managed resources we can generate. We would like to be able to ship all supported resources in a provider package.
How could Crossplane help solve your problem?
As part of the `v1.Provider`, `v1.ProviderRevision`, `v1.Configuration` and `v1.ConfigurationRevision` specs, we could have a new configuration field that specifies the GVKs of the package objects to be installed onto the cluster. To remain backwards-compatible, if this new API is not used, all objects defined in the package manifest get installed; if the new field has a non-zero value, it's enabled and only the selected objects get installed. To make the UX around installing packages with the new configuration field easier, we can also add new options to the `install provider` and `install configuration` commands of the Crossplane `kubectl` plugin.
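For illustration only, such a field could be expressed as a Go API type roughly like the sketch below. The `PackageObjectSelector` type, the `gvks` field name, and their shape are purely hypothetical, not an agreed-upon Crossplane API; the sketch just shows how an empty or omitted value can preserve today's install-everything behaviour.

```go
// A hypothetical sketch only; type and field names are illustrative and not
// an agreed-upon Crossplane API.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PackageObjectSelector selects which objects defined in a package's manifest
// get installed onto the cluster.
type PackageObjectSelector struct {
	// GVKs lists the group/version/kinds of package objects (e.g. CRDs) to
	// install. If empty or omitted, all objects in the package manifest are
	// installed, which preserves the current behaviour.
	// +optional
	GVKs []metav1.GroupVersionKind `json:"gvks,omitempty"`
}
```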
As described in the experiment results, when a large #-of-CRDs are registered in a relatively short period of time, the background task that prepares the OpenAPI spec published from the `/openapi/v2` endpoint may cause a spike in the API server's CPU utilization, and this may saturate the CPU resources allocated to the API server. Similar to what the controller-runtime client does, we may also consider implementing a throttling mechanism in the package revision controller to prevent this. However, because of the reasons discussed above and in the experiment results (especially the latency introduced in OpenAPI spec publishing), it looks like we will need a complementary mechanism like the one suggested above in addition to a throttling implementation in the package revision reconciler.
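As a rough illustration of the throttling idea (not Crossplane's actual reconciler code), the package revision controller could pace CRD creation with client-go's token bucket rate limiter; the 2 creates/second rate and burst of 5 below are arbitrary example values.

```go
package main

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/flowcontrol"
)

// createCRDsThrottled creates the given CRDs, blocking on a token bucket so
// that the API server never sees a large burst of registrations at once.
func createCRDsThrottled(ctx context.Context, c apiextensionsclient.Interface, crds []*apiextensionsv1.CustomResourceDefinition) error {
	limiter := flowcontrol.NewTokenBucketRateLimiter(2, 5) // ~2 creates/sec, burst of 5 (example values)
	for _, crd := range crds {
		limiter.Accept() // blocks until a token is available
		if _, err := c.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{}); err != nil {
			return fmt.Errorf("creating CRD %s: %w", crd.Name, err)
		}
	}
	return nil
}
```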
About this issue
- State: closed
- Created 3 years ago
- Comments: 53 (49 by maintainers)
All the patch release PRs are merged. Now it’s guaranteed that the fix will be included in the following releases of Kubernetes:
I think we can close this issue once the kubectl PR https://github.com/kubernetes/kubernetes/pull/106016 is merged as well.
Executive Summary
Sadly I’m feeling convinced that at the moment the only way to reliably avoid the API server and `kubectl` performance issues we’re seeing - apart from waiting for the upstream fixes - is to avoid creating ‘too many’ CRDs. It seems like what constitutes ‘too many’ varies depending on the resources available to the control plane, so there’s not a reliable number we can cap it at, but ~300 is my conservative estimate.
Rate Limiting Experiments
GKE continues to be unable to scale to 2,000 CRDs (regardless of how we rate limit) without becoming unresponsive. API discovery begins to suffer from ~20 second long client-side rate limiting with as few as 200 CRDs in the cluster. The most successful strategy I’ve found so far for GKE is batches of 50 CRDs spread 30 seconds apart.
EKS, on the other hand, will happily accept 2,000 CRDs applied all at once with no rate limiting and no other immediately discernible performance degradation, but will then exhibit client-side rate limiting of up to six minutes before some kubectl commands complete.
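For reference, here is a minimal sketch of the batching strategy mentioned above (batches of 50 CRDs with a 30-second pause between batches); the helper name and clientset wiring are illustrative, not the tooling actually used in these experiments.

```go
package main

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// createInBatches creates CRDs in fixed-size batches, pausing between batches
// to give the API server time to absorb the OpenAPI recomputation work.
func createInBatches(ctx context.Context, c apiextensionsclient.Interface, crds []*apiextensionsv1.CustomResourceDefinition, batchSize int, pause time.Duration) error {
	for i := 0; i < len(crds); i += batchSize {
		end := i + batchSize
		if end > len(crds) {
			end = len(crds)
		}
		for _, crd := range crds[i:end] {
			if _, err := c.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{}); err != nil {
				return err
			}
		}
		time.Sleep(pause) // e.g. 30 * time.Second between batches of 50
	}
	return nil
}
```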
In my experience `kind` does seem to work (I got it up to 4,000 CRDs) but uses a huge amount of resources (2 cores, GBs of memory), which is not a great experience for someone wanting to try out Crossplane for the first time on their laptop or workstation.
I’ve noticed that CPU consumption seems to drop off about an hour after a huge number of CRDs are applied; presumably this is how long the API server takes to recompute its OpenAPI schema over and over again. Memory consumption doesn’t seem to drop unless the API server process restarts. The API server has a special clause to load many CRDs more efficiently at startup as compared to those same CRDs being added at runtime.
Impending Upstream Fixes
Both of the key upstream issues we’re facing (OpenAPI processing and kubectl discovery rate limiting) have PRs open (https://github.com/kubernetes/kube-openapi/pull/251, https://github.com/kubernetes/kubernetes/pull/105520). I would expect these fixes to be merged imminently but there are no guarantees. If they’re candidates for patch releases it’s possible fixes will be available within a month per the patch release schedule. I’ve reached out for clarification around whether the folks working on the issues expect them to be backported and available as patch releases, or whether they’ll need to wait until the next minor release.
If the upstream fixes do indeed become available as patch releases I personally would feel comfortable requiring that Crossplane users be on a supported version of Kubernetes at the latest patch release in order to support large providers. Asking users to update to the latest minor version of Kubernetes seems like a taller order and not something many enterprises would be able to do easily.
Options for Reducing Installed CRDs
Personally I still don’t buy that there’s any real value in reducing the number of CRDs (used or not) in the system except to work around these performance issues, but it seems like said performance issues alone may force our hand.
Of the two reduction avenues I’m aware of (smaller providers vs filtering which CRDs are enabled for a provider) I support the approach @hasheddan proposed above. It sounds like we’d need to work through a few things technically to make it work though, as @muvaf mentioned. For example:
- A `ProviderConfig` per provider (now group scoped), or try to share across providers?
I feel the benefits of smaller providers (vs filtering) are:
Of course, neither smaller providers nor filtering providers will actually fix the scalability issues - they’ll just reduce the likelihood that folks will run into them so ultimately we are going to need to continue working with upstream Kubernetes to ensure the API server can meet our needs.
We (@muvaf, @hasheddan and I) attended today’s sig-api-machinery meeting, held at 11:00am PST, to discuss whether it’s possible to cherry-pick the lazy marshaling PR to active release branches.
The general consensus is that the lazy marshaling PR is close to being merged once some remaining issues with it are addressed; it looks like the PR will very likely be merged soon.
Regarding the cherry-picks to active release branches, the sig-api-machinery folks are positive about them, but they mentioned that the risks associated with backporting will need to be evaluated once the PR is merged. That said, given the small size of the PR, and given that it only tweaks when the aggregated OpenAPI spec is computed (it’s now computed lazily when a request arrives), the risks should not be high.
Here are some initial results from two sets of experiments performed on kind clusters running on a GCP `e2-standard-4` VM.
The first experiment does not include the lazy marshaling PR under consideration and is included here as a reference. Similar to our previous results reported here and in the comments above, we see the expected period of high CPU/memory utilization following the provisioning of the 658 CRDs from `provider-tf-azure` with no client-side throttling. The kind node image used for this experiment is `ulucinar/node-amd64:no-lazy-marshaling` (built from https://github.com/kubernetes/kubernetes/commit/8fd95902da533e521970edbf10108ff46739d5b9).
The second experiment has the lazy marshaling PR incorporated and is run on the same VM. The kind node image built for this experiment is `ulucinar/node-amd64:lazy-marshaling` (built from https://github.com/kubernetes/kube-openapi/pull/251/commits/8c165785553e390aaeee03baff9cff0c0e67e5fa on top of the commit referred to above). We can clearly observe improvements in both CPU and memory usage here.
The fix I posted in the comment below will make it into kubectl 1.24, so that’ll be a slight improvement for kubectl. Any other client is still affected - basically anything using client-go. Check out this blog post for an in-depth explanation of the issue: https://jonnylangefeld.com/blog/the-kubernetes-discovery-cache-blessing-and-curse.
Unfortunately a real solution can only be achieved through a server change as well. But if we want to keep building controllers we’ll have to fix this issue.
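For custom clients and controllers, one partial mitigation (sketched below, assuming you control the client's `rest.Config`) is to raise client-go's default client-side rate limits (QPS 5, burst 10), which are what make discovery crawl when there are hundreds of API group/versions to enumerate. This only hides the latency on the client side; it does nothing about the server-side OpenAPI cost.

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newClientWithRelaxedThrottling builds a clientset whose client-side rate
// limiter is less likely to stall discovery in CRD-heavy clusters. The values
// are arbitrary examples, not recommendations.
func newClientWithRelaxedThrottling(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50    // default is 5
	cfg.Burst = 100 // default is 10
	return kubernetes.NewForConfig(cfg)
}
```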
This is a real blocker for us. We’re on 1.22.4, so applying the ~650 CRDs of provider-jet-azure is quite fast with the lazy marshaling change. On the client side, however, it’s not just kubectl that becomes quite unusable with the throttling; other API clients like management dashboards (we’re using Rancher) also get extremely slow.
Can we maybe revisit the idea of reducing installed CRDs?
@hasheddan I think if we see that pooling the creations doesn’t solve the problem we’ll come back to selective installation and consider both options. There are some caveats to both approaches; for example, we need to handle different `Provider`s installing the same `ProviderConfig` CRD. And in both cases, I believe we’ll need throttling if the user does decide that they actually need 1,000 CRDs, whether filtered or installed as different providers under one `Configuration`.
Opened an upstream issue to hopefully initiate a discussion here: https://github.com/kubernetes/kubernetes/issues/105932
Hi @apelisse, I’m also investigating these issues and their implications. Would love to share ideas, gain a deeper understanding and contribute to the implementations. Thank you for your support!
A simple `GET` request via raw API call wasn’t the issue before though, right? That was always fast. Only if you do a `kubectl get` did the discovery cache run, and that is what took long. Here’s a comparison of the two: in the other post I describe how `kubectl get` does 170 requests to the API server if the discovery cache doesn’t exist and 4 requests if it does exist (`time kubectl get --raw "/api/v1/namespaces/default/pods"` is only 1 request).
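To make the difference concrete, here is a small client-go sketch contrasting the two paths: a cold discovery pass fans out into roughly one request per API group/version (which is where the throttling delay appears with hundreds of CRDs), while a plain typed list is a single round trip. The function is illustrative, not taken from any of the tools discussed here.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func compareDiscoveryVsRawGet(ctx context.Context, cfg *rest.Config) error {
	// Cold discovery: roughly one request per group/version, so the request
	// count grows with the number of CRD groups installed in the cluster.
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return err
	}
	lists, err := dc.ServerPreferredResources()
	if err != nil {
		return err
	}
	fmt.Printf("discovery returned %d group/version resource lists\n", len(lists))

	// Direct typed list: a single request, equivalent to
	// `kubectl get --raw "/api/v1/namespaces/default/pods"`.
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	pods, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("listed %d pods with one request\n", len(pods.Items))
	return nil
}
```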
Just left a comment in the upstream PR to signal the direction we would like to proceed with: https://github.com/kubernetes/kubernetes/pull/106181#issuecomment-963527818
Summarizing my tests from today:
I tested on an EKS cluster for the first time, and it seems pretty resilient to the issue. You need to get up to around 3,000 CRDs created consecutively before it starts to show performance issues, and it will happily accept 2,000 CRDs all applied at once. I’m guessing their control planes are fairly powerful. They are also definitely running multiple API server replicas - in some cases I saw new replicas coming and going during my tests, presumably in response to the increased load.
Unfortunately I also repeated my tests on GKE several times, and could not reproduce the success I had yesterday. Despite managing to get up to ~1,200 CRDs without issue yesterday, today GKE clusters - even regional ones - consistently exhibit various kinds of errors while attempting to connect to the API server using `kubectl` after around 500 CRDs are created. I’ve tried three times today.
In all cases I saw all kinds of crazy errors, from etcd leaders changing, to the API server reporting that there was no such kind as `CustomResourceDefinition`, to HTTP 401 Unauthorized despite my credentials working on subsequent requests. In some cases the clusters went into the ‘repairing’ state while I was seeing the errors, in others they did not (as far as I noticed, at least).
In each test I used a different cluster, but one created using the same `Composition` as the cluster I tested on yesterday. The only thing I can think of that was different between the cluster I tested on yesterday and the ones I created today is that yesterday’s cluster was running a few more pods (e.g. Crossplane), and was older, having been online for 2-3 weeks.
I also happened to (accidentally) test creating 4,000 CRDs on a `kind` cluster running on my `e2-custom-4-16640` GCP development VM. The CRDs created successfully but discovery began to take upwards of 10 minutes - i.e. 10-minute pauses before `kubectl` commands would run.
I’m repeating the `batch_size=1` experiment and, in parallel, conducting a new experiment in which we do not check the `Established` condition but instead just sleep 1 second before create requests.
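For clarity, here is a sketch of the two pacing approaches being compared, assuming the apiextensions clientset is used directly; the polling interval and timeout below are arbitrary.

```go
package main

import (
	"context"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitEstablished blocks until the named CRD reports Established=True,
// i.e. the pacing strategy used by the batch_size=1 experiment.
func waitEstablished(ctx context.Context, c apiextensionsclient.Interface, name string) error {
	return wait.PollImmediate(500*time.Millisecond, 2*time.Minute, func() (bool, error) {
		crd, err := c.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range crd.Status.Conditions {
			if cond.Type == apiextensionsv1.Established && cond.Status == apiextensionsv1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}

// The alternative in the new experiment skips the status check entirely and
// just does time.Sleep(1 * time.Second) before each create request.
```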