envoy: CDS Updates with many clusters often fail
This is a performance issue, not a bug per se.
When doing CDS updates with many clusters, Envoy will often get “stuck” evaluating the CDS update. This manifests as EDS failing, and in more extreme cases Envoy ceases to receive any xDS updates at all. When this happens, Envoy needs to be restarted to get it updating again.
In our case we’re seeing issues with the current implementation of `void CdsApiImpl::onConfigUpdate` with cluster counts in the 3000-7000 range. If Envoy could speedily evaluate a CDS update with 10000 clusters, that would represent a HUGE improvement in Envoy’s behavior for us. Right now, only around 2500 clusters in a CDS update seems to evaluate in a reasonable amount of time.
Because `void CdsApiImpl::onConfigUpdate` pauses EDS while doing the CDS evaluation, Envoy’s config will drift. With many clusters in CDS, this can mean Envoy is hundreds of seconds behind what is current, which results in 503s.
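To make that failure mode easier to follow, here is a rough illustration of the pause/resume pattern; this is not the real `CdsApiImpl` code, and `XdsMux`, `ScopedPause`, and `onCdsConfigUpdate` are made-up names:

```cpp
// Illustrative only: a made-up XdsMux with an RAII pause guard, showing why a
// slow CDS evaluation holds back every EDS update queued behind it.
#include <vector>

struct ClusterProto {};  // Stand-in for envoy::config::cluster::v3::Cluster.

class XdsMux {
public:
  // RAII guard: EDS stays paused until the guard is destroyed.
  class ScopedPause {
  public:
    explicit ScopedPause(XdsMux& mux) : mux_(mux) { mux_.eds_paused_ = true; }
    ~ScopedPause() { mux_.eds_paused_ = false; }

  private:
    XdsMux& mux_;
  };

  bool edsPaused() const { return eds_paused_; }

private:
  bool eds_paused_{false};
};

void onCdsConfigUpdate(XdsMux& mux, const std::vector<ClusterProto>& clusters) {
  // EDS is paused for the entire CDS evaluation. With thousands of clusters the
  // work in this scope can take hundreds of seconds, during which endpoint
  // assignments go stale; that is the drift and 503s described above.
  XdsMux::ScopedPause pause(mux);
  for (const auto& cluster : clusters) {
    (void)cluster;  // Per-cluster add/update/warming work would happen here.
  }
}  // EDS resumes only when the guard goes out of scope.
```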
Some context:
- Envoy in my test environment is being run without constraints on 8-core VMs that are running at a max of 30% CPU utilization and a max of 40% memory (of 32 GB):

  ```yaml
  resources:
    limits:
      memory: "32212254720"
    requests:
      cpu: 100m
      memory: 256M
  ```
- We’re using Project Contour as our Ingress controller in K8s. Contour currently doesn’t use Envoy’s incremental (delta) xDS APIs, so when K8s Services change in the cluster it sends ALL of the current config to Envoy again. This means a small change, like adding or removing a K8s Service that maps to an Envoy cluster, results in Envoy having to re-evaluate ALL clusters (see the sketch after this list). With enough K8s Services behind an Ingress (7000+), Envoy can spontaneously cease to receive any new updates indefinitely, and will fail to do EDS because it gets stuck in the CDS evaluation.
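To make the scaling concrete, here is a small standalone sketch; nothing in it is Contour or Envoy code, and all names and sizes are made up for illustration. It shows why a state-of-the-world update does work proportional to the total number of clusters even when only one Service changed, whereas a delta update would only touch the changed entries:

```cpp
// Hypothetical illustration (not Envoy/Contour code): with state-of-the-world
// xDS every update carries all clusters, so the receiver re-checks all of
// them; a delta update would only carry the changed ones.
#include <chrono>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct ClusterConfig {
  std::string name;
  std::string serialized;  // Stand-in for a serialized Cluster proto.
};

// Stand-in for the per-cluster "did anything change?" check.
uint64_t hashConfig(const ClusterConfig& c) { return std::hash<std::string>{}(c.serialized); }

int main() {
  constexpr size_t kClusters = 7000;  // Roughly the size reported in this issue.
  std::vector<ClusterConfig> snapshot;
  for (size_t i = 0; i < kClusters; ++i) {
    snapshot.push_back({"cluster-" + std::to_string(i), std::string(2048, 'x')});
  }

  // Warm state: the receiver already knows every cluster.
  std::unordered_map<std::string, uint64_t> known;
  for (const auto& c : snapshot) { known[c.name] = hashConfig(c); }

  // A single Service changes...
  snapshot[42].serialized.back() = 'y';

  // ...but with SotW the full snapshot is re-sent, so every cluster is
  // re-hashed and compared even though only one entry differs.
  const auto start = std::chrono::steady_clock::now();
  size_t changed = 0;
  for (const auto& c : snapshot) {
    const uint64_t h = hashConfig(c);
    if (known[c.name] != h) {
      known[c.name] = h;
      ++changed;
    }
  }
  const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << "SotW pass: re-checked " << kClusters << " clusters to apply "
            << changed << " change(s) in " << us.count() << " us\n";
  // A delta xDS update would carry only the changed cluster, so the work
  // would be proportional to the change set, not to the total cluster count.
  return 0;
}
```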
Given a high enough number of clusters this would be a Very Hard Problem, but given that we’re in the low thousands, I’m hoping there are some things that could be done to improve performance without resorting to exotic methods.
Please let me know if there’s any information I can provide that could help!
About this issue
- State: open
- Created 4 years ago
- Comments: 29 (23 by maintainers)
Commits related to this issue
- upstream: avoid copies of all cluster endpoints for every resolve target (#15013) Currently Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget when instantiated for every endpoint also creates a fu... — committed to envoyproxy/envoy by rojkov 3 years ago
- upstream: avoid double hashing of protos in CDS init (#15241) Commit Message: upstream: avoid double hashing of protos in CDS init Additional Description: Currently Cluster messages are hashed unco... — committed to envoyproxy/envoy by rojkov 3 years ago
New findings:
- Here `secondary_init_clusters_` is a `std::list`. When Envoy loads 10k clusters, this `secondary_init_clusters_.remove_if()` line takes 4 seconds; with 30k clusters it takes about 70 seconds. A hash map would probably be a better choice here (see the sketch below).
- Hashes of `Cluster` messages are calculated twice: the first time is here and the second time is here. The performance impact is not that huge, though: the initial hashing of 30k messages takes about 400 ms, and when an update arrives it takes about 90 ms for the same 30k messages.
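To make the first suggestion concrete, here is a minimal sketch of the container change; `PendingClusters` and its methods are hypothetical names, not Envoy's real init-manager API. The point is only the container choice: removing via `std::list::remove_if()` scans the whole list, while erasing by cluster name from a hash map is amortized O(1).

```cpp
// Sketch only (hypothetical names, not Envoy code). remove_if() on std::list
// is O(n) per removal and O(n^2) across n clusters; a keyed erase is O(1).
#include <string>
#include <unordered_map>

class Cluster;  // Stand-in for Envoy's cluster type.

class PendingClusters {
public:
  void add(const std::string& name, Cluster* cluster) { pending_[name] = cluster; }

  // Called as each cluster finishes initialization; replaces the
  // secondary_init_clusters_.remove_if(...) scan with a keyed erase.
  void onClusterInitialized(const std::string& name) { pending_.erase(name); }

  bool empty() const { return pending_.empty(); }

private:
  // Keyed by cluster name instead of std::list<Cluster*>.
  std::unordered_map<std::string, Cluster*> pending_;
};
```

The double-hashing finding points the same way: compute the hash of each received `Cluster` proto once and reuse it at both comparison sites, which is what the #15241 commit listed above is about.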
@rojkov this is already supported for SotW (but not delta) xDS, see https://github.com/envoyproxy/envoy/blob/07c4c17be61c77d87d2c108b0775f2e606a7ae12/api/envoy/config/core/v3/config_source.proto#L107. I’m thinking this is something else we would prefer to default to true (but we need to do a deprecation dance to move to `BoolValue`). @adisuissa this might be another potential cause of buffer bloat in the issue you are looking at.
@ramaraochavali Alright, I’ll drop the close tag from the PR’s description to keep this issue open for now.