argo-cd: Load between controllers (argocd-application-controller) is not evenly distributed

Describe the bug

I have an ArgoCD High Availability setup in which I have also scaled the number of argocd-application-controller replicas as shown in the documentation.

To Reproduce

  • Follow the steps to deploy ArgoCD in HA mode
  • Edit the argocd-application-controller StatefulSet as shown below
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "3"

Expected behavior

I was expecting the load to be distributed across all three controller replicas, but only one took up all the load; the other two are sitting idle.

Screenshots

All pods running in HA mode

❯ k get po                        
NAME                                      READY   STATUS    RESTARTS   AGE
argocd-application-controller-0           1/1     Running   0          160m
argocd-application-controller-1           1/1     Running   0          160m
argocd-application-controller-2           1/1     Running   0          161m
argocd-dex-server-7b6f9b7f-qh4kv          1/1     Running   0          3h6m
argocd-redis-ha-haproxy-d6dbf6695-4q5cj   1/1     Running   0          3h4m
argocd-redis-ha-haproxy-d6dbf6695-4sh7k   1/1     Running   0          3h5m
argocd-redis-ha-haproxy-d6dbf6695-hjn2d   1/1     Running   0          3h4m
argocd-redis-ha-server-0                  2/2     Running   0          176m
argocd-redis-ha-server-1                  2/2     Running   0          177m
argocd-redis-ha-server-2                  2/2     Running   0          179m
argocd-repo-server-5f4d4775d4-4mw4j       1/1     Running   0          173m
argocd-repo-server-5f4d4775d4-vhgxk       1/1     Running   0          174m
argocd-server-86896bd76f-gz48t            1/1     Running   0          173m
argocd-server-86896bd76f-k5r9h            1/1     Running   0          174m

Screenshot of the pods' resource usage (2021-04-29_15-46-09)

Version

❯ argocd version                                                                                                                        
argocd: v2.0.1+33eaf11.dirty
  BuildDate: 2021-04-17T04:23:35Z
  GitCommit: 33eaf11e3abd8c761c726e815cbb4b6af7dcb030
  GitTreeState: dirty
  GoVersion: go1.16.3
  Compiler: gc
  Platform: darwin/amd64

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 57
  • Comments: 30 (7 by maintainers)

Most upvoted comments

Having an option to shard by something other than cluster would be much appreciated. Because I really dislike having single points of failure, I have an ArgoCD stack in each of my clusters, and due to this limitation I can only scale the Application Controller vertically, which is far from ideal.

ArgoCD 2.6.3 still has the same issue: a single application-controller instance uses all the CPU while the other instances are not loaded at all.

We have the same use case: we usually have 3k to 6k ArgoCD applications in our staging environment. The fact that sharding is per cluster instead of per application does not help much, because we deploy everything into the same cluster (in staging).

There’s no current effort to split the load of a single cluster across multiple shards.

This is clearly stated, but it's sad news.

Having one “hot” cluster very often overloads the single application-controller replica handling it, while the other replicas are idle. Scaling up resources for the single application-controller replica will also beef up the other replicas as each replica has the same resource request.
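
For illustration (a minimal sketch, not part of the original comment, with made-up request values): the resources block lives in the StatefulSet's single pod template, so whatever you request there applies to every replica, busy or idle.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: argocd-application-controller
        resources:
          requests:
            # hypothetical values; every replica gets the same request
            cpu: "2"
            memory: 4Gi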

It would be great to be able to balance the load of a single cluster by some means other than dedicating a completely separate argocd installation to each “hot” cluster.

Same scenario here: we have a very heavy application deployed on a single cluster, and adding another application-controller replica does not distribute the load evenly.

There are efforts and features which may be able to “cool down” a cluster by cutting out unnecessary work. ignoreResourceUpdates is one such feature: https://argo-cd.readthedocs.io/en/release-2.8/operator-manual/reconcile/
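
As a rough sketch of what that looks like in the argocd-cm ConfigMap (the keys below follow the linked 2.8 docs, but treat the exact values as illustrative and verify them against your version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Opt in to the feature
  resource.ignoreResourceUpdatesEnabled: "true"
  # Ignore status-only updates on Application resources so those changes
  # do not trigger extra application refreshes on the controller
  resource.customizations.ignoreResourceUpdates.argoproj.io_Application: |
    jsonPointers:
      - /status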

But splitting a single cluster across multiple shards will require deep knowledge of core Argo CD code and a proposal describing how that work could be effectively split. I expect that efforts to minimize unnecessary work on hot clusters will take priority, at least in the short- to medium-term.

Ah yeah, we are running on factories, so we have to use ArgoCD in a pull-based fashion: each cluster has its own ArgoCD instance that deploys to the local cluster. This means we only have one cluster and can only run ArgoCD in a hot-standby fashion rather than distributing the workload.

Which is a shame, because we would like to distribute the workload a bit more, especially given issues where a single node dies in a way that leaves the pod reporting healthy but unable to perform its work, so failover to the standby might not happen as quickly as we'd like.

Thanks for the input though! It's incredibly valuable for us.

I wonder if we could, in theory, load-balance this by creating many cluster configs that point to the same local cluster.
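
A minimal sketch of what one such extra cluster entry could look like, using the standard declarative cluster Secret format (the names here are made up, and whether Argo CD actually places duplicate entries for the same API server on different shards is untested):

apiVersion: v1
kind: Secret
metadata:
  name: in-cluster-shard-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  # hypothetical second registration of the local cluster
  name: in-cluster-1
  server: https://kubernetes.default.svc
  config: |
    {
      "tlsClientConfig": {
        "insecure": false
      }
    }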

I think that if you have a large number of apps, a random hash-mod sharding distribution within each cluster agent should, on average, level out a mix of large and small apps across the different agent pods.

Statistically, the more apps there are, the more the spread should even out thanks to the natural bell-curve distribution, and since this scaling problem is caused by having more apps, this should be fine in practice. I guess we'll see when it's implemented!