argo-cd: Load between controllers (argocd-application-controller) is not evenly distributed
Describe the bug
I have an ArgoCD High Availability setup in which I have also scaled the number of replicas of argocd-application-controller as described in the documentation.
To Reproduce
- Follow the steps to deploy ArgoCD in HA mode
- Edit the argocd-application-controller StatefulSet as below:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
Expected behavior
I was expecting the load to be distributed across all three controller replicas, but only one took up all the load while the other two sat idle.
Screenshots
All pods running in HA mode
❯ k get po
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 160m
argocd-application-controller-1 1/1 Running 0 160m
argocd-application-controller-2 1/1 Running 0 161m
argocd-dex-server-7b6f9b7f-qh4kv 1/1 Running 0 3h6m
argocd-redis-ha-haproxy-d6dbf6695-4q5cj 1/1 Running 0 3h4m
argocd-redis-ha-haproxy-d6dbf6695-4sh7k 1/1 Running 0 3h5m
argocd-redis-ha-haproxy-d6dbf6695-hjn2d 1/1 Running 0 3h4m
argocd-redis-ha-server-0 2/2 Running 0 176m
argocd-redis-ha-server-1 2/2 Running 0 177m
argocd-redis-ha-server-2 2/2 Running 0 179m
argocd-repo-server-5f4d4775d4-4mw4j 1/1 Running 0 173m
argocd-repo-server-5f4d4775d4-vhgxk 1/1 Running 0 174m
argocd-server-86896bd76f-gz48t 1/1 Running 0 173m
argocd-server-86896bd76f-k5r9h 1/1 Running 0 174m
Screenshot of the pods' resource usage
Version
❯ argocd version
argocd: v2.0.1+33eaf11.dirty
BuildDate: 2021-04-17T04:23:35Z
GitCommit: 33eaf11e3abd8c761c726e815cbb4b6af7dcb030
GitTreeState: dirty
GoVersion: go1.16.3
Compiler: gc
Platform: darwin/amd64
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 57
- Comments: 30 (7 by maintainers)
Having an option to shard by something other than cluster would be much appreciated. Because I really dislike having single points of failure, I have an ArgoCD stack in each of my clusters, and due to this limitation I can only scale the Application Controller vertically, which is far from ideal.
ArgoCD version 2.6.3 still has the same issue. A single instance of the app controller uses all the CPU while the other instances are not loaded at all.
We have the same use case as well: we usually have 3 to 6k ArgoCD applications in our staging environment. The fact that sharding is on a per-cluster basis instead of per-app is not helping much, because we deploy everything into the same cluster (in staging).
This is clearly stated, but it's sad news.
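For what it's worth, when apps are spread over several clusters the per-cluster sharding can at least be steered by hand: the declarative cluster Secret accepts an optional shard field that pins the cluster to a specific controller replica. A rough sketch (the name, server, and credentials below are placeholders); it doesn't help when everything runs in one cluster, though:

apiVersion: v1
kind: Secret
metadata:
  name: staging-cluster                       # placeholder name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: staging
  server: https://staging.example.com:6443    # placeholder API server URL
  shard: "1"                                  # pin this cluster to controller shard 1
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {"insecure": false}
    }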
Having one “hot” cluster very often overloads the single application-controller replica handling it, while the other replicas are idle. Scaling up resources for the single application-controller replica will also beef up the other replicas as each replica has the same resource request.
It would be great to be able to balance the load of a single cluster across replicas, rather than dedicating a completely separate ArgoCD installation to each “hot” cluster.
Same scenario here: we have a very heavy application deployed on a single cluster, and adding another application controller replica does not distribute the load evenly.
There are efforts and features which may be able to “cool down” a cluster by cutting out unnecessary work. ignoreResourceUpdates is one such feature: https://argo-cd.readthedocs.io/en/release-2.8/operator-manual/reconcile/
But splitting a single cluster across multiple shards will require deep knowledge of core Argo CD code and a proposal describing how that work could be effectively split. I expect that efforts to minimize unnecessary work on hot clusters will take priority, at least in the short- to medium-term.
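For reference, the linked page configures this through the argocd-cm ConfigMap. A minimal sketch, assuming the 2.8 keys (double-check the docs for your version, as the exact keys may differ):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Turn the feature on (it is opt-in in 2.8).
  resource.ignoreResourceUpdatesEnabled: "true"
  # Skip reconciliation when only the status subtree of a tracked resource changes.
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status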
Ah yeah, we are running in factories, so we have to use ArgoCD in a pull-based fashion: each cluster has its own ArgoCD instance that deploys to the local cluster. This means we only have one cluster and can only use ArgoCD in a hot-standby fashion rather than distribute the workloads.
Which is a shame, because we would like to distribute the workload a bit more, especially given issues where a single node dies in a way where the pod still reports as healthy but is unable to perform its work, so failover to the standby might not happen as quickly as we'd like.
Thanks for the input though! It’s incredibly valuable for us
I wonder if we could, in theory, load-balance this by creating many cluster configs pointing to the same local cluster.
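Something like the sketch below, repeated under several cluster names (the names and credentials are placeholders, and I haven't verified that the controller deals gracefully with multiple secrets pointing at the same API server). Applications would then be spread by targeting the different cluster names:

apiVersion: v1
kind: Secret
metadata:
  name: in-cluster-a                          # placeholder; repeat as in-cluster-b, in-cluster-c, ...
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: in-cluster-a
  server: https://kubernetes.default.svc
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {"insecure": false, "caData": "<base64 CA bundle>"}
    }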
I think if you have a large number of apps, then a random hash-mod sharding distribution within each cluster agent should, on average, level out a mix of large and small apps between different agent pods.
Statistically, the more apps there are, the more the spread should even out due to the natural bell-curve distribution, and since this scaling problem is caused by having more apps, this should be fine in practice. I guess we'll see when it's implemented!