argo-workflows: Unexpected number of Get workflowtemplates API calls
Hi all,
I have a question about the calls the workflow-controller makes to get workflow templates.
I will give the data first and then state the question/proposal:
kubectl get wftmpl -n argo -o name | wc -l
gives 35 live templates (I do not have cluster workflow templates). Some of those 35 templates are nested: some are DAGs whose tasks call other templates that themselves contain DAGs, with a maximum nesting depth of 2.
When I look at the workflow-controller logs, I get this:
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
rg "Get workflowtemplates" | head -n 1
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
rg "Get workflowtemplates" | tail -n 1
gives
time="2023-05-25T21:56:28.052Z" level=info msg="Get workflowtemplates 200"
time="2023-05-25T22:20:48.421Z" level=info msg="Get workflowtemplates 200"
and
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
rg "Get workflowtemplates" | wc -l
gives 31229 calls.
So, in about 24 minutes there have been roughly 31k calls to the Kubernetes API to get workflow templates, in a deployment that has 35 workflow templates. That seems excessive for objects that typically change on much longer timescales (depending on the use case), and it certainly looks like too many calls for 35 templates that were completely static during those 24 minutes (as they were in my case).
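For scale, that works out to more than 20 API calls per second just for workflow templates. A throwaway Go snippet with the exact numbers taken from the log lines above:

package main

import (
    "fmt"
    "time"
)

func main() {
    // First and last "Get workflowtemplates" timestamps from the controller log above.
    first, _ := time.Parse(time.RFC3339Nano, "2023-05-25T21:56:28.052Z")
    last, _ := time.Parse(time.RFC3339Nano, "2023-05-25T22:20:48.421Z")
    const calls = 31229

    window := last.Sub(first).Seconds() // ~1460 seconds (about 24 minutes)
    fmt.Printf("%d calls over %.0fs = %.1f calls/s\n", calls, window, calls/window)
}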
Question/proposal (assuming my guesses above are not wrong):
Would it be possible to cache those calls inside the workflow-controller, in a goroutine that fetches the templates and keeps them in shared variables, so that the other goroutines consult this shared workflow template information instead of hammering the Kubernetes API so hard?
A flag like --workflow-template-cache 0s for the workflow-controller would be very nice; with it, each deployment could decide how much of its limited Kubernetes API call budget is spent on refreshing workflow templates.
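To make the idea concrete, here is a minimal sketch of what I mean, using client-go's dynamic shared informer (this is only an illustration of the caching pattern, not the controller's actual code): one informer keeps a local cache of WorkflowTemplates in sync through a single watch, and all other goroutines read from the lister instead of calling the API server.

package main

import (
    "context"
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/rest"
)

func main() {
    ctx := context.Background()
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflowtemplates"}

    // One List + long-lived Watch keeps the local cache fresh; readers never hit the API server.
    factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(dyn, 20*time.Minute, "argo", nil)
    lister := factory.ForResource(gvr).Lister()
    factory.Start(ctx.Done())
    factory.WaitForCacheSync(ctx.Done())

    // Any worker goroutine can now read templates from the shared local cache.
    tmpls, err := lister.ByNamespace("argo").List(labels.Everything())
    if err != nil {
        panic(err)
    }
    fmt.Printf("cached %d workflow templates\n", len(tmpls))
}

With this pattern the API server sees one initial List plus a long-lived Watch instead of tens of thousands of Gets; the 20-minute resync only replays the local cache, so it adds no extra API traffic.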
Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log
Here is the workflow-controller ConfigMap:
data:
  resourceRateLimit: |
    limit: 30
    burst: 5
  parallelism: "1000"
  namespaceParallelism: "1000"
  nodeEvents: |
    enabled: false
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      ttlStrategy:
        secondsAfterCompletion: 21600
        secondsAfterSuccess: 21600
        secondsAfterFailure: 21600
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: ${S3_ARGO_BUCKET}
      endpoint: s3.amazonaws.com
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 10000s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 90d
    skipMigration: false
    postgresql:
      host: argo-server-postgresql-backend.argo
      port: 5432
      database: postgres
      tableName: argo_workflows
      # the database secrets must be in the same namespace as the controller
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
      ssl: false
      # sslMode must be one of: disable, require, verify-ca, verify-full
      # you can find more information about those ssl options here: https://godoc.org/github.com/lib/pq
      sslMode: disable
Here is the workflow-controller deployment patch:
spec:
  revisionHistoryLimit: 3
  template:
    spec:
      nodeSelector:
        nodegroup_name: core
      containers:
        - name: workflow-controller
          command:
            - workflow-controller
            - --qps=30
            - --burst=60
            - --workflow-ttl-workers=8
            - --workflow-workers=50
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: 1m
          resources:
            requests:
              memory: "6100Mi"
              cpu: "1700m"
            limits:
              memory: "6100Mi"
              cpu: "1700m"
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 5
- Comments: 23 (13 by maintainers)
Commits related to this issue
- fix: Correct limit in controller List API calls. Fixes #11134 (#11343) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> — committed to argoproj/argo-workflows by terrytangyuan a year ago
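For context on the commit above (this is only my reading of the commit title, not its actual diff): Kubernetes List calls can be paginated with a Limit and a Continue token, and a limit that is too small turns one logical list into many separate API requests. A sketch of paginated listing of workflowtemplates with client-go's dynamic client, using a hypothetical page size of 500:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
)

func main() {
    ctx := context.Background()
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflowtemplates"}

    // Each loop iteration is one API call; a small Limit means many pages for the same data.
    pages, opts := 0, metav1.ListOptions{Limit: 500}
    for {
        list, err := dyn.Resource(gvr).Namespace("argo").List(ctx, opts)
        if err != nil {
            panic(err)
        }
        pages++
        if list.GetContinue() == "" {
            break
        }
        opts.Continue = list.GetContinue()
    }
    fmt.Printf("listed workflowtemplates in %d page(s)\n", pages)
}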
Please consider this bug high severity.
This phenomenon is probably stealing Kubernetes API QPS quota from the workflow controller in such a way that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.
Added https://github.com/argoproj/argo-workflows/issues/11372
Sent a PR for this. https://github.com/argoproj/argo-workflows/pull/11343
Please continue monitoring it and let us know if you see additional issues.
No logging changes. Those calls may not be necessary.
I think so
Sounds like a bug given the high number of API calls. Updated to a bug so we can track and eventually diagnose the issue.