argo-workflows: Unexpected number of Get workflowtemplates API calls

hi all,

I am wondering about the calls the workflow-controller makes to get workflow templates.

I will give the data first and then state the question/proposal:

kubectl get wftmpl -n argo -o name  | wc -l

gives 35 live templates (I do not have any ClusterWorkflowTemplates). Some of those 35 templates are nested: some are DAGs whose tasks call other templates that themselves contain DAGs, with a maximum nesting depth of 2.

When I check the logs from the workflow-controller, I see this:

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | head -n 1
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | tail -n 1

gives

time="2023-05-25T21:56:28.052Z" level=info msg="Get workflowtemplates 200"
time="2023-05-25T22:20:48.421Z" level=info msg="Get workflowtemplates 200"

and

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | wc -l

gives 31229 calls.

So, in about 24 minutes there were roughly 31 K calls to the Kubernetes API to get workflow templates, in a deployment that has only 35 workflow templates. That works out to roughly 21 calls per second (see the quick check below). It seems like a lot for an object type that typically changes far less often (depending on the use case), and it certainly looks like too many calls for 35 templates that were completely static during those 24 minutes (in my case).
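
For a quick sanity check, the rate can be recomputed from the two timestamps and the count above. This is a throwaway snippet with the values copied from the grep output, not anything from the controller itself:

package main

import (
	"fmt"
	"time"
)

func main() {
	// First and last "Get workflowtemplates 200" log lines from the excerpt above.
	first, _ := time.Parse(time.RFC3339, "2023-05-25T21:56:28.052Z")
	last, _ := time.Parse(time.RFC3339, "2023-05-25T22:20:48.421Z")

	const calls = 31229.0
	window := last.Sub(first).Seconds() // ~1460 s

	fmt.Printf("%.1f Get workflowtemplates calls per second\n", calls/window)
	// Prints roughly 21.4 calls/s, for 35 templates that did not change.
}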

Proposal (assuming my guesses are not far off):

Would it be possible to cache those calls inside the workflow-controller, e.g. in a goroutine that fetches the templates and stores them in shared state, so that the other goroutines consult this shared workflow template information instead of hammering the Kubernetes API so hard?

A flag such as --workflow-template-cache 0s for the workflow-controller would be very nice; with it, each deployment could decide how much of its limited Kubernetes API call budget to spend on refreshing workflow templates. A sketch of the idea follows below.
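
Purely as a sketch of the shared-cache idea (not Argo's actual code; the package, type, and function names here are made up), assuming a TTL-based read-through cache guarded by an RWMutex:

package templatecache

import (
	"sync"
	"time"
)

// Template stands in for a WorkflowTemplate object.
type Template struct {
	Name string
}

// Cache is a TTL-based read-through cache shared by controller goroutines.
type Cache struct {
	mu        sync.RWMutex
	templates map[string]Template
	fetchedAt time.Time
	ttl       time.Duration                       // e.g. the value of the proposed --workflow-template-cache flag
	fetch     func() (map[string]Template, error) // e.g. a List call against the Kubernetes API
}

func New(ttl time.Duration, fetch func() (map[string]Template, error)) *Cache {
	return &Cache{ttl: ttl, fetch: fetch}
}

// Get returns the cached templates, hitting the API only when the TTL has expired.
func (c *Cache) Get() (map[string]Template, error) {
	c.mu.RLock()
	if c.templates != nil && time.Since(c.fetchedAt) < c.ttl {
		defer c.mu.RUnlock()
		return c.templates, nil
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have refreshed while we waited for the lock.
	if c.templates != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.templates, nil
	}
	fresh, err := c.fetch()
	if err != nil {
		return nil, err
	}
	c.templates = fresh
	c.fetchedAt = time.Now()
	return c.templates, nil
}

With the TTL set to 0s every Get falls through to the API, which matches the semantics the proposed flag would have. In practice a client-go informer, which maintains a local cache updated via a watch, would avoid the repeated GETs entirely.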

Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log

Here is the workflow-controller ConfigMap:
data:
  resourceRateLimit: |
    limit: 30
    burst: 5
  parallelism: "1000"
  namespaceParallelism: "1000"
  nodeEvents: |
    enabled: false
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      ttlStrategy:
        secondsAfterCompletion: 21600
        secondsAfterSuccess: 21600
        secondsAfterFailure: 21600
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: ${S3_ARGO_BUCKET}
      endpoint: s3.amazonaws.com
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 10000s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 90d
    skipMigration: false
    postgresql:
      host: argo-server-postgresql-backend.argo
      port: 5432
      database: postgres
      tableName: argo_workflows
      # the database secrets must be in the same namespace of the controller
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
      ssl: false
      # sslMode must be one of: disable, require, verify-ca, verify-full
      # you can find more information about those ssl options here: https://godoc.org/github.com/lib/pq
      sslMode: disable
Here is the workflow-controller deployment patch:
spec:
  revisionHistoryLimit: 3
  template:
    spec:
      nodeSelector:
        nodegroup_name: core
      containers:
        - name: workflow-controller
          command:
            - workflow-controller
            - --qps=30
            - --burst=60
            - --workflow-ttl-workers=8
            - --workflow-workers=50
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: 1m
          resources:
            requests:
              memory: "6100Mi"
              cpu: "1700m"
            limits:
              memory: "6100Mi"
              cpu: "1700m"

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 5
  • Comments: 23 (13 by maintainers)

Most upvoted comments

Please consider this bug as a high-severity one.

This phenomenon is probably stealing Kubernetes API QPS quota from the workflow-controller, to the point that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.

Sent a PR for this. https://github.com/argoproj/argo-workflows/pull/11343

Please continue monitoring it and let us know if you see additional issues.

No logging changes. Those calls may not be necessary.

I think so

Sounds like a bug given the high number of API calls. Updated to a bug so we can track and eventually diagnose the issue.