argo-workflows: Unexpected number of Get workflowtemplates API calls

hi all,

I am wondering about the calls the workflow-controller makes to get workflow templates.

I will give the data first and then state the question/proposal:

kubectl get wftmpl -n argo -o name  | wc -l

gives 35 live templates (I do not have any ClusterWorkflowTemplates). Some of those 35 templates are nested: some are DAGs whose tasks call other templates that themselves contain DAGs, with a maximum nesting depth of 2.

When I check the logs from the workflow-controller, I see this:

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | head -n 1
cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | tail -n 1

gives

time="2023-05-25T21:56:28.052Z" level=info msg="Get workflowtemplates 200"
time="2023-05-25T22:20:48.421Z" level=info msg="Get workflowtemplates 200"

and

cat logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log | \
    rg "Get workflowtemplates" | wc -l

gives 31229 calls.

So, in about 24 minutes there were roughly 31 K calls to the Kubernetes API to get workflow templates, in a deployment that has only 35 workflow templates. That works out to roughly 21 calls per second (see the quick check below). It seems like a lot for an object type that typically changes far less often (depending on the use case), and it certainly looks like too many calls for 35 templates that were completely static during those 24 minutes (in my case).
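
For a quick sanity check, the rate can be recomputed from the two timestamps and the count above. This is a throwaway snippet with the values copied from the grep output, not anything from the controller itself:

package main

import (
	"fmt"
	"time"
)

func main() {
	// First and last "Get workflowtemplates 200" log lines from the excerpt above.
	first, _ := time.Parse(time.RFC3339, "2023-05-25T21:56:28.052Z")
	last, _ := time.Parse(time.RFC3339, "2023-05-25T22:20:48.421Z")

	const calls = 31229.0
	window := last.Sub(first).Seconds() // ~1460 s

	fmt.Printf("%.1f Get workflowtemplates calls per second\n", calls/window)
	// Prints roughly 21.4 calls/s, for 35 templates that did not change.
}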

Proposal (assuming my guesses are not far off):

Would it be possible to cache those calls inside the workflow-controller, e.g. in a goroutine that fetches the templates and stores them in shared state, so that the other goroutines consult this shared workflow template information instead of hammering the Kubernetes API so hard?

A flag such as --workflow-template-cache 0s for the workflow-controller would be very nice; with it, each deployment could decide how much of its limited Kubernetes API call budget to spend on refreshing workflow templates. A sketch of the idea follows below.
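
Purely as a sketch of the shared-cache idea (not Argo's actual code; the package, type, and function names here are made up), assuming a TTL-based read-through cache guarded by an RWMutex:

package templatecache

import (
	"sync"
	"time"
)

// Template stands in for a WorkflowTemplate object.
type Template struct {
	Name string
}

// Cache is a TTL-based read-through cache shared by controller goroutines.
type Cache struct {
	mu        sync.RWMutex
	templates map[string]Template
	fetchedAt time.Time
	ttl       time.Duration                       // e.g. the value of the proposed --workflow-template-cache flag
	fetch     func() (map[string]Template, error) // e.g. a List call against the Kubernetes API
}

func New(ttl time.Duration, fetch func() (map[string]Template, error)) *Cache {
	return &Cache{ttl: ttl, fetch: fetch}
}

// Get returns the cached templates, hitting the API only when the TTL has expired.
func (c *Cache) Get() (map[string]Template, error) {
	c.mu.RLock()
	if c.templates != nil && time.Since(c.fetchedAt) < c.ttl {
		defer c.mu.RUnlock()
		return c.templates, nil
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have refreshed while we waited for the lock.
	if c.templates != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.templates, nil
	}
	fresh, err := c.fetch()
	if err != nil {
		return nil, err
	}
	c.templates = fresh
	c.fetchedAt = time.Now()
	return c.templates, nil
}

With the TTL set to 0s every Get falls through to the API, which matches the semantics the proposed flag would have. In practice a client-go informer, which maintains a local cache updated via a watch, would avoid the repeated GETs entirely.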

Here are the logs -> logs-from-workflow-controller-in-workflow-controller-67bb779df-254qg.log

Here is the workflow-controller ConfigMap:
data:
  resourceRateLimit: |
    limit: 30
    burst: 5
  parallelism: "1000"
  namespaceParallelism: "1000"
  nodeEvents: |
    enabled: false
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodCompletion
      ttlStrategy:
        secondsAfterCompletion: 21600
        secondsAfterSuccess: 21600
        secondsAfterFailure: 21600
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: ${S3_ARGO_BUCKET}
      endpoint: s3.amazonaws.com
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 10000s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 90d
    skipMigration: false
    postgresql:
      host: argo-server-postgresql-backend.argo
      port: 5432
      database: postgres
      tableName: argo_workflows
      # the database secrets must be in the same namespace of the controller
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
      ssl: false
      # sslMode must be one of: disable, require, verify-ca, verify-full
      # you can find more information about those ssl options here: https://godoc.org/github.com/lib/pq
      sslMode: disable
Here is the workflow-controller deployment patch:
spec:
  revisionHistoryLimit: 3
  template:
    spec:
      nodeSelector:
        nodegroup_name: core
      containers:
        - name: workflow-controller
          command:
            - workflow-controller
            - --qps=30
            - --burst=60
            - --workflow-ttl-workers=8
            - --workflow-workers=50
          env:
            - name: DEFAULT_REQUEUE_TIME
              value: 1m
          resources:
            requests:
              memory: "6100Mi"
              cpu: "1700m"
            limits:
              memory: "6100Mi"
              cpu: "1700m"

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 5
  • Comments: 23 (13 by maintainers)

Most upvoted comments

Please consider this bug as a high-severity one.

This phenomenon is probably stealing Kubernetes API QPS quota from the workflow-controller, to the point that cron workflows are sometimes not launched. In any system, reliability problems in the cron daemon are catastrophic; the cron daemon needs to be as reliable as the sun rising in the morning.

Sent a PR for this. https://github.com/argoproj/argo-workflows/pull/11343

Please continue monitoring it and let us know if you see additional issues.

No logging changes. Those calls may not be necessary.

I think so

Sounds like a bug given the high number of API calls. Updated to a bug so we can track and eventually diagnose the issue.