argo-workflows: Controller crashes when there are a lot of completed workflows
The Argo controller seems to be crashing when there are a lot of completed workflows (~6000). I know I could use a workflow TTL and archiving to avoid keeping so many completed workflows around, but 6k doesn't seem like an unreasonable number. Is this a known issue?
Argo v3.2.4
Here are the controller logs:
```
time="2022-03-29T22:13:33.173Z" level=info msg="List workflows 200"
Trace[518776447]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.5/tools/cache/reflector.go:167 (29-Mar-2022 22:13:01.766) (total time: 60001ms):
Trace[518776447]: [1m0.00142874s] [1m0.00142874s] END
pkg/mod/k8s.io/client-go@v0.21.5/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 33; INTERNAL_ERROR
time="2022-03-29T22:14:28.974Z" level=info msg="List workflows 200"
time="2022-03-29T22:14:29.045Z" level=info msg=healthz age=5m0s err="workflow never reconciled: cluster-alignment-template-2chwh" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
time="2022-03-29T22:14:35.537Z" level=info msg="List workflows 200"
Trace[1414262326]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.5/tools/cache/reflector.go:167 (29-Mar-2022 22:14:03.963) (total time: 60001ms):
Trace[1414262326]: [1m0.001385813s] [1m0.001385813s] END
pkg/mod/k8s.io/client-go@v0.21.5/tools/cache/reflector.go:167: Failed
```
About this issue
- State: closed
- Created 2 years ago
- Reactions: 17
- Comments: 22 (11 by maintainers)
We faced a similar problem. The workflow controller in our cluster kept restarting because it failed its liveness probe. There are over 6,000 Workflow objects in the cluster. Argo Workflows version: v3.3.1
We decided to turn off the liveness probe. The healthz endpoint behind the liveness probe repeatedly sends very expensive requests to the k8s API server: it lists all Workflows without using informers.
https://github.com/argoproj/argo-workflows/blob/61211f9db1568190dd46b7469fa79eb6530bba73/workflow/controller/healthz.go#L33
This implementation not only causes liveness probe failures but also destabilizes the k8s API server with these high-load requests (roughly the pattern sketched below).
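For illustration, here is a minimal sketch (hypothetical code, not the actual controller implementation) of the per-probe LIST pattern described above. The GroupVersionResource and label selector are taken from the Workflow CRD and the healthz log line earlier in this issue; everything else is made up for the example:

```go
// Hypothetical sketch, not the actual controller code: the expensive pattern
// described above, where every liveness probe issues a fresh LIST of all
// un-reconciled Workflows directly against the API server.
package healthzsketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var wfGVR = schema.GroupVersionResource{
	Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows",
}

// directListHealthz lists Workflows on every probe. With thousands of
// Workflow objects, each call transfers a large response and can take
// close to (or longer than) the probe timeout, as in the traces above.
func directListHealthz(ctx context.Context, dyn dynamic.Interface) error {
	list, err := dyn.Resource(wfGVR).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		// Selector copied from the healthz log line in the original report.
		LabelSelector: "!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid",
	})
	if err != nil {
		return err
	}
	fmt.Printf("probe listed %d workflows\n", len(list.Items))
	return nil
}
```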
I would like an option for a more lightweight healthz endpoint implementation, such as skipping the workflow listing entirely or listing workflows through an informer.
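A minimal sketch of what the informer-backed option could look like (hypothetical, not a patch to Argo): the probe handler below reads Workflows from a shared informer cache instead of issuing a fresh LIST against the API server on every probe. The port, resync interval, and 5-minute age threshold are assumptions for illustration.

```go
// Hypothetical sketch of the requested informer-based variant: /healthz reads
// Workflows from a local informer cache rather than sending a LIST to the
// API server on every probe. Illustrative only, not the controller's code.
package main

import (
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

var wfGVR = schema.GroupVersionResource{
	Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows",
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A shared informer maintains a local cache of Workflow objects; the
	// probe handler only reads from this cache.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(dyn, 10*time.Minute, metav1.NamespaceAll, nil)
	lister := factory.ForResource(wfGVR).Lister()
	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Same selector as the healthz log line above: Workflows that the
	// controller has not labelled with a phase yet.
	selector, err := labels.Parse("!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid")
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		objs, err := lister.List(selector)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Report unhealthy only if some un-reconciled Workflow has been
		// waiting longer than maxAge (mirrors the age=5m0s in the log).
		const maxAge = 5 * time.Minute
		for _, obj := range objs {
			acc, err := meta.Accessor(obj)
			if err != nil {
				continue
			}
			if time.Since(acc.GetCreationTimestamp().Time) > maxAge {
				http.Error(w, fmt.Sprintf("workflow never reconciled: %s/%s", acc.GetNamespace(), acc.GetName()), http.StatusInternalServerError)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
	// Port chosen for illustration.
	_ = http.ListenAndServe(":6060", nil)
}
```

Because the lister only touches the in-memory cache, probe latency no longer depends on how many completed Workflows exist, and the API server only sees the informer's initial LIST plus the ongoing WATCH stream.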