argo-workflows: High RAM Usage and Frequent Crashes with Argo Workflows Controller on Large Scale Workflows
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with `:latest`
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
Description: We are observing significant issues with the Argo Workflows Controller while handling a large number of workflows in parallel.
Environment:
- Argo Workflows version: 3.4.11
- Node count: 300+
- Parallel workflows: 5000+
What happened: The Argo Workflows Controller's memory consumption grows rapidly, sometimes surpassing 100GB. Despite this excessive memory usage, the controller crashes frequently. Notably, workflows are not deleted after being archived, which may contribute to the memory usage. The controller does not log any specific error messages before these crashes, making it hard to pinpoint the underlying cause.
What you expected to happen: We expected the Argo Workflows Controller to handle the parallel execution of 5000+ workflows across 300+ nodes without such a drastic increase in RAM consumption. We also expected a more resilient behavior, not prone to unexpected crashes, and better error logging for troubleshooting.
How to reproduce it (as minimally and precisely as possible): Set up an environment with 300+ nodes. Launch 5000+ workflows in parallel. Monitor the RAM usage of the Argo Workflows Controller and note any unexpected crashes.
Additional context: Given the scale at which we are operating, it’s critical for our operations that Argo can handle such workloads efficiently. Any assistance in resolving this issue or guidance on potential optimizations would be greatly appreciated.
Version
v3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
-
Logs from the workflow controller
-
Logs from in your workflow’s wait container
-
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Reactions: 4
- Comments: 38 (16 by maintainers)
Commits related to this issue
- fix: Resource version incorrectly overridden for wfInformer list requests. Fixes #11948 Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> — committed to argoproj/argo-workflows by terrytangyuan 8 months ago
- fix: Resource version incorrectly overridden for wfInformer list requests. Fixes #11948 (#12133) — committed to akuity/argo-workflows by terrytangyuan 8 months ago
- fix: Resource version incorrectly overridden for wfInformer list requests. Fixes #11948 (#12133) — committed to argoproj/argo-workflows by terrytangyuan 8 months ago
Please try out v3.4.14 which fixes both this issue and the node missing issue.
We can revert that in future 3.4 patches. It doesn’t affect usage. Tracking in https://github.com/argoproj/argo-workflows/issues/11851.
@terrytangyuan sounds good!
After the build is ready, I can arrange the version upgrade, maybe tomorrow (11/7). With the new version I will check whether the k8s API calls and the function work as expected and get back to you. However, our high load appears to start on Thursday night (Taipei time), so we might need to wait until 11/10 to confirm whether the issue of succeeded workflows accumulating is gone.
@terrytangyuan
I discovered a discrepancy in the watch requests generated by wfInformer in workflow/controller/controller compared to workflow/cron/controller when communicating with the k8s API server. Specifically, the resourceVersion parameter is missing in the requests from workflow/controller/controller. This omission seems to prompt the k8s API server to return WatchEvents from the latest version, potentially leading to missed WatchEvents by wfInformer, such as delete events. As a result, succeeded workflows may remain in wfInformer indefinitely.
workflow/controller/controller missing resourceVersion:
workflow/cron/controller:
This behavior appears to have been introduced in PR #11343 (v3.4.9) where tweakListOptions is utilized for both list and watch requests. To validate this, we reverted to v3.4.8 and observed that the watch requests now include the resourceVersion parameter:
We are currently monitoring to see if the count of succeeded workflows continues to rise over time.
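For readers following along, here is a minimal, hypothetical client-go sketch of the failure mode described above. It is not the actual Argo controller code, and it uses Pods rather than the Workflow CRD so it doesn't need the generated Argo clientset; the point is only that wiring a single tweak function into both the list and watch paths of a ListWatch overwrites the resourceVersion the reflector passes to watch requests, so the watch starts from "latest" and intermediate events (such as deletes) can be missed.

```go
// Hypothetical sketch only; the names newListWatch and clearResourceVersion are illustrative.
package informerdemo

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newListWatch wires one tweak function into both the list and the watch path,
// which is the pattern that produces the symptom described above.
func newListWatch(client kubernetes.Interface, namespace string, tweak func(*metav1.ListOptions)) *cache.ListWatch {
	return &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			tweak(&options) // harmless here: the initial list establishes a fresh resourceVersion
			return client.CoreV1().Pods(namespace).List(context.TODO(), options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			// Problematic here: the reflector sets options.ResourceVersion to the
			// version returned by the last list, and the tweak below wipes it, so
			// the API server starts the watch from the latest state instead.
			tweak(&options)
			return client.CoreV1().Pods(namespace).Watch(context.TODO(), options)
		},
	}
}

// clearResourceVersion mirrors the observed behaviour: resourceVersion missing
// from the controller's watch requests.
func clearResourceVersion(options *metav1.ListOptions) {
	options.ResourceVersion = ""
}
```

The referenced fix appears to restrict the override to list requests, so watch requests keep the resourceVersion supplied by the reflector.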
Adding the following settings to the controller config fixed the cleanup of the already-archived workflows and the RAM consumption:
Anyway, the controller is still repeatedly restarting. I will update k8s to 1.27.3 and give you feedback on whether that fixes this issue as well.
Thanks for pointing this out! It looks like the v3.5 upgrade note got accidentally cherry-picked into 3.4.12+: https://github.com/argoproj/argo-workflows/commit/46297cad798cb48627baf23d548a2c7e595ed316 @terrytangyuan not sure how you want to handle that, since the history and two versions are now affected.
Thank you! We can track the node missing issue separately as I don’t think that’s related to this issue. https://github.com/argoproj/argo-workflows/issues/12165
Hi @terrytangyuan, here are some updates: I've confirmed both `argo-server` and `workflow-controller` are using the same image version, `dev-fix-informer-3.4.13`. I was curious whether this note also appears in v3.4.13, so I tried it and found that in `v3.4.13`, argo-ui does display the release note of v3.5.
In addition to this odd behavior, we noticed another UI bug in this new version. Some nodes of the workflows are missing in the Argo UI graph view, but the workflows actually appear to run correctly and we can see the pods in the timeline view. I guess this issue was introduced recently, since we didn't see it when using v3.4.11.
Sorry, given the above concern, which would impact our service users, I'm afraid I cannot arrange the version upgrade to our production environment at this moment to monitor whether it fixes the issue of succeeded workflows accumulating. But with build `dev-fix-informer-3.4.13` in our dev environment, I did see that the k8s API query for the workflow watch now has `resourceVersion` added.
@carolkao I guess you forgot to update the image for argo-server. Here's the only difference from v3.4.13: https://github.com/argoproj/argo-workflows/compare/v3.4.13...dev-fix-informer-3.4.13
@carolkao Good catch. It looks like not all changes were included when cherry-picking. I am building `dev-fix-informer-3.4.13`. Once all builds finish, a new image tag `dev-fix-informer-3.4.13` should be available.
This issue becomes worse when the number of workflow resources keeps increasing while the archiving can't keep up. At a certain point the k8s API is overwhelmed and calling `get workflows.argoproj.io` is very slow. When the workflow controller restarts, it fetches all workflow resources; if the k8s API is overwhelmed like this, the controller fails:
Because the request to the k8s API times out, the workflow controller crashes and tries to restart again. The same problem occurs again and the workflow controller ends up in a restart loop. The only way to recover is to manually delete workflow resources until the k8s API response times for `get workflows.argoproj.io` decrease.
@jkldrr
I observed a similar issue on our site after upgrading from 3.3.6 to 3.4.11:
Based on the above observations, I have a hypothesis:
I made some changes and it looks normal now:
Hope these changes help in your case. Good luck.
---- Update ----- The above change didn't work well. After a few days, the succeeded workflows started increasing again and the workflow controller tries to delete them every 20 minutes.
Here is an example of the workflow controller repeatedly deleting an already-deleted workflow: workflow_controller_logs_delete_workflows.txt
We analyzed the issue further after applying your suggested changes. We now see frequent restarts of the controller and found that they are caused by failing health checks, due to the same issue described here: Liveness probe fails with 500 and "workflow never reconciled"
Running out of CPU doesn't seem to be the problem; it uses at most 20% of its CPU resources. We adjusted the workflow workers to match the cleanup workers (both are now at 32), which didn't change the behaviour. Some new `fatal` logs appeared last night:
Do you have any idea how to debug this further? I'm running out of ideas…
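For reference, the worker pools discussed above correspond to workflow-controller command-line flags. Below is a sketch of the relevant Deployment args, assuming the flag names documented for the 3.4 controller; the values are purely illustrative, not recommendations:

```yaml
# Excerpt of a workflow-controller Deployment; adjust values to your own load.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          image: quay.io/argoproj/workflow-controller:v3.4.14
          args:
            - --workflow-workers=32      # reconciliation workers
            - --workflow-ttl-workers=8   # workers deleting workflows past their TTL
            - --pod-cleanup-workers=32   # workers deleting completed pods
```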
It seems that, over the weekend, the workflow controller again stopped cleaning up the archived workflows… After it crashed and restarted, it was cleaning up the workflows again. Found nothing unusual at log levels `error` and `fatal` in the logs.
PS: I think it was OOM-killed by K8s.
Yes, it seems to be crashing now after this log message.
Yea GC/TTL sounds like what you’d want to tune for deleting archived workflows from your cluster. There are more detailed docs in the Operator Guide, such as for GC: https://argoproj.github.io/argo-workflows/cost-optimisation/#limit-the-total-number-of-workflows-and-pods and other scalability tuneables: https://argoproj.github.io/argo-workflows/scaling/.
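To make the GC/TTL pointers above concrete, here is a sketch of the relevant workflow-controller-configmap knobs, assuming the keys documented for the 3.4 line; the values are illustrative, not tuned recommendations:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Cap how many completed Workflow resources are kept in the cluster.
  # Workflows beyond the cap are deleted (so make sure archiving is set up
  # first if you still need their history).
  retentionPolicy: |
    completed: 1000
    failed: 500
    errored: 500
  # Expire rows in the workflow archive database (connection settings omitted).
  persistence: |
    archiveTTL: 14d
  # Default TTL and pod GC so finished Workflows and their pods are cleaned up automatically.
  workflowDefaults: |
    spec:
      ttlStrategy:
        secondsAfterCompletion: 3600
      podGC:
        strategy: OnPodCompletion
```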
I’m also working on horizontally scaling the Controller in #9990
I think the root cause may be the GC not clearing some data. Maybe this will help you: https://argoproj.github.io/argo-workflows/running-at-massive-scale/
But in my use case, we are trying to migrate from Azkaban to Argo Workflows. Before that, we ran concurrent workflows much like the Pipekit article does, and I think it managed the resources pretty well (using GKE 1.24).
But yeah, you can try running a test scenario to load test it first and track where the issue comes from.
What kind of workflows do you run? Did you load test it, maybe with 5000 simple cowsay workflows running concurrently?
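As a concrete example of that kind of load test, here is the standard whalesay/cowsay hello-world Workflow, with an illustrative ttlStrategy added so finished test runs are garbage-collected; submitting a few thousand of these concurrently is one way to separate controller-side scaling issues from the real workload:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: load-test-
spec:
  entrypoint: cowsay
  ttlStrategy:
    secondsAfterCompletion: 300   # delete finished test workflows after 5 minutes
  templates:
    - name: cowsay
      container:
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello from a load test"]
```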
I think this may help you with that https://pipekit.io/blog/upgrade-to-kubernetes-127-take-advantage-of-performance-improvements