argo-workflows: ListWorkflows causes server to hang when there are lots of archived workflows
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with `:latest`
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out.
Scanning the code, it looks like the `LoadWorkflows` code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.
As a workaround, we’ve reduced the archive TTL from 14 days to 1 day, and the endpoint now responds before timing out, but it is still pretty slow.
Version
v3.5.0
About this issue
- State: open
- Created 8 months ago
- Reactions: 21
- Comments: 50 (40 by maintainers)
@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).
I propose that we add a dropdown to allow users to select whether to display:
Additional requirements:
Motivations for this proposal:
Any thoughts?
I am reverting related change https://github.com/argoproj/argo-workflows/pull/12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636 here. We can probably release a patch version next week https://github.com/argoproj/argo-workflows/issues/11997#issuecomment-1775681491.
I think the underlying cause is the attempt to show normal workflows and archived workflows together, but I don’t agree with showing both together. Even in the API, the two kinds of workflows are separated, so showing them together in the UI feels rather confusing.
With the unified workflow list API for both live + archived workflows we have to do the following:
Number 1 is a challenge because performing LIST all workflows for every request will be taxing on K8s API server as users use the UI. Listing all workflows for the purposes of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:
Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.
Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.
If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.
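The ordered-list optimization described above could look roughly like the following. This is a minimal Python sketch, not Argo's implementation: `Workflow`, `OrderedWorkflowIndex`, and the `on_add` / `on_update` / `on_delete` method names are hypothetical stand-ins for handlers that, in Go, would be registered as the informer's AddFunc, UpdateFunc, and DeleteFunc.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    # Hypothetical, simplified stand-in for a Workflow object.
    uid: str
    name: str
    created: float  # creationTimestamp as a Unix timestamp


class OrderedWorkflowIndex:
    """Keeps workflows sorted newest-first, updated from informer-style events."""

    def __init__(self):
        self._by_uid = {}  # uid -> Workflow
        self._keys = []    # sorted list of (-created, uid), so newest comes first

    def on_add(self, wf):
        # Replace any stale entry for the same UID before inserting.
        if wf.uid in self._by_uid:
            self.on_delete(self._by_uid[wf.uid])
        self._by_uid[wf.uid] = wf
        bisect.insort(self._keys, (-wf.created, wf.uid))

    def on_update(self, old, new):
        self.on_delete(old)
        self.on_add(new)

    def on_delete(self, wf):
        current = self._by_uid.pop(wf.uid, None)
        if current is None:
            return
        key = (-current.created, current.uid)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def page(self, offset, limit):
        # Slicing the pre-sorted keys avoids re-sorting on every request.
        return [self._by_uid[uid] for _, uid in self._keys[offset:offset + limit]]
```

With this shape, serving a page is a slice over an already sorted list, while the sorting cost is paid incrementally as events arrive. It still holds every in-cluster workflow in memory, which is the trade-off discussed above.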
I mentioned above that I think checkboxes in the left sidebar would make more sense than a select box. I also mentioned that we could hide it as “Extra Filters” in a modal as well.
Checkboxes and label being:
WORKFLOW STORAGE
We could have a little info icon next to the “Archived” checkbox linking to the Workflow Archive docs as well. We could also explain within info icons that “In-cluster” means as a k8s resource stored in etcd and that “Archived” means stored as a row in an Archive DB
The plan is to make them scale properly, this issue is still very much listed as a bug. We also did discuss this in the most recent Contributor Meeting and a solid use-case did pop up there for filtering out Archived Workflows: CI wherein new commits cancel existing Workflows which are then immediately archived – quick sequential commits can cause a lot of clutter. Similar fast-updating use-cases fit that too. The converse I’m not as sure of on a use-case, showing only Archived Workflows and no In-cluster Workflows
And currently the PR only has a page that shows `Workflows` and `Archived Workflows` combined.
Would it be a good idea to add a filter or page that only shows `Workflows` and `Archived Workflows` separately?
Just reverted back to `v3.5.0-rc1` due to this. Perhaps we can add the default started time back?
I’m checking the logic, but regarding the archived object’s label being `Persisted`, is it possible to solve the duplicate issue by filtering the workflows?
I didn’t want to intrude initially. Wanted to make sure you had time to work with it. But if we are now considering what are effectively total reverts, then I think it would be wise to have more contributors take a look before going down that direction.
This is unfortunately an overly optimistic assumption that doesn’t hold up for all cases. It is mostly correct as a simplifying assumption, but not entirely. For example, even in “robust” deployments, it is very common to keep completed Workflows in k8s/etcd for some period of time before GC’ing them. This is especially the case for failed or errored Workflows, where someone may want to manually investigate the Pods etc (similarly to a dead letter queue). There’s also the race conditions when Workflows are in the process of being Archived. Both can exist simultaneously during that time. Similarly there is the case of retrying an Archived Workflow – retries are a mutable operation (compared to resubmit, which creates a new Workflow), so the Workflow is no longer “static” in that case.
There is actually specific code to deal with this edge case of having a Workflow in both the Archive and in k8s/etcd, c.f. #11336, #11371, etc.
And then of course there is also the case of not having configured a Workflow Archive at all.
I’m not so sure that this would be considered a “feature regression”. The API and UI were intentionally unified in #11121. As per the original issue #10781, an Argo user (not necessarily an operator) generally does not know about or care about the difference between the two.
Having optional checkboxes (not a dropdown) could make sense as those are then just “filters”. Those could also be hidden behind a modal/menu on the UI as most users won’t change the default. That doesn’t solve this issue though for when both are checked.
I am also wondering if it may be simpler to keep the APIs decoupled and have the UI do merging. This was added as a backward-compatible option to the API, so the `/archived-workflows/` API is still available, and `argo archive list` is also still available in the CLI. EDIT: This option was previously rejected in https://github.com/argoproj/argo-workflows/issues/10781#issuecomment-1489263668
I mentioned above that I feel a drop-down seems like clunkier UX than a separate page (the previous behavior) to me.
If this is the primary option we are considering, I would ask if I could take some time to take a look through in-depth myself (I was not previously involved with this change). I’ve built lists with complex sorting, filtering, and pagination with efficient implementations in the past. If needed, usually the worst-case scenario is that you have to send to the API a list of all IDs to exclude (e.g. duplicates).
Fetching all is pretty much never an option. There is always an amount at which it can overload a server and is also a big load spike.
Even with an Informer cache (which only handles in-cluster Workflows), it can cause spikiness when the Informer rebuilds the cache periodically as well. Spikiness is generally a behavior to avoid from an operations perspective (my day-to-day work these days is primarily SRE-focused), if possible. Informers are a bit more transparent than generic caches, but even then, it adds a layer of complexity (i.e. more potential for bugs, which we have quite a bit of, some due to Informer misconfigurations) to the Server that we should avoid if possible.
I think it’s worth noting that this balloons the memory requirements of each Argo Server Pod. If you have a standard, 3 replica HA set-up, your memory usage is going to go way up.
@sjhewitt thank you for filing this. I’m having the same issue. 30+ second page loads. I’m gathering data now and will post here once obtained.
@sunyeongchoi thanks for taking a look.
Can confirm this isn’t related to the “past month” default on the UI being removed from the search box like @agilgur5 states. That is only client side.
All of these scenarios are important to us. We are using argo-workflows for CI. The last view that @agilgur5 mentions is important if I need to reference test failures from a previous workflow run. I may filter by label and then by archived.
A checkbox on the left panel that is checked by default is the ideal UI change IMO. I have started a PR locally for it but don’t want to waste the time if the work is already done as part of #12397 /cc @sunyeongchoi
I’m thinking about UI design to show Workflows and Archived Workflows separately. I’m thinking about using a select box like the picture below. Do you have any other good ideas?
For the 3rd case query, where the problems arise, you could break the query this way:
- `Workflows` / k8s API / etcd
- `Archived` / backend database
And paginate by exhausting the ongoing workflows generator/list first, then the static workflows generator (at least against a PostgreSQL database, you could get a cursor/generator sorted by creation timestamp that skips N elements and returns a window of K elements).
This approach could work well if you assume that all static workflows are successfully written to `Archived`, which is an assumption that (I believe) holds in robust deployments.
@sunyeongchoi That might be a good optimization for the third option I listed in https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636. It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.
@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.
There are some optimizations we can do, e.g. https://github.com/argoproj/argo-workflows/issues/12030.
However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for `ListWorkflows()` requires retrieval of the entire list of workflows at once.
@jmeridth Thank you for your explanation! Yes I’ll apply archived checkbox UI in this PR. So you can stop the work.
@sunyeongchoi yes. Sorry. If you will do the archived checkbox work as part of this PR I can stop the work I started locally for it. I do not have a GitHub PR for it. Does this help?
This is a tad incorrect as well unfortunately. It is relatively close though, as the exceptions are somewhat edge cases / race conditions:
- A `Pending` Workflow can already be in the Archive DB before it is marked as `Archived` (that’s a race condition that exists).
- An `Archived` Workflow is not necessarily the same in the Archive DB and in the cluster.
- If an `Archived` Workflow is retried, the state in the cluster and in the DB will differ, at least temporarily. That’s why there is some front-end logic (see also #11906 and #11336) to prefer the in-cluster version if available.
I mentioned these in my previous comment as well.
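To illustrate the "prefer the in-cluster version" rule, here is a minimal Python sketch. It is not the actual front-end code: the function name `merge_prefer_in_cluster` and the dict fields `uid`, `created`, and `phase` are assumptions made for the example.

```python
def merge_prefer_in_cluster(live, archived):
    """Merge two lists of workflow dicts, keeping the in-cluster copy on UID clashes.

    Both inputs are lists of dicts with (assumed) keys "uid" and "created".
    """
    merged = {wf["uid"]: wf for wf in archived}
    # Live entries are applied last, so they overwrite archived copies with the same UID.
    merged.update({wf["uid"]: wf for wf in live})
    return sorted(merged.values(), key=lambda wf: wf["created"], reverse=True)


live = [{"uid": "a", "created": 3, "phase": "Running"}]
archived = [
    {"uid": "a", "created": 3, "phase": "Failed"},     # stale copy of a retried workflow
    {"uid": "b", "created": 1, "phase": "Succeeded"},
]
result = merge_prefer_in_cluster(live, archived)
```

Here the retried workflow `a` appears once, with the in-cluster `Running` state rather than the stale archived one.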
Yes please, if you are OK with it, recovering this option is very necessary, at least until the “merged view” feature scales well (bifurcating by the labels mentioned here or similar) and is stable.
If we really need to show two resources (`Archived`, `Workflows`) on one page, I think this problem can be solved by filtering through k8s’ label selector.
I think it would be a good idea to discuss again whether to separate these resources or show them together.
And personally, I think an archived workflow is an unmanaged item. It is stored in a database (a 3rd-party object), and from that point on, Argo Workflows is not responsible for data stored like this. (I actually can’t understand why Argo Workflows provides an API for archived workflows. It feels like Argo Workflows is offering an artificial storage (S3) lookup API.) 🤔
I think this way might be good. I will implement and then test it. thank you 😃
Of course! If you do, I would be very grateful.
I heard that `Archived` and `Workflows` were separate pages before and were merged recently, because users should not need to know the difference between the two types of workflows.
So I’m also considering a third option: showing `Archived` and `Workflows` together as much as possible.
Options 1 and 2 are the natural ones and match the <v3.5.0 architecture; those should definitely exist to avoid a feature regression (which currently happened in v3.5.0, where you lost the ability to query either etcd or your backend DB).
The best way to implement the 3rd option would skip deduplication and also the forbidden fetch-all.
There are two kinds of workflow status:
In a deployment where the archiving process is robust, there is no reason to query static-state workflows against etcd. Similarly, there are no ongoing workflows in the backend database.
Based on this statement, for option 3 you can just break the query into two disjoint sets:
When the UI needs to show a set that contains both kinds, exhaust the ongoing list first, then go to the static list.
The only loss of ergonomics for the user is the case where static workflows are younger than ongoing ones, as those will be placed without honoring creation timestamp order. But this situation is understandable and the overall balance is positive.
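The "exhaust the ongoing list first, then the static list" idea can be sketched roughly as follows. This is a Python illustration under stated assumptions: `list_live_then_archived` and `archive_query` are hypothetical names, the live list is assumed to be already sorted and fully available, and `archive_query(offset, limit)` stands in for a database query with ordering, offset, and limit.

```python
def list_live_then_archived(live, archive_query, offset, limit):
    """Return one page: ongoing (live) workflows first, then static (archived) ones.

    live: list of ongoing workflows, already sorted newest-first.
    archive_query: callable (offset, limit) -> list of archived workflows,
        hypothetically backed by a sorted, paginated database query.
    """
    page = live[offset:offset + limit]
    remaining = limit - len(page)
    if remaining > 0:
        # Translate the global offset into an offset within the archive.
        archive_offset = max(0, offset - len(live))
        page = page + archive_query(archive_offset, remaining)
    return page


live = ["l1", "l2", "l3"]
archive = ["a1", "a2", "a3"]
query = lambda off, lim: archive[off:off + lim]

first = list_live_then_archived(live, query, 0, 2)
second = list_live_then_archived(live, query, 2, 2)
third = list_live_then_archived(live, query, 4, 2)
```

Only the archive side is paginated at the source here. The live list is still needed in full to know where the archive begins, and no deduplication is attempted, so the sketch relies on the two sets really being disjoint.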
I think this is the best way.
However, instead of paginating `Archived` and `Workflows` separately in step 3, fetch all `Archived` and `Workflows` just once at first, combine them, and remove duplicates.
Then store that data in the cache. When going to the next page, we use the data in the cache.
It is not yet clear whether it is possible to store it in the cache, and I should also investigate the informer cache suggested above.
But if possible, I think this is the best way.
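A rough sketch of this fetch-once-and-cache idea, in Python and purely illustrative: `MergedListCache`, the `fetch_live` / `fetch_archived` callables, the dict fields, and the 30-second TTL are all assumptions, not anything taken from the Argo codebase.

```python
import time


class MergedListCache:
    """Fetches and merges both sources once, then serves pages from memory."""

    def __init__(self, fetch_live, fetch_archived, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch_live = fetch_live          # hypothetical: returns all live workflows
        self._fetch_archived = fetch_archived  # hypothetical: returns all archived workflows
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = None
        self._loaded_at = 0.0

    def _refresh_if_stale(self):
        now = self._clock()
        if self._items is None or now - self._loaded_at > self._ttl:
            # Deduplicate by UID; live entries overwrite archived ones.
            merged = {wf["uid"]: wf for wf in self._fetch_archived()}
            merged.update({wf["uid"]: wf for wf in self._fetch_live()})
            self._items = sorted(
                merged.values(), key=lambda wf: wf["created"], reverse=True
            )
            self._loaded_at = now

    def page(self, offset, limit):
        self._refresh_if_stale()
        return self._items[offset:offset + limit]
```

The expensive fetch and merge happen once per TTL window instead of once per page. The full merged list still has to fit in memory, and the initial load is the same fetch-all that other comments in this thread warn about, so this only moves the cost rather than removing it.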
Are there any other good opinions?
@sunyeongchoi Great! Let me know if you need any help.
Thank you so much to everyone for suggesting so many good ideas.
I will start by optimizing Archived Workflows first.
After that I will investigate the informer cache 😃
It is not only breaking the UI and making it unusable on v3.5.0, it is also crashing badly with OOM on a pod deployed with 3200 MB guaranteed, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows with working GC).
The issue is at the main view where all workflows are listed.
Also, for this pagination, it would probably be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve argo-server when listing workflows. This, together with the “show archived” flags that you are discussing, would help a lot.
ahh, I see - I didn’t have much knowledge of the k8s api, so didn’t realize it doesn’t really support filtering/ordering/pagination.
The 3 options I see are:
I’m similarly curious… I wonder if it would be possible to use a cursor that encodes 2 offsets - one for the k8s api and one for the db, then fetches `limit` rows from both sources with the given offset, merges the results together and applies the limit to that combined list. Something like:
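A rough sketch of such a two-offset cursor, written in Python for illustration and not the commenter's original snippet. The function names, the JSON/base64 cursor encoding, and the `fetch_k8s(offset, limit)` / `fetch_db(offset, limit)` callables are assumptions; in particular, it assumes both sources can return rows sorted newest-first from a given offset, which the k8s API does not offer by itself.

```python
import base64
import json


def encode_cursor(k8s_offset, db_offset):
    raw = json.dumps({"k8s": k8s_offset, "db": db_offset}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    # An empty or missing cursor means "start from the beginning of both sources".
    if not cursor:
        return 0, 0
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["k8s"], data["db"]


def list_page_with_cursor(fetch_k8s, fetch_db, cursor, limit):
    """Fetch up to `limit` rows from each source, merge, and keep the newest `limit`."""
    k8s_off, db_off = decode_cursor(cursor)
    tagged = [("k8s", row) for row in fetch_k8s(k8s_off, limit)]
    tagged += [("db", row) for row in fetch_db(db_off, limit)]
    tagged.sort(key=lambda item: item[1]["created"], reverse=True)
    page = tagged[:limit]
    # Advance each offset only by the number of rows actually shown from that source.
    k8s_used = sum(1 for source, _ in page if source == "k8s")
    db_used = len(page) - k8s_used
    next_cursor = encode_cursor(k8s_off + k8s_used, db_off + db_used)
    return [row for _, row in page], next_cursor
```

Each request over-fetches at most `limit` rows per source, so the cost per page stays bounded regardless of archive size. The sketch does not deduplicate workflows that exist in both sources, which is one of the edge cases raised elsewhere in this thread.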
What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).
Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.
I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.
For posterity, this was actually discussed yesterday at the Contributors Meeting. @jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: https://github.com/argoproj/argo-workflows/pull/11761#discussion_r1317888160.
(I was involved as I made substantial refactors to the UI for 3.5 – #11891 in particular – in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests. Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)
Hello. I will test the issue as soon as possible and think about a solution. thank you.
Thanks for this issue. This is a known issue if you have a lot of archived workflows. It’s caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in https://github.com/argoproj/argo-workflows/pull/11761.