argo-workflows: ListWorkflows causes server to hang when there are lots of archived workflows

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We had >200,000 rows in the workflow archive table, and when we tried to view the new combined workflow/archived workflow list page in the UI, the server timed out.

Scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.
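To illustrate the pattern (a stand-alone Go sketch with invented types, not the actual argo-server code): both sources get fully materialized in memory before the page is taken.

package sketch

import (
	"sort"
	"time"
)

// item stands in for a workflow summary coming from either the archive table or k8s.
type item struct {
	Name    string
	Created time.Time
}

// listEverythingThenPaginate mirrors the problematic pattern: load everything from both
// sources, merge, sort, and only then take the requested page.
func listEverythingThenPaginate(archived, live []item, offset, limit int) []item {
	all := append(append([]item{}, archived...), live...) // O(total) memory per request
	sort.Slice(all, func(i, j int) bool { return all[i].Created.After(all[j].Created) })
	if offset >= len(all) {
		return nil
	}
	end := offset + limit
	if end > len(all) {
		end = len(all)
	}
	return all[offset:end] // a handful of results returned; the rest was wasted work
}

With >200,000 archived rows, that merge-and-sort runs on every page load, which would explain the timeouts.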

As a workaround, we've reduced the archive TTL from 14 days to 1 day, and the endpoint now responds before timing out, but it is still pretty slow.

Version

v3.5.0

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Reactions: 21
  • Comments: 50 (40 by maintainers)

Most upvoted comments

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

I propose that we add a dropdown to allow users to select whether to display:

  1. Only live workflows;
  2. Only archived workflows;
  3. Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

Additional requirements:

  1. This dropdown is only available if there are both live and archived workflows; the UI should be smart enough to figure this out.
  2. We should also consider the ability for admins to disable the third option, to avoid degrading cluster performance.
  3. The “archived” column should only be displayed when appropriate. For example, it’s evident that a workflow is archived if the user is only viewing archived workflows.

Motivations for this proposal:

  1. Some users only care about one of these types of workflows;
  2. Since there are performance issues that we cannot get around when viewing both types of workflows, we should only offer this option with caution;
  3. Keep using the original pagination implementation for live-only or archived-only views, where the logic is much more precise, while keeping the front-end codebase simple;
  4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

Any thoughts?

I am reverting related change https://github.com/argoproj/argo-workflows/pull/12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636 here. We can probably release a patch version next week https://github.com/argoproj/argo-workflows/issues/11997#issuecomment-1775681491.

I think the underlying cause is the attempt to show normal workflows and archived workflows together, and I don't agree with showing them together. Even in the API the two kinds of workflows are separate, so it feels rather confusing to show them together in the UI.

With the unified workflow list API for both live + archived workflows we have to do the following:

  1. we need all workflows in the cluster because the Kubernetes API server does not support sorting by creation timestamp, but argo-server does.
  2. we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from the archive is nonsensical (a sketch follows this list).
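As a sketch of what point 2 could look like on the archive side (the table and column names follow the archive schema but should be treated as illustrative, and the Postgres driver choice is an assumption):

package sketch

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // assuming a PostgreSQL archive; the driver is illustrative
)

// fetchArchivedPage pushes filtering, ordering, and the page window down to the database
// instead of loading the whole archive table into memory.
func fetchArchivedPage(ctx context.Context, db *sql.DB, namespace string, limit, offset int) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		SELECT uid, name, namespace, phase, startedat, finishedat
		  FROM argo_archived_workflows
		 WHERE namespace = $1
		 ORDER BY startedat DESC
		 LIMIT $2 OFFSET $3`,
		namespace, limit, offset)
}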

Number 1 is a challenge because performing a LIST of all workflows for every request will be taxing on the K8s API server as users use the UI. Listing all workflows for the purpose of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:

Since listing all workflows is both the requirement and the problem, we can use an informer cache in the Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by the API server from the in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.
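A minimal sketch of what that could look like using client-go's dynamic informer machinery (the wiring, package name, and function names are assumptions for illustration, not the actual argo-server code):

package sketch

import (
	"sort"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

var workflowGVR = schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}

// newWorkflowLister starts a shared informer for Workflows so that list requests are
// served from the in-memory cache instead of hitting the K8s API server every time.
func newWorkflowLister(cfg *rest.Config, namespace string, stopCh <-chan struct{}) (cache.GenericLister, error) {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 10*time.Minute, namespace, nil)
	informer := factory.ForResource(workflowGVR)
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, informer.Informer().HasSynced)
	return informer.Lister(), nil
}

// listLiveSortedByCreation is what the unified handler would call instead of a LIST
// against the API server: read from the cache, then filter/merge/sort in memory.
func listLiveSortedByCreation(lister cache.GenericLister) ([]*unstructured.Unstructured, error) {
	objs, err := lister.List(labels.Everything())
	if err != nil {
		return nil, err
	}
	wfs := make([]*unstructured.Unstructured, 0, len(objs))
	for _, o := range objs {
		wfs = append(wfs, o.(*unstructured.Unstructured))
	}
	sort.Slice(wfs, func(i, j int) bool {
		return wfs[i].GetCreationTimestamp().Time.After(wfs[j].GetCreationTimestamp().Time)
	})
	return wfs, nil
}

The handler for the unified list would then merge this in-memory result with the page fetched from the archive DB before returning it to the caller.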

Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.

If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.
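For completeness, a sketch of that optimization; the orderedList interface is hypothetical and only exists for illustration:

package sketch

import "k8s.io/client-go/tools/cache"

// orderedList is a hypothetical container kept sorted by creationTimestamp;
// only the operations the handlers need are shown.
type orderedList interface {
	insert(obj interface{})
	update(oldObj, newObj interface{})
	remove(obj interface{})
}

// registerOrderedListHandlers keeps the ordered list up to date from informer events
// instead of re-sorting the full List() result on every request.
func registerOrderedListHandlers(informer cache.SharedIndexInformer, list orderedList) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { list.insert(obj) },
		UpdateFunc: func(oldObj, newObj interface{}) { list.update(oldObj, newObj) },
		DeleteFunc: func(obj interface{}) { list.remove(obj) },
	})
}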

I mentioned above that I think checkboxes in the left sidebar would make more sense than a select box. I also mentioned that we could hide them as "Extra Filters" in a modal as well.

Checkboxes and label being:

WORKFLOW STORAGE

  • In-cluster
  • Archived

We could have a little info icon next to the "Archived" checkbox linking to the Workflow Archive docs as well. We could also explain via info icons that "In-cluster" means stored as a k8s resource in etcd and that "Archived" means stored as a row in an Archive DB.

at least until the "merged view" feature scales well (e.g., by splitting on the labels mentioned here or similar) and is stable.

The plan is to make them scale properly; this issue is still very much classified as a bug. We also discussed this in the most recent Contributor Meeting, and a solid use case did pop up there for filtering out Archived Workflows: CI, wherein new commits cancel existing Workflows, which are then immediately archived; quick sequential commits can cause a lot of clutter. Similar fast-updating use cases fit that too. I'm less sure of a use case for the converse: showing only Archived Workflows and no In-cluster Workflows.

And currently the PR only has a page that shows Workflows and Archived Workflows combined.

Would it be a good idea to add a filter or page that only shows Workflows and Archived Workflows separately?

Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

I’m checking the logic, but regarding the archived object’s label being Persisted, is it possible to solve the duplicate issue by filtering the workflows?

Of course! If you do, I would be very grateful.

I didn't want to intrude initially; I wanted to make sure you had time to work with it. But if we are now considering what are effectively total reverts, then I think it would be wise to have more contributors take a look before going down that path.

This approach could be nice if you take the assumption that all static workflows are successfully written to the Archive, which is an assumption that (I believe) holds in robust deployments.

This is unfortunately an overly optimistic assumption that doesn't hold up in all cases. It is mostly correct as a simplifying assumption, but not entirely. For example, even in "robust" deployments, it is very common to keep completed Workflows in k8s/etcd for some period of time before GC'ing them. This is especially the case for failed or errored Workflows, where someone may want to manually investigate the Pods, etc. (similar to a dead letter queue). There are also race conditions while Workflows are in the process of being Archived: both copies can exist simultaneously during that time. Similarly, there is the case of retrying an Archived Workflow: retries are a mutable operation (compared to resubmit, which creates a new Workflow), so the Workflow is no longer "static" in that case.

There is actually specific code to deal with this edge case of having a Workflow in both the Archive and in k8s/etcd, c.f. #11336, #11371, etc.

And then of course there is also the case of not having configured a Workflow Archive at all.

Options 1 and 2 are the natural, pre-v3.5.0 architecture ones; those should definitely exist to avoid a feature regression (which has happened in v3.5.0, where you lost the ability to query either etcd or your backend DB).

I’m not so sure that this would be considered a “feature regression”. The API and UI were intentionally unified in #11121. As per the original issue #10781, an Argo user (not necessarily an operator) generally does not know about or care about the difference between the two.

Having optional checkboxes (not a dropdown) could make sense as those are then just “filters”. Those could also be hidden behind a modal/menu on the UI as most users won’t change the default. That doesn’t solve this issue though for when both are checked.

I am also wondering if it may be simpler to keep the APIs decoupled and have the UI do merging. This was added as a backward-compatible option to the API, so the /archived-workflows/ API is still available, and argo archive list is also still available in the CLI. EDIT: This option was previously rejected in https://github.com/argoproj/argo-workflows/issues/10781#issuecomment-1489263668

I think this is the best way.

I mentioned above that a drop-down seems like clunkier UX to me than a separate page (the previous behavior).

If this is the primary option we are considering, I would ask if I could take some time to take a look through in-depth myself (I was not previously involved with this change). I’ve built lists with complex sorting, filtering, and pagination with efficient implementations in the past. If needed, usually the worst-case scenario is that you have to send to the API a list of all IDs to exclude (e.g. duplicates).

However, instead of paginating Archived Workflows and Workflows separately in step 3, fetch all Archived Workflows and Workflows just once up front, combine them, and remove duplicates.

Fetching all is pretty much never an option. There is always a volume at which it can overload the server, and it also causes a big load spike.

Even with an Informer cache (which only handles in-cluster Workflows), it can cause spikiness when the Informer periodically rebuilds its cache. Spikiness is generally a behavior to avoid from an operations perspective (my day-to-day work these days is primarily SRE-focused), if possible. Informers are a bit more transparent than generic caches, but even then, they add a layer of complexity to the Server (i.e. more potential for bugs, of which we have quite a few, some due to Informer misconfigurations) that we should avoid if possible.

Note that this does balloon the memory requirements of argo-server because of the informer cache.

I think it’s worth noting that this balloons the memory requirements of each Argo Server Pod. If you have a standard, 3 replica HA set-up, your memory usage is going to go way up.

@sjhewitt thank you for filing this. I’m having the same issue. 30+ second page loads. I’m gathering data now and will post here once obtained.

@sunyeongchoi thanks for taking a look.

Can confirm this isn’t related to the “past month” default on the UI being removed from the search box like @agilgur5 states. That is only client side.

CI, wherein new commits cancel existing Workflows, which are then immediately archived; quick sequential commits can cause a lot of clutter. Similar fast-updating use cases fit that too. I'm less sure of a use case for the converse: showing only Archived Workflows and no In-cluster Workflows.

All of these scenarios are important to us. We are using argo-workflows for CI. The last view that @agilgur5 mentions is important if I need to reference test failures from a previous workflow run. I may filter by label and then by archived.

A checkbox on the left panel that is checked by default is the ideal UI change IMO. I have started a PR locally for it but don’t want to waste the time if the work is already done as part of #12397 /cc @sunyeongchoi

Would it be a good idea to add a filter or page that only shows Workflows and Archived Workflows separately?

I'm thinking about a UI design that shows Workflows and Archived Workflows separately, using a select box like the picture below. Do you have any other good ideas?

[image: mockup of the proposed select box]

I propose that we add a dropdown to allow users to select whether to display:

Only live workflows; Only archived workflows; Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

For the 3rd case, where the problems arise:

Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

You could break the query this way:

  • ongoing workflows -> Workflows / k8s API / etcd
  • static workflows -> Archived / backend database

And paginate by exhausting the ongoing workflows generator/list first, then the static workflows generator (at least against a PostgreSQL database you can certainly get a cursor/generator sorted by creation timestamp that skips N elements and returns a window of K elements).

This approach could be nice if you take the assumption that all static workflows are successfully written to the Archive, which is an assumption that (I believe) holds in robust deployments.

Suggestion: if 20 items are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when fetching the Workflows data for the next page.

@sunyeongchoi That might be a good optimization for the third option I listed in https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636. It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.

Also, regarding this pagination, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve the argo-server workflow listing. This, together with the "show archived" flags you are discussing, would help a lot.

@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.

There are some optimizations we can do, e.g. https://github.com/argoproj/argo-workflows/issues/12030.

However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for ListWorkflows() requires retrieval of the entire list of workflows at once.

@jmeridth Thank you for your explanation! Yes, I'll apply the archived checkbox UI in this PR, so you can stop your work.

@sunyeongchoi yes, sorry. If you do the archived checkbox work as part of this PR, I can stop the work I started locally for it. I do not have a GitHub PR for it. Does this help?

I think this problem can be solved by filtering through k8s’ label selector.

This is a tad incorrect as well unfortunately. It is relatively close though, as the exceptions are somewhat edge cases / race conditions:

  • A Pending Workflow can already be in the Archive DB before it is marked as Archived (that’s a race condition that exists).
  • An Archived Workflow is not necessarily the same in the Archive DB and in the cluster.
    • For example, if you retry an Archived Workflow, the state in the cluster and in the DB will differ, at least temporarily. That’s why there is some front-end logic (see also #11906 and #11336) to prefer the in-cluster version if available.

I mentioned these in my previous comment as well

Would it be a good idea to add a filter or page that only shows Workflows and Archived Workflows separately?

Yes please, if you are OK with it; recovering this option is very necessary, at least until the "merged view" feature scales well (e.g., by splitting on the labels mentioned here or similar) and is stable.

If we really need to show the two resources (Archived, Workflows) on one page, I think this problem can be solved by filtering through k8s' label selector.

	// LabelKeyWorkflowArchivingStatus indicates if a workflow needs archiving or not:
	// * `` - does not need archiving ... yet
	// * `Pending` - pending archiving
	// * `Archived` - has been archived and has live manifest
	// * `Persisted` - has been archived and retrieved from db
	// See also `LabelKeyCompleted`.
	LabelKeyWorkflowArchivingStatus = workflow.WorkflowFullName + "/workflow-archiving-status"
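To make that concrete, a hedged sketch of how the label could be used when listing from k8s; the helper and its name are invented for illustration, and the literal key simply spells out the constant quoted above:

package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// archivingStatusLabel spells out LabelKeyWorkflowArchivingStatus quoted above
// (workflow.WorkflowFullName is "workflows.argoproj.io").
const archivingStatusLabel = "workflows.argoproj.io/workflow-archiving-status"

// liveAlreadyArchivedOptions lists only live Workflows that have already been written to
// the archive; their UIDs could then be used to drop the duplicate rows coming from the DB.
func liveAlreadyArchivedOptions() metav1.ListOptions {
	return metav1.ListOptions{LabelSelector: archivingStatusLabel + "=Archived"}
}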

I think it would be a good idea to discuss again whether to separate these resources or show them together.

And personally, I think an archived workflow is an unmanaged item. It is stored in a database (a third-party object), and from that point on Argo Workflows is not responsible for the data stored there. (I actually can't understand why Argo Workflows provides an API for archived workflows; it feels like Argo Workflows offering a lookup API for external storage (S3).) 🤔

  • ongoing workflows -> Workflows / k8s API / etcd
  • static workflows -> Archived / backend database

I think this approach might be good. I will implement it and then test it. Thank you 😃

If this is the primary option we are considering, I would ask if I could take some time to take a look through in-depth myself

Of course! If you do, I would be very grateful.

I heard that Archived Workflows and Workflows were separate pages before and were merged recently, because users should not need to know the difference between the two types of workflows.

So I'm also considering the third option, showing Archived Workflows and Workflows together as much as possible.

Options 1 and 2 are the natural, pre-v3.5.0 architecture ones; those should definitely exist to avoid a feature regression (which has happened in v3.5.0, where you lost the ability to query either etcd or your backend DB).

The best approach for the 3rd option would skip deduplication and also avoid the forbidden fetch-all.

There are two kinds of workflow status:

  • ongoing: pending and running
  • static: successful, failed, error

In a deployment where the archiving process is robust, there is no reason to query static-state workflows against etcd. Similarly, there are no ongoing workflows in the backend database.

Based on this, for option 3 you can just break the query into two disjoint sets:

  • ongoing workflows -> etcd
  • static workflows -> backend database

When the UI needs to show a set that contains both kinds, exhaust the ongoing list first, then go to the static list.

The only loss of ergonomics for the user is the case where static workflows are younger than ongoing ones, as those will be placed without honoring creation-timestamp order. But this situation is understandable and the overall balance is positive.
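A rough Go sketch of that two-phase pagination, under the stated assumption that the two sets are disjoint; the page-fetcher signatures and names are invented for illustration:

package sketch

import "context"

// item stands in for a workflow summary from either source.
type item struct {
	Name     string
	Archived bool
}

// pageFunc is an assumed page fetcher; in practice this would be the k8s/etcd list on one
// side and an ORDER BY ... LIMIT/OFFSET query against the archive DB on the other.
type pageFunc func(ctx context.Context, offset, limit int) ([]item, error)

// listUnified exhausts the ongoing (etcd) source first and only then paginates the archive,
// so neither source is ever loaded in full. liveTotal is the number of ongoing workflows,
// which is cheap to know if the live side is served from an informer cache.
func listUnified(ctx context.Context, live, archived pageFunc, liveTotal, offset, limit int) ([]item, error) {
	var out []item
	if offset < liveTotal {
		items, err := live(ctx, offset, limit)
		if err != nil {
			return nil, err
		}
		out = append(out, items...)
	}
	if remaining := limit - len(out); remaining > 0 {
		archiveOffset := offset - liveTotal
		if archiveOffset < 0 {
			archiveOffset = 0
		}
		items, err := archived(ctx, archiveOffset, remaining)
		if err != nil {
			return nil, err
		}
		out = append(out, items...)
	}
	return out, nil
}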

I propose that we add a dropdown to allow users to select whether to display:

  1. Only live workflows;
  2. Only archived workflows;
  3. Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

I think this is the best way.

However, instead of paginating Archived Workflows and Workflows separately in step 3, fetch all Archived Workflows and Workflows just once up front, combine them, and remove duplicates.

And store that data in a cache; when going to the next page, we use the data in the cache.

It is not yet clear whether it is possible to store it in a cache, and we should also investigate the informer cache suggested above.

But if possible, I think this is the best way.

Are there any other opinions or ideas?

@sunyeongchoi Great! Let me know if you need any help.

Thank you so much to everyone who suggested good ideas.

First, I will start by optimizing Archived Workflows.

we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from the archive is nonsensical.

After that I will investigate the informer cache 😃

When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

It is not only breaking the UI and making it unusable on v3.5.0; it is also crashing badly with OOM on a pod deployed with 3200 MB guaranteed, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows with working GC).

The issue is in the main view where all workflows are listed.

Also, regarding this pagination, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve the argo-server workflow listing. This, together with the "show archived" flags you are discussing, would help a lot.

Ahh, I see. I didn't have much knowledge of the k8s API, so I didn't realize it doesn't really support filtering/ordering/pagination.

The 3 options I see are:

  • separate the k8s backed api/ui from the db backed api/ui (reverting to previous behaviour)
  • fetch the whole k8s data and merge it with a subset of the db data
  • persist workflows in the db from the moment they are submitted, updating their state as they are scheduled/change state, then (if the db archive is enabled) make the UI solely reliant on querying from the db. In this case, the data could be augmented with data from the k8s API if the workflows are still running…

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

I'm similarly curious… I wonder if it would be possible to use a cursor that encodes two offsets (one for the k8s API and one for the DB), then fetch limit rows from both sources at the given offsets, merge the results together, and apply the limit to that combined list.

something like:

orderBy = ...
filters = ...
limit = 20
cursor = (0, 0)  // (k8s offset, db offset)
k8sResults = fetchK8s(cursor[0], limit, filters, orderBy)
dbResults = fetchDB(cursor[1], limit, filters, orderBy)
results = mergeResults(k8sResults, dbResults).slice(0, limit)

// the new offsets point just past the last k8s/db item that made it into this page
newK8sOffset = getLastK8sResult(results)
newDBOffset = getLastDbResult(results)
newCursor = (newK8sOffset, newDBOffset)

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.

4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.

For posterity, this was actually discussed yesterday at the Contributors Meeting. @jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: https://github.com/argoproj/argo-workflows/pull/11761#discussion_r1317888160.

(I was involved as I made substantial refactors to the UI for 3.5 – #11891 in particular – in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests. Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)

Hello. I will test the issue as soon as possible and think about a solution. Thank you.

Thanks for this issue. This is a known issue if you have a lot of archived workflows. It’s caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in https://github.com/argoproj/argo-workflows/pull/11761.