druid: Druid Router UI throwing 504 when there are too many tasks

Affected Version

0.19.1

Description

We have a use case where we submit 10k+ tasks per day. When loading the Router UI, the Tasks tile usually returns a 504 after a long wait. The same happens when we open the Ingestion tab.

(Screenshots: 504 responses when loading the Tasks view and the Ingestion tab.)

Is there a way to prevent the Router from loading all tasks at once and instead lazy-load them?

A workaround is to set druid.indexer.storage.recentlyFinishedThreshold to a lower value, but we were wondering whether there is a better way to do this.
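For reference, this property is an ISO-8601 period set in the Overlord's runtime.properties; the value below is purely illustrative, not a recommendation (the default is PT24H):

```
# Illustrative value only: shortens how long recently finished tasks
# remain visible to the /tasks endpoint (and hence the UI). Default: PT24H.
druid.indexer.storage.recentlyFinishedThreshold=PT1H
```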


Most upvoted comments

I have done some profiling on our stack here; my analysis follows. We are configured with HeapMemoryTaskStorage.

By issuing repeated SQL requests against the Broker (e.g. with ab), we can see the workload increase on the Overlord. A CPU profile of the Overlord host, focused on the /tasks endpoint, shows that over 50% of the CPU time is spent in HeapMemoryTaskStorage::getTasks, with only a small percentage in serialization.
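For reference, a sketch of the kind of load generation this refers to (the hostname, port, and query are illustrative placeholders; any SQL against sys.tasks should exercise the Overlord's task-listing path):

```
# Illustrative only: each request makes the Broker fetch the task list
# from the Overlord, driving load on the /tasks code path.
echo '{"query":"SELECT * FROM sys.tasks"}' > query.json
ab -n 1000 -c 8 -p query.json -T application/json http://broker-host:8082/druid/v2/sql/
```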

Notice specifically in the before/after profiles below the percentage of time (bar width) spent in getCompletedTaskInfo... (highlighted in a magenta-ish colour), and that the bulk of that time is in sortedCopy.

Before

(CPU flame graph: before the change.)

After the changes, getCompletedTaskInfo... is significantly reduced as a percentage of overall CPU time, so much so that serialization now takes far longer than the query itself.

(CPU flame graph: after the change.)
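To make the cost concrete, here is a minimal, hypothetical sketch of the pattern, not Druid's actual HeapMemoryTaskStorage code (the TaskInfo type and both methods are illustrative): re-sorting the full completed-task list on every request costs O(n log n) per call, whereas keeping tasks in an already-ordered structure makes reads a cheap bounded iteration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical, simplified model of the hot path; NOT Druid's actual code.
class TaskStoreSketch {
  record TaskInfo(String id, Instant createdTime) {}

  // Before: completed tasks live in an unordered map, so every /tasks request
  // copies and sorts the whole collection (the sortedCopy cost in the profile).
  final Map<String, TaskInfo> completed = new ConcurrentHashMap<>();

  List<TaskInfo> getCompletedTaskInfoSlow() {
    List<TaskInfo> copy = new ArrayList<>(completed.values());
    copy.sort(Comparator.comparing(TaskInfo::createdTime).reversed()); // O(n log n) per request
    return copy;
  }

  // After: keep completed tasks pre-sorted by creation time (newest first), so
  // a read is an ordered iteration plus a limit, with no per-request sort.
  // (A real implementation would need a tie-breaker for identical timestamps.)
  final ConcurrentSkipListMap<Instant, TaskInfo> completedByTime =
      new ConcurrentSkipListMap<>(Comparator.reverseOrder());

  List<TaskInfo> getCompletedTaskInfoFast(int limit) {
    return completedByTime.values().stream().limit(limit).toList();
  }
}
```

With 10k+ completed tasks and a UI that polls, the per-request sort dominates, which matches the sortedCopy hotspot in the profile above.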

Hey @jasonk000, we’re using Postgres on AWS RDS as our metadata store, i.e., druid.indexer.storage.type=metadata.