druid: Kafka Ingestion Peon Tasks Success But Overlord Shows Failure

Apologies if this breaks any rules, but I tried the Druid forums without much success, so I'm posting here to reach a different audience. Relevant information is below, with more details in the Druid forum post.

  • Druid Version: 0.22.1, 0.23.0
  • Kafka Ingestion (idempotent producer) - HOURLY
  • Overlord type: remote

https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374

In general, when we run all our tasks, we start seeing issues between the Overlord and the MMs/Peons. Often the Peon reports that the task was successful, but the Overlord believes it failed and tries to shut it down. The Overlord also becomes sluggish and takes a while to recognize completed tasks and tasks that are trying to start, which seems to point at a communication/coordination failure between the Overlord and the MMs/Peons. We even see TaskAssignment timeouts between the Overlord and MMs occur (with the timeout raised to PT10M; the default is PT5M).
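
One way to see the mismatch directly is to query the status APIs on both sides, roughly like this (the hostnames, the default Overlord port 8090, the Peon port, and TASK_ID are placeholders; the chat-handler path applies to streaming tasks):

  # Ask the Overlord what it thinks the task status is
  curl -s "http://OVERLORD_HOST:8090/druid/indexer/v1/task/TASK_ID/status"

  # Ask the Peon directly via its chat handler (streaming ingestion tasks only)
  curl -s "http://MM_HOST:PEON_PORT/druid/worker/v1/chat/TASK_ID/status"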

The only thing that seems to help is reducing the number of concurrently running tasks by suspending certain supervisors, which also points to the three Druid services involved struggling to handle the load of our current ingestion. Yet according to system metrics, resource usage is not hitting any limits and there is still compute headroom. It's odd, since many users are likely ingesting more data per hour than we are, and we don't see this type of issue in their discussions/white papers.
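
For reference, suspending and later resuming a supervisor is just a pair of Overlord API calls, roughly as follows (host, port, and the supervisor ID are placeholders):

  # Shed load by suspending a supervisor, then resume it once things recover
  curl -s -X POST "http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/SUPERVISOR_ID/suspend"
  curl -s -X POST "http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/SUPERVISOR_ID/resume"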

Any help will definitely be appreciated.


Most upvoted comments

Thank you @pchang388. This is exactly the information I was looking for. I don't know who maintains that dashboard, but it would be nice to augment this base dashboard with more metrics such as task/action/runtime. I'm not sure how to do that, though.

You should also run sjk for only ~5 minutes when the Peon is not responding; loading a flame graph with an hour of data is very slow.
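
Something along these lines should be enough; the stcap and flame subcommands are from sjk, but treat the exact flag spellings as approximate and confirm them with "sjk <command> --help" (the PID and file names are placeholders):

  # Sample the Peon JVM for ~5 minutes, then render the capture as a flame graph
  java -jar sjk.jar stcap -p PEON_PID -o peon.std -t 300s
  java -jar sjk.jar flame -f peon.std -o peon-flame.html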

Yes, let's use the standard 0.23.0 build. By the way, we would need the flame graphs from the Overlord and from the tasks that are not responding to the pause request.

Hi, I am still working on this, but I do have roughly an hour of data each from the Overlord, a Peon, and an MM as flame graphs.

Just to note, the Peons and MMs are using the custom build we made (with the section commented out). My plan tomorrow is to get more data using the standard 0.23.0 build, since the change did not seem to fix the issue, and that also ensures everyone is looking at the same code. If this is not preferred, please let me know and I will keep the custom build running.

I am attaching the 3 flame graphs to this comment (in a single zip file). In this specific example, it appears the task did complete, although I'm not 100% sure since I was working on other things at the same time and lost track of the taskId. As stated, I will try to get more flame graphs with the standard 0.23.0 build and also try to capture a Peon's graph during a failed task. Also note that the Peon stack dump generation had an error at the end:

Unexpected error: java.lang.reflect.UndeclaredThrowableException

The dump still has data, but I believe the error occurred because the process shut down before the sjk capture duration elapsed. If this makes the dump unusable, I can definitely capture more, but hopefully there is some data here that points at a potential issue.

flame_graphs_1.zip

Thank you @pchang388 for the detailed analysis. The best way to troubleshoot these degradations is to look at the flame graphs of the Overlord and Peon, so I'm looking forward to that information. I skimmed through the information that is already available: there is no major degradation in the system or JVM metrics, so it could be that the Overlord simply became slower due to an application issue. I have seen Overlord HTTP threads slow down because of metadata operations; some of the threads busy with metadata operations hold a lock that blocks other threads, which brings the whole service to a crawl. We can confirm this by looking at the flame graphs. If the metadata store is indeed the cause of the slowness, then scaling it would certainly help.
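
If you want a quick sanity check before the flame graphs are ready, a plain thread dump will also show whether the Overlord's Jetty HTTP threads (the qtp* pool) are piling up in the BLOCKED state and which lock they are waiting on (the PID and file names are placeholders):

  # Thread-dump the Overlord and look for BLOCKED threads and the monitors they wait on
  jstack OVERLORD_PID > overlord-threads.txt
  grep -c "java.lang.Thread.State: BLOCKED" overlord-threads.txt
  grep -B 3 -A 20 "java.lang.Thread.State: BLOCKED" overlord-threads.txt | less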

Since the Peon seems unable to pause in a reasonable timeframe, and at times is unresponsive/hung, I took a look at some of the metrics for the common actions it performs during a task lifecycle.

According to the docs:

An indexing task starts running and building a new segment. It must determine the identifier of the segment before it starts building it. For a task that is appending (like a Kafka task, or an index task in append mode) this is done by calling an “allocate” API on the Overlord to potentially add a new partition to an existing set of segments. For a task that is overwriting (like a Hadoop task, or an index task not in append mode) this is done by locking an interval and creating a new version number and new set of segments.

When the indexing task has finished reading data for the segment, it pushes it to deep storage and then publishes it by writing a record into the metadata store.

So during the READING phase (we did see a few tasks fail because they didn't pause while in the READING phase; I gave an example of this in an earlier comment), the task communicates with the Overlord via the allocate API to add a new partition to an existing set of segments.

And during the PUBLISH phase it pushes to the object store and writes to the metadata DB, so here is a look at the general state of both:

  1. SQL read/write/update performance in our metadata DB (we are using YugabyteDB, a distributed/HA Postgres, running in Kubernetes due to VM capacity constraints on our side). There is not much data yet since I only recently enabled Prometheus scraping for it (metric screenshots attached).
  • The select and delete latencies appear to be quite high. Depending on the application, select statements should usually return fairly quickly (< 1 second), especially for user-visible processes, and ours seem quite slow, but I'm unsure how much of an effect this has on tasks. Write performance seems to be okay. (A quick way to check the relevant metadata tables directly is sketched after this list.)
  2. Object storage pushes and persists by the Peon. Some of the larger objects appear to take longer than expected, especially with multipart upload, though I'm unsure whether Druid uses that:
Segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_14] of 274,471,389 bytes built from 27 incremental persist(s) in 42,830ms; pushed to deep storage in 47,408ms
Segment[REDACT_2022-07-14T17:00:00.000Z_2022-07-14T18:00:00.000Z_2022-07-14T17:47:19.425Z_32] of 42,782,021 bytes built from 12 incremental persist(s) in 4,177ms; pushed to deep storage in 5,958ms
Segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_4] of 224,815,291 bytes built from 22 incremental persist(s) in 33,123ms; pushed to deep storage in 40,514ms
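
As mentioned above, one thing worth ruling out on the metadata side is bloat in the tables the allocate and publish paths touch; a large druid_pendingsegments table is often cited as a contributor to slow segment allocation. This is just a sketch assuming the default druid_ table prefix and a Postgres-compatible client; the connection details are placeholders:

  # Row counts for the metadata tables hit by segment allocation and publishing
  psql -h METADATA_HOST -U druid -d druid \
    -c "SELECT COUNT(*) FROM druid_pendingsegments;" \
    -c "SELECT COUNT(*) FROM druid_segments WHERE used = true;"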

I hope this background information provides more detail on our setup/configuration and makes it easier to spot a potential issue/bottleneck (the Overlord being the likely suspect).

I really do appreciate the help, @abhishekagarwal87 and @AmatyaAvadhanula. My next step is to get the flame graphs for the Peons to see what the threads are doing. Please let me know if you have any further suggestions, things to try, or additional information I should provide.

@pchang388, thanks for the response

The task may not be in the “STARTING” phase at the time of the request, as this state is not assigned once the task starts “READING”.

I think so because:

  1. The DEBUG log line “['request'] starting” is something that the NettyHttpClient seems to be emitting.

  2. Also, the /pause request checks whether the task is in the PAUSED or READING state and returns a 400 if it is not.
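
For reference, the same pause can be issued by hand against the Peon's chat handler to see which status code comes back; a sketch with placeholder host, port, and task ID:

  # Issue the pause manually; a 400 here means the task was not in READING/PAUSED state
  curl -s -o /dev/null -w "%{http_code}\n" -X POST \
    "http://MM_HOST:PEON_PORT/druid/worker/v1/chat/TASK_ID/pause"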