druid: Ingestion task fails with InterruptedException when handling the segments
Affected Version
0.20.0
Description
We’ve seen tasks failing randomly. Didn’t managed to get a pattern or cause, so I’m thinking might be some sort of bug. The tasks ends up with the status:
"statusCode": "FAILED",
"status": "FAILED",
"runnerStatusCode": "WAITING",
When happened, happened to three tasks that started around the same time (±10 min) Cluster size: 27 Dataservers (Middlemanager + Historical), 2 Brokers, 2 Routers, 2 Overlords, 2 Coordinator, 3 Zookeeper (in dedicated hosts).
Partial log:
2021-02-08T22:35:01,185 INFO [qtp1999784536-190] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping gracefully (status: [PUBLISHING])
2021-02-08T22:35:01,189 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:520) ~[?:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) ~[?:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1301) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:124) ~[?:?]
at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:98) ~[?:?]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:603) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.InterruptedException
... 17 more
2021-02-08T22:35:01,218 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_77] at path[/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,221 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[ssp-events-hourly_2021-02-08T19:00:00.000Z_2021-02-08T20:00:00.000Z_2021-02-08T19:58:53.766Z_427] at path[/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,221 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,621 ERROR [[index_kafka_ssp-events-hourly_5895d14b59e566a_fkffhhpn]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Incremental persist failed: {class=org.apache.druid.segment.realtime.appenderator.AppenderatorImpl, segment=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_77, dataSource=ssp-events-hourly, count=10}
2021-02-08T22:35:01,623 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/PEON/druid-dataserver-4:8103]
2021-02-08T22:35:01,632 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/middleManager","host":"druid-dataserver-4","bindOnHost":false,"plaintextPort":8103,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2021-02-08T22:35:01,632 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/druid-dataserver-4:8103]
2021-02-08T22:35:01,648 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:520) ~[?:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) ~[?:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1301) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:124) ~[?:?]
at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:98) ~[?:?]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:603) ~[druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Suppressed: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[?:1.8.0_275]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:1.8.0_275]
at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.persist(StreamAppenderatorDriver.java:231) ~[druid-server-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:756) ~[druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.InterruptedException
... 17 more
2021-02-08T22:35:01,680 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
"id" : "index_kafka_ssp-events-hourly_5895d14b59e566a_fkffhhpn",
"status" : "FAILED",
"duration" : 14743336,
"errorMsg" : "org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException\n\tat org.apache.kaf...",
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
2021-02-08T22:35:01,690 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2021-02-08T22:35:01,691 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 6
- Comments: 29 (3 by maintainers)
Strangely, the problem disappeared in our last version of our code. Don’t really see what have really changed. Just, one datasource disappeared, so we just kafka ingest in three datasources in parallel instead of 4. And datas send are a little different. And we now use druid 0.22.1, but really not sure it is linked, problem seems to have disappeared a little bit before that. Hard to be sure, as it happens randomly.
Same problem in
0.20.1
/ Similar cluster configurationFacing a similar issue, @tanisdlj indexing logs are determined by
druid.indexer.logs.type
anddruid.indexer.logs.directory
in your runtime properties. Mine was in/opt/data/indexing-logs
on the middle-manager