druid: Ingestion task fails with InterruptedException when handling the segments

Affected Version

0.20.0

Description

We’ve seen tasks failing randomly. Didn’t managed to get a pattern or cause, so I’m thinking might be some sort of bug. The tasks ends up with the status:

  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",

When happened, happened to three tasks that started around the same time (±10 min) Cluster size: 27 Dataservers (Middlemanager + Historical), 2 Brokers, 2 Routers, 2 Overlords, 2 Coordinator, 3 Zookeeper (in dedicated hosts).

Partial log:

2021-02-08T22:35:01,185 INFO [qtp1999784536-190] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Stopping gracefully (status: [PUBLISHING])
2021-02-08T22:35:01,189 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception in run() before persisting.
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:520) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1301) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:124) ~[?:?]
	at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:98) ~[?:?]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:603) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.InterruptedException
	... 17 more
2021-02-08T22:35:01,218 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_77] at path[/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,221 INFO [task-runner-0-priority-0] org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing segment[ssp-events-hourly_2021-02-08T19:00:00.000Z_2021-02-08T20:00:00.000Z_2021-02-08T19:58:53.766Z_427] at path[/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,221 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/segments/druid-dataserver-4:8103/druid-dataserver-4:8103_indexer-executor__default_tier_2021-02-08T18:29:28.553Z_8e48d33532c9490e9a7b6b372bf2cf0d0]
2021-02-08T22:35:01,621 ERROR [[index_kafka_ssp-events-hourly_5895d14b59e566a_fkffhhpn]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Incremental persist failed: {class=org.apache.druid.segment.realtime.appenderator.AppenderatorImpl, segment=ssp-events-hourly_2021-02-08T20:00:00.000Z_2021-02-08T21:00:00.000Z_2021-02-08T21:14:50.337Z_77, dataSource=ssp-events-hourly, count=10}
2021-02-08T22:35:01,623 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/internal-discovery/PEON/druid-dataserver-4:8103]
2021-02-08T22:35:01,632 INFO [task-runner-0-priority-0] org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self [{"druidNode":{"service":"druid/middleManager","host":"druid-dataserver-4","bindOnHost":false,"plaintextPort":8103,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2021-02-08T22:35:01,632 INFO [task-runner-0-priority-0] org.apache.druid.curator.announcement.Announcer - Unannouncing [/druid/announcements/druid-dataserver-4:8103]
2021-02-08T22:35:01,648 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - Encountered exception while running task.
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:520) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:281) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1301) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.poll(KafkaRecordSupplier.java:124) ~[?:?]
	at org.apache.druid.indexing.kafka.IncrementalPublishingKafkaIndexTaskRunner.getRecords(IncrementalPublishingKafkaIndexTaskRunner.java:98) ~[?:?]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:603) ~[druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
	at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
	Suppressed: java.lang.InterruptedException
		at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[?:1.8.0_275]
		at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:1.8.0_275]
		at org.apache.druid.segment.realtime.appenderator.StreamAppenderatorDriver.persist(StreamAppenderatorDriver.java:231) ~[druid-server-0.20.0.jar:0.20.0]
		at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:756) ~[druid-indexing-service-0.20.0.jar:0.20.0]
		at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:267) [druid-indexing-service-0.20.0.jar:0.20.0]
		at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:145) [druid-indexing-service-0.20.0.jar:0.20.0]
		at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451) [druid-indexing-service-0.20.0.jar:0.20.0]
		at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423) [druid-indexing-service-0.20.0.jar:0.20.0]
		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_275]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.InterruptedException
	... 17 more
2021-02-08T22:35:01,680 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_kafka_ssp-events-hourly_5895d14b59e566a_fkffhhpn",
  "status" : "FAILED",
  "duration" : 14743336,
  "errorMsg" : "org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException\n\tat org.apache.kaf...",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}
2021-02-08T22:35:01,690 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [ANNOUNCEMENTS]
2021-02-08T22:35:01,691 INFO [main] org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle [module] stage [SERVER]

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 6
  • Comments: 29 (3 by maintainers)

Most upvoted comments

Strangely, the problem disappeared in our last version of our code. Don’t really see what have really changed. Just, one datasource disappeared, so we just kafka ingest in three datasources in parallel instead of 4. And datas send are a little different. And we now use druid 0.22.1, but really not sure it is linked, problem seems to have disappeared a little bit before that. Hard to be sure, as it happens randomly.

Same problem in 0.20.1 / Similar cluster configuration

Facing a similar issue, @tanisdlj indexing logs are determined by druid.indexer.logs.type and druid.indexer.logs.directory in your runtime properties. Mine was in /opt/data/indexing-logs on the middle-manager