amazon-kinesis-client: KCL 2.0 stops consuming from some shards
We are in the process of upgrading several consumers to KCL 2.0. They are attached to a somewhat large stream (thousands of shards) and have been running on KCL 1.x for a long time without issue.
Today we ran into the following exception which caused some shards not to be consumed:
shardId-000000000634: Last request was dispatched at 2018-11-09T21:28:14.619Z, but no response as of 2018-11-09T21:28:50.632Z (PT36.013S). Cancelling subscription, and restarting.
at software.amazon.kinesis.lifecycle.ShardConsumer.healthCheck
at software.amazon.kinesis.lifecycle.ShardConsumer.executeLifecycle
at software.amazon.kinesis.coordinator.Scheduler.runProcessLoop
at software.amazon.kinesis.coordinator.Scheduler.run
We’ve seen this exception before right after deploys, but it usually disappears within 30 minutes. Today it started happening during normal operation and caused some shards not to be consumed at all. The max statistic of the SubscribeToShardEvent.MillisBehindLatest metric just kept increasing.
We are running on the latest commit (f52f2559ed) of the master branch. Any idea what could be happening?
Edit: should probably also mention that we let it run like this for over 2 hours and it never recovered. We’ve had to revert everything back to the old KCL.
About this issue
- State: open
- Created 6 years ago
- Reactions: 5
- Comments: 19 (5 by maintainers)
We are using KCL 2.0.5 and seeing a lot of these messages too, for example:
The 35-second value seems to be hard-coded in ShardConsumer#MAX_TIME_BETWEEN_REQUEST_RESPONSE, so we don’t have control over it.
There are also warnings like this, but I don’t know if the two are related:
@ShibaBandit How long did the call to processRecords take? The warning you’re getting is telling you that the call to processRecords has taken more than the configured time, which in this case looks like 35 seconds. The "last time data arrived" warning happens to be emitted at the same time. If your processRecords was blocked for 5 to 10 minutes, then the KCL will need to wait for your record processor to finish before getting more data from Kinesis. This is the break you see in the metrics, and the read timeout.
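
For anyone following along, the arithmetic behind the health-check message in the original report can be reproduced with a small standalone sketch. It is not KCL code: the class ResponseGapCheck, its methods, and the MAX_GAP constant are hypothetical names, and the 35-second limit is only assumed from the value discussed in this thread. The timed helper is likewise just one possible way to measure how long a block of work, such as the body of processRecords, takes.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of the KCL. */
class ResponseGapCheck {
    // Assumption: 35 seconds, matching the value discussed in this thread.
    static final Duration MAX_GAP = Duration.ofSeconds(35);

    /** Time elapsed between the last dispatched request and the moment of the check. */
    static Duration gap(Instant lastRequest, Instant now) {
        return Duration.between(lastRequest, now);
    }

    /** True when the elapsed time is strictly greater than the assumed limit. */
    static boolean exceedsLimit(Instant lastRequest, Instant now) {
        return gap(lastRequest, now).compareTo(MAX_GAP) > 0;
    }

    /** Runs a unit of work on the calling thread and reports how long it took. */
    static Duration timed(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Timestamps copied from the exception message in the original report.
        Instant dispatched = Instant.parse("2018-11-09T21:28:14.619Z");
        Instant checked = Instant.parse("2018-11-09T21:28:50.632Z");
        System.out.println(gap(dispatched, checked));          // PT36.013S
        System.out.println(exceedsLimit(dispatched, checked)); // true
    }
}
```

Wrapping the body of a record processor in something like timed and logging the result would be one way to answer the question above about how long processRecords actually takes.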