amazon-kinesis-client: "Stuck" Kinesis Shards
@pfifer I’ve been using Kinesis as my real-time data streaming pipeline for over a year now, and I consistently run into an issue where Kinesis shards seem to randomly stop processing data. I am using the Node.js KCL library to consume the stream.
Here are some screenshots to show what I mean.

You can see here that from roughly 21:00 to 6:00, shards 98 and 100 stop emitting data for DataBytesProcessed. I restarted the application at 6:00 and the consumer immediately started processing data again.
Now here is a graph of the IteratorAgeMilliseconds for the same time period.

The shards are still emitting IteratorAge from 21:00 to 6:00, and IteratorAgeMilliseconds stays at 0, so it seems the KCL was still running during that time but wasn’t consuming any more data from the stream. You can see that when I restarted the application at 6:00, the KCL realized it was very far behind and the IteratorAge instantly spiked to 32.0M. I can confirm that the KCL process was still running during that period and continuously logged the usual messages:
INFO: Current stream shard assignments: shardId-000000000098
Jun 28, 2017 5:37:36 PM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Sleeping ...
There are 2 other shards in the same stream and they continued to process data as normal. I have been working on debugging this on my end for several months now, but have come up empty. I am catching all exceptions thrown in my application, and I have set timeouts on any network requests or async I/O operations, so I don’t think the process is crashing or getting hung up on anything.
I also thought that maybe I hadn’t allocated enough resources to the application and that was somehow causing the process to hang. So I created 4 instances of my application (1 for each shard), each with 224 GB of RAM and 32 CPUs, but I still run into this issue.
I cannot seem to reliably replicate this issue; it seems to just happen randomly across different shards and can range from happening a few times a week to a few times a day. FYI, I have several other Node applications using the KCL that also experience the same issue.
I have seen this issue, where the reporter seems to have the same problem with the Python library and solved it by restarting the application every 2 hours… and also this issue, which seems to be the same problem I have as well.
Is this a known bug with the KCL? And if the problem is on my end, do you have any pointers on how to track down this problem?
About this issue
- State: open
- Created 7 years ago
- Reactions: 29
- Comments: 34 (12 by maintainers)
Commits related to this issue
- Release 1.8.1 of the Amazon Kinesis Client Support timeouts for calls to the MultiLang Daemon This adds support for setting a timeout when dispatching records to the client record processor. If the r... — committed to pfifer/amazon-kinesis-client by pfifer 7 years ago
- Release 1.8.1 of the Amazon Kinesis Client (#198) Support timeouts for calls to the MultiLang Daemon This adds support for setting a timeout when dispatching records to the client record processor.... — committed to awslabs/amazon-kinesis-client by pfifer 7 years ago
I don’t believe this is a bug in the core Java KCL, but could be a bug in the MultiLang component.
You shouldn’t need to do that, since each record processor is single-threaded. If your record processor does its work on the main thread, adding more threading capacity will be wasted.
What you’re seeing is a missing response after the records were dispatched to the language component. The Java record processor waits for a response from the language component before returning from the processRecords call. The MultiLang Daemon implements the Java record processor interface and acts as a forwarder to the language-specific record processor. The dispatch behavior of the Java record processor is to make the processRecords call and pause activity until that call returns. The MultiLang Daemon implements the same behavior, but across process boundaries: acting as a forwarder, it must wait for a response from the native-language record processor, and if for some reason that response never comes, it will hang indefinitely.
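To make the failure mode concrete, here is a minimal sketch of that forwarding pattern. It is not the actual MultiLang Daemon code, and the class and field names are invented for illustration; the point is only that the Java side blocks on a reply from the child process, so a reply that never arrives means processRecords never returns:

```java
// Simplified, hypothetical sketch of the MultiLang forwarding pattern described
// above; not the actual MultiLang Daemon implementation.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;

public class ForwardingRecordProcessor {
    private final BufferedWriter childStdin;   // pipe to the native-language record processor
    private final BufferedReader childStdout;  // pipe back from the native-language record processor

    public ForwardingRecordProcessor(BufferedWriter childStdin, BufferedReader childStdout) {
        this.childStdin = childStdin;
        this.childStdout = childStdout;
    }

    public void processRecords(String recordsJson) throws IOException {
        // Dispatch the records to the child process...
        childStdin.write(recordsJson);
        childStdin.newLine();
        childStdin.flush();

        // ...then block until the child acknowledges the action. If the child never
        // writes a status line, readLine() blocks forever, processRecords() never
        // returns, and the shard looks "stuck" while the worker keeps logging
        // "Sleeping ...".
        String status = childStdout.readLine();
        if (status == null) {
            throw new IOException("Child process closed its output stream");
        }
    }
}
```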
There isn’t a good way to see what is actually causing the problem; it could be any of the three components involved: the Java KCL worker, the MultiLang Daemon forwarder, or the native-language record processor.
As a short-term fix, I think adding a configuration option that limits the maximum time the MultiLang Daemon will wait for a response could reduce the impact of these sorts of issues. This fix would terminate the record processor if a response isn’t received within the configured time.
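Here is a minimal sketch of what such a timeout could look like, assuming the blocking dispatch above is wrapped in a task handed to an executor. The names are hypothetical and this is not the code that shipped; it only illustrates bounding the wait:

```java
// Hypothetical sketch of bounding the wait for the child's reply; names and
// structure are illustrative only, not the actual 1.8.1 change.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeLimitedDispatcher {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final long timeoutSeconds;

    public TimeLimitedDispatcher(long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    /**
     * Runs the blocking dispatch (write the records, then wait for the child's
     * status reply) but gives up after the configured timeout instead of
     * hanging indefinitely.
     */
    public void dispatchWithTimeout(Runnable blockingDispatch) throws Exception {
        Future<?> result = executor.submit(blockingDispatch);
        try {
            result.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The native-language processor never answered: stop waiting and let
            // the worker tear down this record processor instead of leaving the
            // shard stuck.
            result.cancel(true);
            throw e;
        }
    }
}
```

The 1.8.1 release referenced in the commits above takes this general approach: it adds support for a timeout when dispatching records to the client record processor and terminates the record processor if no response arrives in time.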
A longer-term fix will require adding additional logging to understand the state of the components on both sides of the process boundary. This would help us detect which component is losing track of the current state.
For everyone else: please comment or add a reaction to help us gauge how many people are affected by this issue.
@pfifer any update on this? Even implementing some sort of short term fix would be extremely helpful.
@sahilpalvia We are using Java.
+1, Java KCL
I’m not sure if this is the same case, but when running the KCL sample project with a stream of 4 shards, the first shard shardId-00000000 is initialized, but processRecords is never called for that shard. In DynamoDB, the checkpoint column always stays at LATEST. Re-running the application doesn’t seem to help, and I get no errors. Is this related to what you are trying to fix? Thank you!
Is there a workaround in the meantime?
@pfifer With client v1.8.1 I’m having the issue below.
Any clue whether this is an issue on my end? I see that sometimes the checkpoint gets updated, and sometimes it throws the above error and delivers those messages back to the consumer again.
Appreciate your quick response.
@htarevern If you’re using the DynamoDB Streams plugin, your issue may be related to https://github.com/awslabs/dynamodb-streams-kinesis-adapter/issues/20