amazon-kinesis-client: Lots of error messages like: ` Cancelling subscription, and marking self as failed` with `Invalid StartingSequenceNumber` since upgrading to 2.0

java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java:155) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:121) ~[sdk-core-2.0.1.jar!/:na]
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[na:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$complete$0(MakeAsyncHttpRequestStage.java:200) ~[sdk-core-2.0.1.jar!/:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [na:na]
	at java.base/java.lang.Thread.run(Thread.java:844) [na:na]
Caused by: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
	at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:104) ~[kinesis-2.0.1.jar!/:na]
	at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:64) ~[kinesis-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:46) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:30) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.complete(SyncResponseHandlerAdapter.java:92) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.client.handler.BaseAsyncClientHandler$InterceptorCallingHttpResponseHandler.complete(BaseAsyncClientHandler.java:225) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.handleResponse(MakeAsyncHttpRequestStage.java:185) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:171) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:122) ~[sdk-core-2.0.1.jar!/:na]
	at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:255) ~[netty-nio-client-2.0.1.jar!/:na]
	at com.typesafe.netty.HandlerPublisher.publishMessage(HandlerPublisher.java:362) ~[netty-reactive-streams-2.0.0.jar!/:na]
	at com.typesafe.netty.HandlerPublisher.flushBuffer(HandlerPublisher.java:304) ~[netty-reactive-streams-2.0.0.jar!/:na]
	at com.typesafe.netty.HandlerPublisher.receivedDemand(HandlerPublisher.java:258) ~[netty-reactive-streams-2.0.0.jar!/:na]
	at com.typesafe.netty.HandlerPublisher.access$200(HandlerPublisher.java:41) ~[netty-reactive-streams-2.0.0.jar!/:na]
	at com.typesafe.netty.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:452) ~[netty-reactive-streams-2.0.0.jar!/:na]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) ~[netty-transport-4.1.23.Final.jar!/:4.1.23.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
	... 1 common frames omitted

In the last three hours, I’ve had these errors / warnings with the following combinations of encoded shardId’s and the shard ID it claims it was used with:

(all shardId’s are with the format shardId-000000000301 - so I’m omitting the first part)

Encoded Shard ID Used-with ShardID count
301 349 2507
190 205 6385
222 196 2589
302 238 2582
270 258 2503
236 198 2512
253 229 356
266 342 7336
234 309 4710
260 229 2634
333 278 2519
308 318 2669

My stream has not been resharded recently and has 80 shards.

Question: Is this something that happens normally as part of the KCL client behavior, or does this indicate something I should be concerned about? (And If it is considered to be normal behavior, can the logs be changed to something other than warn-level and the full stacktrace not be logged?)

If this is something I should be worried about - what might be causing this?

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 110 (51 by maintainers)

Commits related to this issue

Most upvoted comments

I’m working on a change that will validate each sequence number in the response from the FanOutRecordsProducer. I’m just starting testing, and will be looking to do a SNAPSHOT release soon. I want to get it out today or tonight to allow for some testing.

Hi @pfifer , I’m on 2.0.5 and I still get lots of those warnings in my tests. I have 4 shards in my stream and my application is running in 2 ECS instances benefiting from enhanced fanout.

{
    "timestamp": "2018-11-16T21:15:11.016Z",
    "logger": "software.amazon.kinesis.lifecycle.ShardConsumer",
    "exceptionClass": "software.amazon.kinesis.retrieval.RetryableRetrievalException",
    "stackTrace": "software.amazon.kinesis.retrieval.RetryableRetrievalException: ReadTimeout
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.errorOccurred(FanOutRecordsPublisher.java: 142)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.access$700(FanOutRecordsPublisher.java: 51)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher$RecordFlow.exceptionOccurred(FanOutRecordsPublisher.java: 516)
	at software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient.lambda$subscribeToShard$1(DefaultKinesisAsyncClient.java: 2102)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java: 760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java: 736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 155)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java: 121)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java: 822)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java: 797)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$completeExceptionally$1(MakeAsyncHttpRequestStage.java: 208)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java: 1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 624)
	at java.lang.Thread.run(Thread.java: 748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java: 97)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 143)
	... 9 more
Caused by: io.netty.handler.timeout.ReadTimeoutException",
    "thread": "ShardRecordProcessor-0002",
    "exceptionMessage": "ReadTimeout",
    "message": "ShardConsumer: shardId-000000000003: onError().  Cancelling subscription, and marking self as failed.",
    "logLevel": "WARN"
}

and also

{
    "timestamp": "2018-11-16T21:19:03.243Z",
    "logger": "software.amazon.awssdk.http.nio.netty.internal.RunnableRequest",
    "exceptionClass": "io.netty.channel.ConnectTimeoutException",
    "stackTrace": "io.netty.channel.ConnectTimeoutException: connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	at java.lang.Thread.run(Thread.java:748)",
    "thread": "aws-java-sdk-NettyEventLoop-2-0",
    "exceptionMessage": "connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443",
    "message": "RunnableRequest: Failed to create connection to https://monitoring.us-east-1.amazonaws.com",
    "logLevel": "ERROR"
}

and finally:

{
	"timestamp": "2018-11-16T21:16:08.397Z",
	"logger": "software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher",
	"thread": "ShardRecordProcessor-0001",
	"message": "FanOutRecordsPublisher: shardId-000000000003: (FanOutRecordsPublisher/Subscription#request) - Rejected an attempt to request(6), because subscribers don't match.",
	"logLevel": "WARN"
}

Hi folks, interestingly I stumbled on to this post myself as I am also experiencing a similar issue as described above… Pulling down 2.0.5 of KCL to test tonight. The issue occurs several hours in to running.

Unlike what is/was described above, I use the Enhanced Fan out of KCL with only 1 shard…

Throughout its run, I get these errors sporadically:

18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Connection reset by peer

But, about 8 hours in to running I get a series of errors like so;

18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Connection reset by peer 18/11/16 08:51:12 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:46:40.848Z id: shardId-000000000018-117 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) 18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:12.753Z id: shardId-000000000018-118 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100) at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60) 18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:13.754Z id: shardId-000000000018-119 – CompletionException/software.amazon.awssdk.services.kinesis.model.ResourceInUseE xception: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a) java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (S ervice: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) Caused by: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 4 00, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a) at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:104) at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:64) java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (S ervice: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) Caused by: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 4 00, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a) at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:104) at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:64)

Going to try 2.0.5 tonight as I need to plan for an outage.

My KCL configuration is rather simple, it just follows the KCL 1.0 to 2.0 migration guide in that it has;

new Scheduler( configsBuilder.checkpointConfig(), configsBuilder.coordinatorConfig(), configsBuilder.leaseManagementConfig() .initialLeaseTableReadCapacity(5) .initialLeaseTableWriteCapacity(5) .cleanupLeasesUponShardCompletion(true) .failoverTimeMillis(60000), configsBuilder.lifecycleConfig(), configsBuilder.metricsConfig() .metricsLevel(MetricsLevel.NONE), configsBuilder.processorConfig() .callProcessRecordsEvenForEmptyRecordList(false), configsBuilder.retrievalConfig());

Any tips or tricks much appreciated.

Chris

Yes, I’m going to start looking at it right now.

Edit: Adjusted grammar, since apparently I’m still tired.

Just an update. We re-deployed our application yesterday morning with experiment-6. We have not yet tried to downgrade the AWS SDK to 2.0.0. It ran well with no stuck shards for about 12 hours. Then, it began failing on one shard (shardId-000000000031).

First, the mismatched records message appeared for shardId-000000000031 repeatedly for about a minute:

2018-09-27T00:26:09,761 [ERROR] software.amazon.kinesis.lifecycle.ShardConsumer:197 - [shardId-000000000031] Found mismatched records: (shardId-000000000038 -> 1)

Then, the following three messages started appearing repeatedly until the application was restarted:

2018-09-27T00:27:31,382 [WARN] software.amazon.kinesis.lifecycle.ShardConsumer:270 - shardId-000000000031: Failure occurred in retrieval. Restarting data requests
java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000038, while it was used in a call to a shard with shardId-000000000031 (Service: Kinesis, Status Code: 400, Request ID: d73a5a9e-cad8-ccb4-831c-7cacad8c5c53)
2018-09-27T00:27:31,383 [DEBUG] software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher:135 - shardId-000000000031: [SubscriptionLifetime]: (FanOutRecordsPublisher#subscribeToShard) @ 2018-09-27T00:27:31.383Z id: shardId-000000000031-49 -- Starting subscribe to shard
2018-09-27T00:27:31,383 [DEBUG] software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher:140 -shardId-000000000031: [SubscriptionLifetime]: (FanOutRecordsPublisher#subscribeToShard) @ 2018-09-27T00:27:31.383Z id: shardId-000000000031-49 -- Making StoS Request: SubscribeToShardRequest(ConsumerARN=arn:aws:kinesis:us-east-1:905389620212:stream/measnap-MeasurementSnapshotCaptorStreamV2/consumer/measnap-MeasurementSnapshotCaptorV2:1536187455, ShardId=shardId-000000000031, StartingPosition=StartingPosition(Type=AFTER_SEQUENCE_NUMBER, SequenceNumber=49587051802106862994389666143737629092499109650800575074))

I checked the sequence number in the request (49587051802106862994389666143737629092499109650800575074), and it does belong to shardId-000000000038, not shardId-000000000031. I searched all our logs for that sequence number, but it does not appear anywhere else besides that set of three error messages. The good news was that it did not checkpoint an invalid sequence number. The sequence number stored in the lease table for shardId-000000000031 was valid. So, a restart of the application was all that was needed to resolve the issue.

One thing of note is that the application appeared to have restarted about three minutes before the issue started occurring. This is typical of what we have been seeing, as usually the error begins after code deploys.

I want to thank everyone for their patience, we are still investigating the problem trying to track down where the records might be getting misdelivered.

I’m running a variety of tests, but have still not been able to recreate mismatched records. I did discover another issues affecting ReadTimeout’s. Specifically what I was seeing was a large number of ReadTimeout’s which would not recover until the JVM was restarted. I’m investigating with the SDK team on that issue.

To provide an overview of what I’m investigating here is the effective pipeline that records go in. IT also includes the possible places I think there maybe a problem.

                                +-------------+
                                |             |
                                |   Kinesis   |
                                |             |
                                +------+------+
                                       |
                                       +-------------------- Data from Kinesis gets mixed between two different shards delivering shard data to the wrong consumer
                                       |
                                      \|/                                                                                                                         
                                +------+------+
                                |             |
                                |   AWS SDK   |
                                |             |
                                +------+------+
                                       |
                                       +--------------------- The SDK receives the data on the right streams, but delivers it to the wrong consumers
                                       |
                                      \|/
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                |  Producer   |                                                                                                                             
                                |             |                                                                                                                             
                                +------+------+                                                                                                                             
                                       |                                                                                                                                    
                                       |                                                                                                                                    
                                       |                                                                                                                                    
                                      \|/                                                                                                                                   
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                |   RxJava    |                                                                                                                             
                                |             |                                                                                                                             
                                +------+------+                                                                                                                             
                                       |                                                                                                                                    
                                       +--------------------- RxJava intermixes two or more Producer/Subscriber connections delivering records for one shard to the incorrect Subscriber
                                       |
                                      \|/                                                                                                                                   
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                |  Consumer   |                                                                                                                             
                                |             |                                                                                                                             
                                +------+------+                                                                                                                             
                                       |                                                                                                                                    
                                       +--------------------- The ShardConumer gets a reference to an active FanOutRecordsProducer intead of a newly created one            
                                       |
                                      \|/                                                                                                                                   
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                |  Processor  |                                                                                                                             
                                |             |                                                                                                                             
                                +------+------+                                                                                                                             
                                       |                                                                                                                                    
                                       +--------------------- The RecordProcessor intermixes records in some way and checkpoints on a record from a different shard         
                                       |
                                      \|/                                                                                                                                   
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                | Checkpoint  |                                                                                                                             
                                |             |                                                                                                                             
                                +------+------+                                                                                                                             
                                       |                                                                                                                                    
                                       +--------------------- The Checkpointer updates the wrong lease                                                                      
                                       |
                                      \|/                                                                                                                                   
                                +------+------+                                                                                                                             
                                |             |                                                                                                                             
                                |  DynamoDB   |                                                                                                                             
                                |             |                                                                                                                             
                                +-------------+

Read timeout’s generally occur when the record processor takes more than 30 seconds to complete processing. The connection to Kinesis is terminated, but the internal subscription in the KCL is still live. There may also be records in the RxJava queue that will be delivered before the exception is delivered to the ShardConsumer. #403 should help quiet these errors down as they are handled automatically.