amazon-kinesis-client: Lots of error messages like `Cancelling subscription, and marking self as failed` with `Invalid StartingSequenceNumber` since upgrading to 2.0
java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java:155) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:121) ~[sdk-core-2.0.1.jar!/:na]
at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[na:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$complete$0(MakeAsyncHttpRequestStage.java:200) ~[sdk-core-2.0.1.jar!/:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [na:na]
at java.base/java.lang.Thread.run(Thread.java:844) [na:na]
Caused by: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:104) ~[kinesis-2.0.1.jar!/:na]
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:64) ~[kinesis-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:46) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:30) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.complete(SyncResponseHandlerAdapter.java:92) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.client.handler.BaseAsyncClientHandler$InterceptorCallingHttpResponseHandler.complete(BaseAsyncClientHandler.java:225) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.handleResponse(MakeAsyncHttpRequestStage.java:185) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:171) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:122) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:255) ~[netty-nio-client-2.0.1.jar!/:na]
at com.typesafe.netty.HandlerPublisher.publishMessage(HandlerPublisher.java:362) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.flushBuffer(HandlerPublisher.java:304) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.receivedDemand(HandlerPublisher.java:258) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.access$200(HandlerPublisher.java:41) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:452) ~[netty-reactive-streams-2.0.0.jar!/:na]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) ~[netty-transport-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
... 1 common frames omitted
In the last three hours, I’ve seen these errors/warnings with the following combinations of encoded shard IDs and the shard IDs they were reportedly used with
(all shard IDs have the format shardId-000000000301, so I’m omitting the common prefix):
| Encoded Shard ID | Used-with Shard ID | Count |
|---|---|---|
| 301 | 349 | 2507 |
| 190 | 205 | 6385 |
| 222 | 196 | 2589 |
| 302 | 238 | 2582 |
| 270 | 258 | 2503 |
| 236 | 198 | 2512 |
| 253 | 229 | 356 |
| 266 | 342 | 7336 |
| 234 | 309 | 4710 |
| 260 | 229 | 2634 |
| 333 | 278 | 2519 |
| 308 | 318 | 2669 |
My stream has not been resharded recently and has 80 shards.
Question: Is this something that happens normally as part of the KCL client behavior, or does this indicate something I should be concerned about? (And if it is considered normal behavior, can the logs be changed to something other than warn level, without the full stack trace being logged?)
If this is something I should be worried about - what might be causing this?
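As a stopgap while the upstream logging question is open, the noisy logger can be silenced in your own logging configuration. A hypothetical logback fragment is below; the logger name is assumed from the `fanout.FanOutRecordsPublisher` package visible in the stack traces on this thread, so verify it against your own log output before relying on it:

```xml
<!-- Hypothetical logback fragment: demote the noisy fan-out warnings
     while keeping ERROR-level problems visible. The logger name is
     assumed from the stack traces in this issue; verify it locally. -->
<logger name="software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher" level="ERROR"/>
```

Note this hides the symptom, not the cause, so it is only appropriate once the warnings are confirmed benign.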
About this issue
- State: closed
- Created 6 years ago
- Comments: 110 (51 by maintainers)
Commits related to this issue
- Partially reverting commit 25b5331ca49d96a80e000fb3248c909b80bbdb29 which introduced a bug described in https://github.com/awslabs/amazon-kinesis-client/issues/391 — committed to aws/aws-sdk-java-v2 by shorea 6 years ago
- Revert "Partially reverting commit 25b5331ca49d96a80e000fb3248c909b80bbdb29 which introduced a bug described in https://github.com/awslabs/amazon-kinesis-client/issues/391" This reverts commit 1b2458... — committed to aws/aws-sdk-java-v2 by shorea 6 years ago
I’m working on a change that will validate each sequence number in the response from the FanOutRecordsProducer. I’m just starting testing, and will be looking to do a SNAPSHOT release soon. I want to get it out today or tonight to allow for some testing.
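The validation idea can be sketched in plain Java. To be clear, this is a hypothetical illustration, not the actual KCL change: the real fix presumably checks the shard ID that Kinesis encodes inside each sequence number (which is not a public API), whereas this sketch only exploits the documented property that sequence numbers within a single shard subscription are strictly increasing, so a non-increasing number is a strong hint of a misdelivered record:

```java
import java.math.BigInteger;

/**
 * Hypothetical per-shard sanity check illustrating the kind of validation
 * described above. Within one shard subscription, Kinesis sequence numbers
 * are strictly increasing, so a record whose sequence number is not greater
 * than the last one seen likely belongs to a different shard's response.
 */
class SequenceNumberValidator {
    private final String shardId;       // shard this subscription is pinned to
    private BigInteger lastSeen = null; // last sequence number accepted

    SequenceNumberValidator(String shardId) {
        this.shardId = shardId;
    }

    /** Returns true if the sequence number is plausible for this shard. */
    public boolean accept(String sequenceNumber) {
        BigInteger current = new BigInteger(sequenceNumber);
        if (lastSeen != null && current.compareTo(lastSeen) <= 0) {
            // Out-of-order or duplicate: likely misdelivered; reject and
            // leave lastSeen untouched so later valid records still pass.
            return false;
        }
        lastSeen = current;
        return true;
    }

    public String shardId() {
        return shardId;
    }
}
```

A subscriber would run each incoming record through `accept` and cancel the subscription (rather than checkpoint) on a rejection.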
Hi @pfifer, I’m on `2.0.5` and I still get lots of those warnings in my tests. I have `4` shards in my stream and my application is running in `2` ECS instances benefiting from enhanced fan-out.
…and also:
and finally:
Hi folks, interestingly I stumbled onto this post myself, as I am also experiencing a similar issue to the one described above. I’m pulling down 2.0.5 of the KCL to test tonight. The issue occurs several hours into running.
Unlike what is described above, I use the enhanced fan-out of the KCL with only 1 shard.
Throughout its run, I get these errors sporadically:
18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Connection reset by peer
But, about 8 hours into running, I get a series of errors like so:
18/11/16 08:51:12 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:46:40.848Z id: shardId-000000000018-117 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973)
software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60)
18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:12.753Z id: shardId-000000000018-118 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f)
software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60)
18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:13.754Z id: shardId-000000000018-119 – CompletionException/software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
Caused by: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
    at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:104)
    at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:64)
(each entry repeated the identical exception and stack-trace fragment two more times; duplicates omitted)
Going to try 2.0.5 tonight as I need to plan for an outage.
My KCL configuration is rather simple; it just follows the KCL 1.0 to 2.0 migration guide:

```java
new Scheduler(
    configsBuilder.checkpointConfig(),
    configsBuilder.coordinatorConfig(),
    configsBuilder.leaseManagementConfig()
        .initialLeaseTableReadCapacity(5)
        .initialLeaseTableWriteCapacity(5)
        .cleanupLeasesUponShardCompletion(true)
        .failoverTimeMillis(60000),
    configsBuilder.lifecycleConfig(),
    configsBuilder.metricsConfig()
        .metricsLevel(MetricsLevel.NONE),
    configsBuilder.processorConfig()
        .callProcessRecordsEvenForEmptyRecordList(false),
    configsBuilder.retrievalConfig());
```
Any tips or tricks much appreciated.
Chris
Yes, I’m going to start looking at it right now.
Edit: Adjusted grammar, since apparently I’m still tired.
Just an update. We re-deployed our application yesterday morning with experiment-6. We have not yet tried to downgrade the AWS SDK to 2.0.0. It ran well with no stuck shards for about 12 hours. Then, it began failing on one shard (shardId-000000000031).
First, the mismatched records message appeared for shardId-000000000031 repeatedly for about a minute:
Then, the following three messages started appearing repeatedly until the application was restarted:
I checked the sequence number in the request (49587051802106862994389666143737629092499109650800575074), and it does belong to shardId-000000000038, not shardId-000000000031. I searched all our logs for that sequence number, but it does not appear anywhere else besides that set of three error messages. The good news was that it did not checkpoint an invalid sequence number. The sequence number stored in the lease table for shardId-000000000031 was valid. So, a restart of the application was all that was needed to resolve the issue.
One thing of note is that the application appeared to have restarted about three minutes before the issue started occurring. This is typical of what we have been seeing, as usually the error begins after code deploys.
I want to thank everyone for their patience, we are still investigating the problem trying to track down where the records might be getting misdelivered.
I’m running a variety of tests, but have still not been able to recreate the mismatched records. I did discover another issue affecting ReadTimeouts: specifically, I was seeing a large number of ReadTimeouts that would not recover until the JVM was restarted. I’m investigating that with the SDK team.
To provide an overview of what I’m investigating, here is the effective pipeline that records go through. It also includes the places where I think there may be a problem.
Read timeouts generally occur when the record processor takes more than 30 seconds to complete processing. The connection to Kinesis is terminated, but the internal subscription in the KCL is still live. There may also be records in the RxJava queue that will be delivered before the exception is delivered to the ShardConsumer. #403 should help quiet these errors down, as they are handled automatically.
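If a record processor can legitimately take a long time, one mitigation is to raise the Netty HTTP client's read timeout when constructing the `KinesisAsyncClient`. The following is a configuration sketch using the AWS SDK v2 builder API; the two-minute value is an arbitrary example, not a recommendation, and should be tuned to exceed your worst-case `processRecords` duration:

```java
import java.time.Duration;

import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;

class KinesisClientFactory {
    /**
     * Builds a KinesisAsyncClient with a read timeout longer than the
     * 30-second default discussed above. Sketch only: the timeout value
     * is an assumption, not a recommended setting.
     */
    static KinesisAsyncClient create() {
        return KinesisAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .readTimeout(Duration.ofMinutes(2)))
                .build();
    }
}
```

A longer timeout only masks slow processing; the underlying fix in #403 (automatic handling of the resulting errors) is still needed for correctness.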