amazon-kinesis-client: Lots of error messages like `Cancelling subscription, and marking self as failed` with `Invalid StartingSequenceNumber` since upgrading to 2.0
java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java:155) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java:121) ~[sdk-core-2.0.1.jar!/:na]
at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[na:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$complete$0(MakeAsyncHttpRequestStage.java:200) ~[sdk-core-2.0.1.jar!/:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [na:na]
at java.base/java.lang.Thread.run(Thread.java:844) [na:na]
Caused by: software.amazon.awssdk.services.kinesis.model.InvalidArgumentException: Invalid StartingSequenceNumber. It encodes shardId-000000000234, while it was used in a call to a shard with shardId-000000000309 (Service: Kinesis, Status Code: 400, Request ID: eab0b990-9136-1aad-beb8-145829c423c5)
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:104) ~[kinesis-2.0.1.jar!/:na]
at software.amazon.awssdk.services.kinesis.model.InvalidArgumentException$BuilderImpl.build(InvalidArgumentException.java:64) ~[kinesis-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:46) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.response.SdkErrorResponseHandler.handle(SdkErrorResponseHandler.java:30) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.async.SyncResponseHandlerAdapter.complete(SyncResponseHandlerAdapter.java:92) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.client.handler.BaseAsyncClientHandler$InterceptorCallingHttpResponseHandler.complete(BaseAsyncClientHandler.java:225) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.handleResponse(MakeAsyncHttpRequestStage.java:185) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:171) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$ResponseHandler.complete(MakeAsyncHttpRequestStage.java:122) ~[sdk-core-2.0.1.jar!/:na]
at software.amazon.awssdk.http.nio.netty.internal.ResponseHandler$PublisherAdapter$1.onComplete(ResponseHandler.java:255) ~[netty-nio-client-2.0.1.jar!/:na]
at com.typesafe.netty.HandlerPublisher.publishMessage(HandlerPublisher.java:362) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.flushBuffer(HandlerPublisher.java:304) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.receivedDemand(HandlerPublisher.java:258) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher.access$200(HandlerPublisher.java:41) ~[netty-reactive-streams-2.0.0.jar!/:na]
at com.typesafe.netty.HandlerPublisher$ChannelSubscription$1.run(HandlerPublisher.java:452) ~[netty-reactive-streams-2.0.0.jar!/:na]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) ~[netty-transport-4.1.23.Final.jar!/:4.1.23.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) ~[netty-common-4.1.23.Final.jar!/:4.1.23.Final]
... 1 common frames omitted
In the last three hours, I’ve seen these errors/warnings with the following combinations of encoded shard IDs and the shard IDs they were reportedly used with
(all shard IDs have the format shardId-000000000301, so I’m omitting the common prefix):
| Encoded Shard ID | Used-with Shard ID | Count |
|---|---|---|
| 301 | 349 | 2507 |
| 190 | 205 | 6385 |
| 222 | 196 | 2589 |
| 302 | 238 | 2582 |
| 270 | 258 | 2503 |
| 236 | 198 | 2512 |
| 253 | 229 | 356 |
| 266 | 342 | 7336 |
| 234 | 309 | 4710 |
| 260 | 229 | 2634 |
| 333 | 278 | 2519 |
| 308 | 318 | 2669 |
My stream has not been resharded recently and has 80 shards.
Question: Is this something that happens normally as part of the KCL client behavior, or does this indicate something I should be concerned about? (And if it is considered normal behavior, can the logs be changed to something other than warn level, without the full stack trace being logged?)
If this is something I should be worried about - what might be causing this?
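As a stopgap while the upstream logging question is open, the noisy logger can be silenced in your own logging configuration. A hypothetical logback fragment is below; the logger name is assumed from the `fanout.FanOutRecordsPublisher` package visible in the stack traces on this thread, so verify it against your own log output before relying on it:

```xml
<!-- Hypothetical logback fragment: demote the noisy fan-out warnings
     while keeping ERROR-level problems visible. The logger name is
     assumed from the stack traces in this issue; verify it locally. -->
<logger name="software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher" level="ERROR"/>
```

Note this hides the symptom, not the cause, so it is only appropriate once the warnings are confirmed benign.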
About this issue
- State: closed
- Created 6 years ago
- Comments: 110 (51 by maintainers)
Commits related to this issue
- Partially reverting commit 25b5331ca49d96a80e000fb3248c909b80bbdb29 which introduced a bug described in https://github.com/awslabs/amazon-kinesis-client/issues/391 — committed to aws/aws-sdk-java-v2 by shorea 6 years ago
- Revert "Partially reverting commit 25b5331ca49d96a80e000fb3248c909b80bbdb29 which introduced a bug described in https://github.com/awslabs/amazon-kinesis-client/issues/391" This reverts commit 1b2458... — committed to aws/aws-sdk-java-v2 by shorea 6 years ago
I’m working on a change that will validate each sequence number in the response from the FanOutRecordsProducer. I’m just starting testing, and will be looking to do a SNAPSHOT release soon. I want to get it out today or tonight to allow for some testing.
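The validation idea can be sketched in plain Java. To be clear, this is a hypothetical illustration, not the actual KCL change: the real fix presumably checks the shard ID that Kinesis encodes inside each sequence number (which is not a public API), whereas this sketch only exploits the documented property that sequence numbers within a single shard subscription are strictly increasing, so a non-increasing number is a strong hint of a misdelivered record:

```java
import java.math.BigInteger;

/**
 * Hypothetical per-shard sanity check illustrating the kind of validation
 * described above. Within one shard subscription, Kinesis sequence numbers
 * are strictly increasing, so a record whose sequence number is not greater
 * than the last one seen likely belongs to a different shard's response.
 */
class SequenceNumberValidator {
    private final String shardId;       // shard this subscription is pinned to
    private BigInteger lastSeen = null; // last sequence number accepted

    SequenceNumberValidator(String shardId) {
        this.shardId = shardId;
    }

    /** Returns true if the sequence number is plausible for this shard. */
    public boolean accept(String sequenceNumber) {
        BigInteger current = new BigInteger(sequenceNumber);
        if (lastSeen != null && current.compareTo(lastSeen) <= 0) {
            // Out-of-order or duplicate: likely misdelivered; reject and
            // leave lastSeen untouched so later valid records still pass.
            return false;
        }
        lastSeen = current;
        return true;
    }

    public String shardId() {
        return shardId;
    }
}
```

A subscriber would run each incoming record through `accept` and cancel the subscription (rather than checkpoint) on a rejection.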
Hi @pfifer, I’m on `2.0.5` and I still get lots of those warnings in my tests. I have `4` shards in my stream and my application is running in `2` ECS instances benefiting from enhanced fan-out.
…and also:
and finally:
Hi folks, interestingly I stumbled onto this post myself, as I am also experiencing a similar issue to the one described above. I’m pulling down 2.0.5 of the KCL to test tonight. The issue occurs several hours into running.
Unlike what is described above, I use the enhanced fan-out of the KCL with only 1 shard.
Throughout its run, I get these errors sporadically:
18/11/15 23:26:38 WARN channel.DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. java.io.IOException: Connection reset by peer
But, about 8 hours into running, I get a series of errors like so:
18/11/16 08:51:12 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:46:40.848Z id: shardId-000000000018-117 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973)
software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: ec79819c-7d93-afea-b81e-6a1c96170973)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60)
18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:12.753Z id: shardId-000000000018-118 – software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f)
software.amazon.awssdk.services.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: f1dd016a-e432-c116-a5ba-edfa0fb6678f)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:100)
    at software.amazon.awssdk.services.kinesis.model.InternalFailureException$BuilderImpl.build(InternalFailureException.java:60)
18/11/16 08:51:13 WARN fanout.FanOutRecordsPublisher: shardId-000000000018: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-11-15T21:51:13.754Z id: shardId-000000000018-119 – CompletionException/software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
java.util.concurrent.CompletionException: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
Caused by: software.amazon.awssdk.services.kinesis.model.ResourceInUseException: Another active subscription exists for this consumer: ACCOUNT_ID:SHARD_NAME:shardId-000000000018:APPLICATION_NAME-01b34944-323d-4d1c-b61a-a30f0a3a79eb1 (Service: Kinesis, Status Code: 400, Request ID: e6bec218-1cab-1ab3-b2d9-2e89f72fbc2a)
    at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:104)
    at software.amazon.awssdk.services.kinesis.model.ResourceInUseException$BuilderImpl.build(ResourceInUseException.java:64)
(each entry repeated the identical exception and stack-trace fragment two more times; duplicates omitted)
Going to try 2.0.5 tonight as I need to plan for an outage.
My KCL configuration is rather simple; it just follows the KCL 1.0 to 2.0 migration guide:

```java
new Scheduler(
    configsBuilder.checkpointConfig(),
    configsBuilder.coordinatorConfig(),
    configsBuilder.leaseManagementConfig()
        .initialLeaseTableReadCapacity(5)
        .initialLeaseTableWriteCapacity(5)
        .cleanupLeasesUponShardCompletion(true)
        .failoverTimeMillis(60000),
    configsBuilder.lifecycleConfig(),
    configsBuilder.metricsConfig()
        .metricsLevel(MetricsLevel.NONE),
    configsBuilder.processorConfig()
        .callProcessRecordsEvenForEmptyRecordList(false),
    configsBuilder.retrievalConfig());
```
Any tips or tricks much appreciated.
Chris
Yes, I’m going to start looking at it right now.
Edit: Adjusted grammar, since apparently I’m still tired.
Just an update. We re-deployed our application yesterday morning with experiment-6. We have not yet tried to downgrade the AWS SDK to 2.0.0. It ran well with no stuck shards for about 12 hours. Then, it began failing on one shard (shardId-000000000031).
First, the mismatched records message appeared for shardId-000000000031 repeatedly for about a minute:
Then, the following three messages started appearing repeatedly until the application was restarted:
I checked the sequence number in the request (49587051802106862994389666143737629092499109650800575074), and it does belong to shardId-000000000038, not shardId-000000000031. I searched all our logs for that sequence number, but it does not appear anywhere else besides that set of three error messages. The good news was that it did not checkpoint an invalid sequence number. The sequence number stored in the lease table for shardId-000000000031 was valid. So, a restart of the application was all that was needed to resolve the issue.
One thing of note is that the application appeared to have restarted about three minutes before the issue started occurring. This is typical of what we have been seeing, as usually the error begins after code deploys.
I want to thank everyone for their patience, we are still investigating the problem trying to track down where the records might be getting misdelivered.
I’m running a variety of tests, but have still not been able to recreate the mismatched records. I did discover another issue affecting ReadTimeouts: specifically, I was seeing a large number of ReadTimeouts that would not recover until the JVM was restarted. I’m investigating that with the SDK team.
To provide an overview of what I’m investigating, here is the effective pipeline that records go through. It also includes the places where I think there may be a problem.
Read timeouts generally occur when the record processor takes more than 30 seconds to complete processing. The connection to Kinesis is terminated, but the internal subscription in the KCL is still live. There may also be records in the RxJava queue that will be delivered before the exception is delivered to the ShardConsumer. #403 should help quiet these errors down, as they are handled automatically.
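If a record processor can legitimately take a long time, one mitigation is to raise the Netty HTTP client's read timeout when constructing the `KinesisAsyncClient`. The following is a configuration sketch using the AWS SDK v2 builder API; the two-minute value is an arbitrary example, not a recommendation, and should be tuned to exceed your worst-case `processRecords` duration:

```java
import java.time.Duration;

import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient;

class KinesisClientFactory {
    /**
     * Builds a KinesisAsyncClient with a read timeout longer than the
     * 30-second default discussed above. Sketch only: the timeout value
     * is an assumption, not a recommended setting.
     */
    static KinesisAsyncClient create() {
        return KinesisAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .readTimeout(Duration.ofMinutes(2)))
                .build();
    }
}
```

A longer timeout only masks slow processing; the underlying fix in #403 (automatic handling of the resulting errors) is still needed for correctness.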