azure-cosmos-dotnet-v3: CosmosDb client throwing timeout Exception
We are using “Microsoft.Azure.Cosmos” Version=“3.23.0” in Net Core application. After migrating to the newer client SDK, we are seeing intermittent failures from Cosmos. To Create CosmosClient and container we are using default values for the time out and not specifying any custom value to override it. here is the stack trace of Exception
`
<html> <body>The operation was canceled.CosmosDiagnostics: {“Summary”:{“DirectCalls”:{“(200, 0)”:1,“(204, 0)”:20},“RegionsContacted”:1},“name”:“ReplaceItemAsync”,“id”:“fcafa2a7-6c50-4de5-86e4-a89ea4e0d0aa”,“caller info”:{“member”:“OperationHelperWithRootTraceAsync”,“file”:“ClientContextCore.cs”,“line”:244},“start time”:“01:03:27:626”,“duration in milliseconds”:1907.6462,“data”:{“Client Configuration”:{“Client Created Time Utc”:“2022-02-02T03:37:50.3122530Z”,“NumberOfClientsCreated”:1,“User Agent”:“cosmos-netstandard-sdk/3.23.0|3.23.1|1|X64|Linux 5.4.0-1064-azure 67 18.|.NET Core 3.1.22|L|”,“ConnectionConfig”:{“gw”:“(cps:50, urto:10, p:False, httpf: False)”,“rntbd”:“(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)”,“other”:“(ed:False, be:False)”},“ConsistencyConfig”:“(consistency: NotSet, prgns:[westus])”},“Operation Cancelled Exception”:“System.OperationCanceledException: The operation was canceled.\n at System.Threading.CancellationToken.ThrowOperationCanceledException()\n at Microsoft.Azure.Documents.TimeoutHelper.ThrowGoneIfElapsed()\n at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicaAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)\n at Microsoft.Azure.Documents.ConsistencyWriter.WaitForWriteBarrierAsync(DocumentServiceRequest barrierRequest, Int64 selectedGlobalCommittedLsn)\n at Microsoft.Azure.Documents.ConsistencyWriter.WritePrivateAsync(DocumentServiceRequest request, TimeoutHelper timeout, Boolean forceRefresh)\n at Microsoft.Azure.Documents.BackoffRetryUtility1.ExecuteRetryAsync(Func 1 callbackMethod, Func3 callShouldRetry, Func 1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action1 preRetryCallback)\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\n at Microsoft.Azure.Documents.BackoffRetryUtility 1.ExecuteRetryAsync(Func1 callbackMethod, Func 3 callShouldRetry, Func1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action 1 preRetryCallback)\n at Microsoft.Azure.Documents.ConsistencyWriter.WriteAsync(DocumentServiceRequest entity, TimeoutHelper timeout, Boolean forceRefresh, CancellationToken cancellationToken)\n at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass31_0.<<InvokeAsync>b__0>d.MoveNext()\n— End of stack trace from previous location where exception was thrown —\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func1 executeAsync, Func 1 prepareRequest, IRequestRetryPolicy2 policy, CancellationToken cancellationToken, Func 1 inBackoffAlternateCallbackMethod, Nullable1 minBackoffForInBackoffCallback)\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func 1 executeAsync, Func1 prepareRequest, IRequestRetryPolicy 2 policy, CancellationToken cancellationToken, Func1 inBackoffAlternateCallbackMethod, Nullable 1 minBackoffForInBackoffCallback)\n at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func2 prepareRequestAsyncDelegate)\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func 1 callbackMethod, Func3 callShouldRetry, Func 3 callShouldRetryException, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Team.Services.Common.CosmosDb.Metrics.CosmosDbMetricsRequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action1 requestEnricher, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable 1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ExtractPartitionKeyAndProcessItemStreamAsync[T](Nullable1 partitionKey, String itemId, T item, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ReplaceItemAsync[T](T item, String id, ITrace trace, Nullable 1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task)”},“children”:[{“name”:“ItemSerialize”,“id”:“f1df4548-aca2-4c4c-b3d1-628be7929927”,“caller info”:{“member”:“ExtractPartitionKeyAndProcessItemStreamAsync”,“file”:“ContainerCore.Items.cs”,“line”:941},“start time”:“01:03:27:626”,“duration in milliseconds”:0.037},{“name”:“Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler”,“id”:“297adf96-e2e9-469b-8377-b9112b372d58”,“start time”:“01:03:27:626”,“duration in milliseconds”:1907.3794,“children”:[{“name”:“Team.Services.Common.CosmosDb.Metrics.CosmosDbMetricsRequestHandler”,“id”:“cde494c7-1337-410b-ab0f-d1d134bff1d0”,“start time”:“01:03:27:626”,“duration in milliseconds”:1907.2991,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler”,“id”:“fa46111b-3207-40f4-83f2-8df7f6c51000”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.8139,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.RetryHandler”,“id”:“ea078910-bd49-43d2-a2e5-bcf1f75eaa20”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.7529,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.RouterHandler”,“id”:“b97c43e4-7dd0-425b-ad9b-063d22c3f2e9”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.6648,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.TransportHandler”,“id”:“9eb400c9-7db3-42f0-a6ca-04b8ea09d665”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.6244,“children”:[{“name”:“Microsoft.Azure.Documents.ServerStoreModel Transport Request”,“id”:“d0dd59bc-daf8-4277-859d-a006f2c8b3fa”,“caller info”:{“member”:“ProcessMessageAsync”,“file”:“TransportHandler.cs”,“line”:109},“start time”:“01:03:27:626”,“duration in milliseconds”:1906.4729,“data”:{“Client Side Request Stats”:{“Id”:“AggregatedClientSideRequestStatistics”,“ContactedReplicas”:[{“Count”:1,<Removed for security>],“FailedReplicas”:[],“AddressResolutionStatistics”:[],“StoreResponseStatistics”:[{“ResponseTimeUTC”:“2022-02-03T01:03:27.6953683Z”,“ResourceType”:“Document”,“OperationType”:“Replace”,“LocationEndpoint”:“endpoint/”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“Ok”,“SubStatusCode”:“Unknown”,“LSN”:9692316,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:-1,“UsingLocalLSN”:false,“QuorumAckedLSN”:9692315,“SessionToken”:“0#9692316#7=-1”,“CurrentWriteQuorum”:3,“CurrentReplicaSetSize”:4,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:“rntbd://:14068/apps/ba19ffe5-e50d-4d28-9b5d-c069d0e15f9a/services/f707bbef-ed4b-4815-82f7-ccd9e8b7f212/partitions/b691ef48-742a-4c3d-b1c5-1117beeb0a6d/replicas/132876734229942716p/”,“RequestCharge”:16,“RetryAfterInMs”:null,“BELatencyInMs”:“5.287”,“transportRequestTimeline”:{“requestTimeline”:[{“event”: “Created”, “startTimeUtc”: “2022-02-03T01:03:27.6264893Z”, “durationInMs”: 0.0027},{“event”: “ChannelAcquisitionStarted”, “startTimeUtc”: “2022-02-03T01:03:27.6264920Z”, “durationInMs”: 0.009},{“event”: “Pipelined”, “startTimeUtc”: “2022-02-03T01:03:27.6265010Z”, “durationInMs”: 0.122},{“event”: “Transit Time”, “startTimeUtc”: “2022-02-03T01:03:27.6266230Z”, “durationInMs”: 68.6575},{“event”: “Received”, “startTimeUtc”: “2022-02-03T01:03:27.6952805Z”, “durationInMs”: 0.0698},{“event”: “Completed”, “startTimeUtc”: “2022-02-03T01:03:27.6953503Z”, “durationInMs”: 0}],“requestSizeInBytes”:2902,“requestBodySizeInBytes”:851,“responseMetadataSizeInBytes”:500},“TransportException”:null}},{“ResponseTimeUTC”:“2022-02-03T01:03:27.7595265Z”,“ResourceType”:“Collection”,“OperationType”:“Head”,“LocationEndpoint”:“https://”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“NoContent”,“SubStatusCode”:“Unknown”,“LSN”:9692310,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:8971340,“UsingLocalLSN”:false,“QuorumAckedLSN”:-1,“SessionToken”:“0#9692310#7=-1”,“CurrentWriteQuorum”:-1,“CurrentReplicaSetSize”:-1,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:“rntbd://:/apps//services/f707bbef-ed4b-4815-82f7-ccd9e8b7f212/partitions/b691ef48-742a-4c3d-b1c5-1117beeb0a6d/replicas/132860494913145228s/”,“RequestCharge”:0,“RetryAfterInMs”:null,“BELatencyInMs”:“0.451”,“transportRequestTimeline”:{“requestTimeline”:[{“event”: “Created”, “startTimeUtc”: “2022-02-03T01:03:27.6956262Z”, “durationInMs”: 0.0027},{“event”: “ChannelAcquisitionStarted”, “startTimeUtc”: “2022-02-03T01:03:27.6956289Z”, “durationInMs”: 0.0026},{“event”: “Pipelined”, “startTimeUtc”: “2022-02-03T01:03:27.6956315Z”, “durationInMs”: 0.1074},{“event”: “Transit Time”, “startTimeUtc”: “2022-02-03T01:03:27.6957389Z”, “durationInMs”: 63.7201},{“event”: “Received”, “startTimeUtc”: “2022-02-03T01:03:27.7594590Z”, “durationInMs”: 0.0621},{“event”: “Completed”, “startTimeUtc”: “2022-02-03T01:03:27.7595211Z”, “durationInMs”: 0}],“requestSizeInBytes”:1881,“responseMetadataSizeInBytes”:298},“TransportException”:null}},{“ResponseTimeUTC”:“2022-02-03T01:03:27.8362170Z”,“ResourceType”:“Collection”,“OperationType”:“Head”,“LocationEndpoint”:“/”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“NoContent”,“SubStatusCode”:“Unknown”,“LSN”:9692316,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:8971340,“UsingLocalLSN”:false,“QuorumAckedLSN”:-1,“SessionToken”:“0#9692316#7=-1”,“CurrentWriteQuorum”:-1,“CurrentReplicaSetSize”:-1,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:"rntbd://c/apps/ba |
---|
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (9 by maintainers)
The CancellationToken is stopping the quorum verification for Strong/Bounded Staleness, that means that quorum, for some reason, is not being achieved within those 2 seconds. The 20 Collection HEAD requests mean that the SDK checked 20 times for quorum until the token cancelled. It would be good to investigate why is it taking that long for quorum. Increasing the token time would certainly help with mitigation.
First step would be to increase the CancellationToken time and see if that impacts. Afterwards, if the issue persists, filing a support ticket should help the service team investigate what could be the reason. For the service ticket it would be ideal to have the full exception if possible (no truncation).