azure-cosmos-dotnet-v3: CosmosDb client throwing timeout Exception

We are using “Microsoft.Azure.Cosmos” Version=“3.23.0” in Net Core application. After migrating to the newer client SDK, we are seeing intermittent failures from Cosmos. To Create CosmosClient and container we are using default values for the time out and not specifying any custom value to override it. here is the stack trace of Exception

`

<html> <body>
The operation was canceled.CosmosDiagnostics: {“Summary”:{“DirectCalls”:{“(200, 0)”:1,“(204, 0)”:20},“RegionsContacted”:1},“name”:“ReplaceItemAsync”,“id”:“fcafa2a7-6c50-4de5-86e4-a89ea4e0d0aa”,“caller info”:{“member”:“OperationHelperWithRootTraceAsync”,“file”:“ClientContextCore.cs”,“line”:244},“start time”:“01:03:27:626”,“duration in milliseconds”:1907.6462,“data”:{“Client Configuration”:{“Client Created Time Utc”:“2022-02-02T03:37:50.3122530Z”,“NumberOfClientsCreated”:1,“User Agent”:“cosmos-netstandard-sdk/3.23.0|3.23.1|1|X64|Linux 5.4.0-1064-azure 67 18.|.NET Core 3.1.22|L|”,“ConnectionConfig”:{“gw”:“(cps:50, urto:10, p:False, httpf: False)”,“rntbd”:“(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)”,“other”:“(ed:False, be:False)”},“ConsistencyConfig”:“(consistency: NotSet, prgns:[westus])”},“Operation Cancelled Exception”:“System.OperationCanceledException: The operation was canceled.\n at System.Threading.CancellationToken.ThrowOperationCanceledException()\n at Microsoft.Azure.Documents.TimeoutHelper.ThrowGoneIfElapsed()\n at Microsoft.Azure.Documents.StoreReader.ReadMultipleReplicaAsync(DocumentServiceRequest entity, Boolean includePrimary, Int32 replicaCountToRead, Boolean requiresValidLsn, Boolean useSessionToken, ReadMode readMode, Boolean checkMinLSN, Boolean forceReadAll)\n at Microsoft.Azure.Documents.ConsistencyWriter.WaitForWriteBarrierAsync(DocumentServiceRequest barrierRequest, Int64 selectedGlobalCommittedLsn)\n at Microsoft.Azure.Documents.ConsistencyWriter.WritePrivateAsync(DocumentServiceRequest request, TimeoutHelper timeout, Boolean forceRefresh)\n at Microsoft.Azure.Documents.BackoffRetryUtility1.ExecuteRetryAsync(Func1 callbackMethod, Func3 callShouldRetry, Func1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action1 preRetryCallback)\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\n at Microsoft.Azure.Documents.BackoffRetryUtility1.ExecuteRetryAsync(Func1 callbackMethod, Func3 callShouldRetry, Func1 inBackoffAlternateCallbackMethod, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action1 preRetryCallback)\n at Microsoft.Azure.Documents.ConsistencyWriter.WriteAsync(DocumentServiceRequest entity, TimeoutHelper timeout, Boolean forceRefresh, CancellationToken cancellationToken)\n at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass31_0.<<InvokeAsync>b__0>d.MoveNext()\n— End of stack trace from previous location where exception was thrown —\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func1 executeAsync, Func1 prepareRequest, IRequestRetryPolicy2 policy, CancellationToken cancellationToken, Func1 inBackoffAlternateCallbackMethod, Nullable1 minBackoffForInBackoffCallback)\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func1 executeAsync, Func1 prepareRequest, IRequestRetryPolicy2 policy, CancellationToken cancellationToken, Func1 inBackoffAlternateCallbackMethod, Nullable1 minBackoffForInBackoffCallback)\n at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func2 prepareRequestAsyncDelegate)\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func1 callbackMethod, Func3 callShouldRetry, Func3 callShouldRetryException, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Team.Services.Common.CosmosDb.Metrics.CosmosDbMetricsRequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action1 requestEnricher, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ExtractPartitionKeyAndProcessItemStreamAsync[T](Nullable1 partitionKey, String itemId, T item, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ContainerCore.ReplaceItemAsync[T](T item, String id, ITrace trace, Nullable1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)\n at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task)”},“children”:[{“name”:“ItemSerialize”,“id”:“f1df4548-aca2-4c4c-b3d1-628be7929927”,“caller info”:{“member”:“ExtractPartitionKeyAndProcessItemStreamAsync”,“file”:“ContainerCore.Items.cs”,“line”:941},“start time”:“01:03:27:626”,“duration in milliseconds”:0.037},{“name”:“Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler”,“id”:“297adf96-e2e9-469b-8377-b9112b372d58”,“start time”:“01:03:27:626”,“duration in milliseconds”:1907.3794,“children”:[{“name”:“Team.Services.Common.CosmosDb.Metrics.CosmosDbMetricsRequestHandler”,“id”:“cde494c7-1337-410b-ab0f-d1d134bff1d0”,“start time”:“01:03:27:626”,“duration in milliseconds”:1907.2991,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler”,“id”:“fa46111b-3207-40f4-83f2-8df7f6c51000”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.8139,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.RetryHandler”,“id”:“ea078910-bd49-43d2-a2e5-bcf1f75eaa20”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.7529,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.RouterHandler”,“id”:“b97c43e4-7dd0-425b-ad9b-063d22c3f2e9”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.6648,“children”:[{“name”:“Microsoft.Azure.Cosmos.Handlers.TransportHandler”,“id”:“9eb400c9-7db3-42f0-a6ca-04b8ea09d665”,“start time”:“01:03:27:626”,“duration in milliseconds”:1906.6244,“children”:[{“name”:“Microsoft.Azure.Documents.ServerStoreModel Transport Request”,“id”:“d0dd59bc-daf8-4277-859d-a006f2c8b3fa”,“caller info”:{“member”:“ProcessMessageAsync”,“file”:“TransportHandler.cs”,“line”:109},“start time”:“01:03:27:626”,“duration in milliseconds”:1906.4729,“data”:{“Client Side Request Stats”:{“Id”:“AggregatedClientSideRequestStatistics”,“ContactedReplicas”:[{“Count”:1,<Removed for security>],“FailedReplicas”:[],“AddressResolutionStatistics”:[],“StoreResponseStatistics”:[{“ResponseTimeUTC”:“2022-02-03T01:03:27.6953683Z”,“ResourceType”:“Document”,“OperationType”:“Replace”,“LocationEndpoint”:“endpoint/”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“Ok”,“SubStatusCode”:“Unknown”,“LSN”:9692316,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:-1,“UsingLocalLSN”:false,“QuorumAckedLSN”:9692315,“SessionToken”:“0#9692316#7=-1”,“CurrentWriteQuorum”:3,“CurrentReplicaSetSize”:4,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:“rntbd://:14068/apps/ba19ffe5-e50d-4d28-9b5d-c069d0e15f9a/services/f707bbef-ed4b-4815-82f7-ccd9e8b7f212/partitions/b691ef48-742a-4c3d-b1c5-1117beeb0a6d/replicas/132876734229942716p/”,“RequestCharge”:16,“RetryAfterInMs”:null,“BELatencyInMs”:“5.287”,“transportRequestTimeline”:{“requestTimeline”:[{“event”: “Created”, “startTimeUtc”: “2022-02-03T01:03:27.6264893Z”, “durationInMs”: 0.0027},{“event”: “ChannelAcquisitionStarted”, “startTimeUtc”: “2022-02-03T01:03:27.6264920Z”, “durationInMs”: 0.009},{“event”: “Pipelined”, “startTimeUtc”: “2022-02-03T01:03:27.6265010Z”, “durationInMs”: 0.122},{“event”: “Transit Time”, “startTimeUtc”: “2022-02-03T01:03:27.6266230Z”, “durationInMs”: 68.6575},{“event”: “Received”, “startTimeUtc”: “2022-02-03T01:03:27.6952805Z”, “durationInMs”: 0.0698},{“event”: “Completed”, “startTimeUtc”: “2022-02-03T01:03:27.6953503Z”, “durationInMs”: 0}],“requestSizeInBytes”:2902,“requestBodySizeInBytes”:851,“responseMetadataSizeInBytes”:500},“TransportException”:null}​},{“ResponseTimeUTC”:“2022-02-03T01:03:27.7595265Z”,“ResourceType”:“Collection”,“OperationType”:“Head”,“LocationEndpoint”:“https://”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“NoContent”,“SubStatusCode”:“Unknown”,“LSN”:9692310,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:8971340,“UsingLocalLSN”:false,“QuorumAckedLSN”:-1,“SessionToken”:“0#9692310#7=-1”,“CurrentWriteQuorum”:-1,“CurrentReplicaSetSize”:-1,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:“rntbd://:/apps//services/f707bbef-ed4b-4815-82f7-ccd9e8b7f212/partitions/b691ef48-742a-4c3d-b1c5-1117beeb0a6d/replicas/132860494913145228s/”,“RequestCharge”:0,“RetryAfterInMs”:null,“BELatencyInMs”:“0.451”,“transportRequestTimeline”:{“requestTimeline”:[{“event”: “Created”, “startTimeUtc”: “2022-02-03T01:03:27.6956262Z”, “durationInMs”: 0.0027},{“event”: “ChannelAcquisitionStarted”, “startTimeUtc”: “2022-02-03T01:03:27.6956289Z”, “durationInMs”: 0.0026},{“event”: “Pipelined”, “startTimeUtc”: “2022-02-03T01:03:27.6956315Z”, “durationInMs”: 0.1074},{“event”: “Transit Time”, “startTimeUtc”: “2022-02-03T01:03:27.6957389Z”, “durationInMs”: 63.7201},{“event”: “Received”, “startTimeUtc”: “2022-02-03T01:03:27.7594590Z”, “durationInMs”: 0.0621},{“event”: “Completed”, “startTimeUtc”: “2022-02-03T01:03:27.7595211Z”, “durationInMs”: 0}],“requestSizeInBytes”:1881,“responseMetadataSizeInBytes”:298},“TransportException”:null}​},{“ResponseTimeUTC”:“2022-02-03T01:03:27.8362170Z”,“ResourceType”:“Collection”,“OperationType”:“Head”,“LocationEndpoint”:“/”,“StoreResult”:{“ActivityId”:“82a312ab-765d-4362-abc1-257425ef55d5”,“StatusCode”:“NoContent”,“SubStatusCode”:“Unknown”,“LSN”:9692316,“PartitionKeyRangeId”:“0”,“GlobalCommittedLSN”:9692310,“ItemLSN”:8971340,“UsingLocalLSN”:false,“QuorumAckedLSN”:-1,“SessionToken”:“0#9692316#7=-1”,“CurrentWriteQuorum”:-1,“CurrentReplicaSetSize”:-1,“NumberOfReadRegions”:1,“IsValid”:true,“StorePhysicalAddress”:"rntbd://c/apps/ba
</body> </html>`

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

The CancellationToken is stopping the quorum verification for Strong/Bounded Staleness, that means that quorum, for some reason, is not being achieved within those 2 seconds. The 20 Collection HEAD requests mean that the SDK checked 20 times for quorum until the token cancelled. It would be good to investigate why is it taking that long for quorum. Increasing the token time would certainly help with mitigation.

First step would be to increase the CancellationToken time and see if that impacts. Afterwards, if the issue persists, filing a support ticket should help the service team investigate what could be the reason. For the service ticket it would be ideal to have the full exception if possible (no truncation).