azure-cosmos-dotnet-v3: NullReferenceException in GoneAndRetryWithRequestRetryPolicy.TryHandleResponseSynchronously in SDK 3.18

We’re occasionally seeing a NullReferenceException in production code running in Azure on SDK 3.18. It happens intermittently, in bursts (we saw ~1200 in a one-minute interval at 2021-05-19T15:55:00Z), across different operations (Read, Create, Query).

Stack: “innermostType”: System.NullReferenceException, “innermostMessage”: Object reference not set to an instance of an object., “details”:
   at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy`1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult)
   at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Documents.StoreClient.<ProcessMessageAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.<ProcessMessageAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.<SendAsync>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.RouterHandler.<SendAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.RequestHandler.<SendAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.<ExecuteHttpRequestAsync>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.<SendAsync>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.RequestHandler.<SendAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.RequestHandler.<SendAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.<SendAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.<SendAsync>d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.ContainerCore.<ProcessItemStreamAsync>d__87.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.ContainerCore.<ReadItemStreamAsync>d__55.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.ClientContextCore.<RunWithDiagnosticsHelperAsync>d__38`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.Cosmos.ClientContextCore.<OperationHelperWithRootTraceAsync>d__29`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Intercom.Azure.Helpers.CosmosDB.CosmosDBSqlClient`1.<ReadDocumentAsync>d__42.MoveNext() in C:__w\1\s\Utilities\Intercom.Azure.Helpers.NetStd\CosmosDB\CosmosDBSqlClient.cs:line 442

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 42 (18 by maintainers)

Most upvoted comments

We figured out the issue. It was introduced in the following PR.

https://github.com/Azure/azure-cosmos-dotnet-v3/pull/2312

The original RegionsContacted hashset is never set, so it is always null. Certain service-unavailable code paths check whether multiple regions were contacted. Because the hashset is null at that point, those checks throw the null reference exceptions.

https://github.com/Azure/azure-cosmos-dotnet-v3/blob/867c8c86e4d90ea1ad87b347694d1fec192e5d14/Microsoft.Azure.Cosmos/src/Tracing/TraceData/ClientSideRequestStatisticsTraceDatum.cs#L75
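
To illustrate the failure mode described in this comment, here is a minimal sketch. It is not the SDK's code: `RequestStatisticsSketch`, `RetryCheckSketch`, and both property names are hypothetical stand-ins, and the element type `Uri` is an assumption. It only shows how a set that is declared but never assigned makes a "were multiple regions contacted?" check throw, and how initializing the set avoids that.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the request statistics object; not the real SDK type.
public sealed class RequestStatisticsSketch
{
    // Buggy shape: declared but never assigned, so it is always null.
    public HashSet<Uri> RegionsContactedBuggy { get; private set; }

    // Defensive shape: always initialized, so readers never see null.
    public HashSet<Uri> RegionsContacted { get; } = new HashSet<Uri>();
}

public static class RetryCheckSketch
{
    // Dereferences the unassigned set and throws NullReferenceException.
    public static bool ContactedMultipleRegionsBuggy(RequestStatisticsSketch stats)
    {
        return stats.RegionsContactedBuggy.Count > 1;
    }

    // Reads the initialized set; the null-conditional access is a second guard.
    public static bool ContactedMultipleRegions(RequestStatisticsSketch stats)
    {
        return (stats.RegionsContacted?.Count ?? 0) > 1;
    }
}
```

With a fresh `RequestStatisticsSketch`, the buggy check throws while the defensive one returns `false`, which matches the report that the exception only surfaces on the paths that consult the set.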

Adding StackTrace with line numbers:
System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy`1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\GoneAndRetryWithRequestRetryPolicy.cs:line 179
   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\RequestRetryUtility.cs:line 88
   at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func`2 prepareRequestAsyncDelegate) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\StoreClient.cs:line 116
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\TransportHandler.cs:line 117
   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\TransportHandler.cs:line 33
   at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RouterHandler.cs:line 42
   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59
   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`3 callShouldRetryException, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\AbstractRetryHandler.cs:line 75
   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\AbstractRetryHandler.cs:line 28
   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59
   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestInvokerHandler.cs:line 78
   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action`1 requestEnricher, ITrace trace, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestInvokerHandler.cs:line 285
   at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable`1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\Container\ContainerCore.Items.cs:line 1011
   at Microsoft.Azure.Cosmos.ContainerCore.ReadItemAsync[T](String id, PartitionKey partitionKey, ITrace trace, ItemRequestOptions requestOptions, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\Container\ContainerCore.Items.cs:line 118
   at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\ClientContextCore.cs:line 430
   at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func`2 task, TraceComponent traceComponent, TraceLevel traceLevel) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\ClientContextCore.cs:line 223
   at Microsoft.Azure.Cosmos.SDK.EmulatorTests.CosmosItemTests.CreateDropItemTest() in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\tests\Microsoft.Azure.Cosmos.EmulatorTests\CosmosItemTests.cs:line 178

We are working on a fix. It will be included in the next SDK release which should be done in the next week or so.

The null reference exception should only happen in scenarios where the request was going to fail with a service unavailable exception anyway, most likely caused by high CPU or port exhaustion on the client. The null reference will not happen in other scenarios.
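
Since the exception stands in for what would otherwise be a service unavailable failure, an application stuck on the affected version could treat it as transient until it upgrades. The sketch below is one possible stopgap, not an SDK API or a maintainer recommendation: `TransientFailureSketch`, `ExecuteWithRetryAsync`, and `IsFromSdkRetryPolicy` are hypothetical names, and matching on the stack trace text is a heuristic assumption based on the traces in this thread.

```csharp
using System;
using System.Threading.Tasks;

public static class TransientFailureSketch
{
    // Hypothetical helper: runs the operation and retries when the caller's
    // predicate classifies the exception as transient, up to maxAttempts.
    public static async Task<T> ExecuteWithRetryAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? delay = null)
    {
        if (operation == null) throw new ArgumentNullException(nameof(operation));
        if (isTransient == null) throw new ArgumentNullException(nameof(isTransient));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Fixed pause between attempts; the last attempt's failure propagates.
                await Task.Delay(delay ?? TimeSpan.FromMilliseconds(200));
            }
        }
    }

    // Heuristic (assumption): a NullReferenceException whose stack trace names
    // the retry policy type seen in this issue.
    public static bool IsFromSdkRetryPolicy(Exception ex)
    {
        return ex is NullReferenceException
            && ex.StackTrace != null
            && ex.StackTrace.Contains("GoneAndRetryWithRequestRetryPolicy");
    }
}
```

A call site might look like `await TransientFailureSketch.ExecuteWithRetryAsync(() => container.ReadItemAsync<MyItem>(id, partitionKey), TransientFailureSketch.IsFromSdkRetryPolicy)`, where `MyItem`, `id`, and `partitionKey` are placeholders. Retrying does not relieve the CPU or port pressure behind the failures, so this only narrows the blast radius while waiting for the fixed release.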

The spikes in exceptions are correlated with bursts of activity on our servers that caused the CPU to max out so they’re likely an artifact of resource starvation on the client. I would expect errors in this situation but not NullRefExceptions.

Oh, and we haven’t seen it in any of our test/integration environments.

@ealsur Yes, we do run the service in eus2euap but there’s very little traffic there and we haven’t seen the exception in that region. Most of the exceptions are coming from France Central but we have seen it in several US regions as well.