azure-sdk-for-java: [BUG] - Spring Boot application threads are blocked indefinitely querying in Cosmos using Spring Data

Describe the bug Spring boot application based on Tomcat - Spring Data Cosmos - at some point, all HTTP threads are in WAITING state for Spring Data Cosmos parallel-* threads. When it happens, the application is frozen (CPU 0% and high memory usage), and only restart can fix it.

Exception or Stack Trace No exceptions were explicitly found. The Application thread dump looks like:

HTTP Pool

"http-nio-8080-exec-378" #818 daemon prio=5 os_prio=0 tid=0x00000000247a5000 nid=0x84cc waiting on condition [0x0000000047c7c000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007315cb968> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:179)
        at java.util.Iterator.forEachRemaining(Iterator.java:115)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at com.cosmos.example.web.rest.PetDelegateImpl.searchPet(PetDelegateImpl.java:66)
        at com.cosmos.example.server.Pets.api.PetsApi.searchPet(PetsApi.java:79)
        at sun.reflect.GeneratedMethodAccessor109.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

cosmos-parallel-x

"cosmos-parallel-2" #39 daemon prio=5 os_prio=0 tid=0x000000002798d000 nid=0x54c4 waiting on condition [0x000000002e4bf000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c3138198> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

parallel-x

"parallel-3" #61 daemon prio=5 os_prio=0 tid=0x000000001ff2a000 nid=0x8f20 waiting on condition [0x0000000032e3f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c3649c28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - None

To Reproduce This issue can be replicated with a simple application stress test using the Apache Jmeter tool. The Cosmos collection had about 600.000 documents. No 429 error code has been found during the tests.

Code Snippet

Analyzing Azure Spring Data Cosmos repository, I potentially found the root cause in the usage of Reactor API in the class com.azure.spring.data.cosmos.core.CosmosTemplate:

   public <T> T findById(String containerName, Object id, Class<T> domainType) {
        Assert.hasText(containerName, "containerName should not be null, empty or only whitespaces");
        Assert.notNull(domainType, "domainType should not be null");

        final String query = "select * from root where root.id = @ROOT_ID";
        final SqlParameter param = new SqlParameter("@ROOT_ID", CosmosUtils.getStringIDValue(id));
        final SqlQuerySpec sqlQuerySpec = new SqlQuerySpec(query, param);
        final CosmosQueryRequestOptions options = new CosmosQueryRequestOptions();
        options.setQueryMetricsEnabled(this.queryMetricsEnabled);
        return cosmosAsyncClient
            .getDatabase(this.databaseName)
            .getContainer(containerName)
            .queryItems(sqlQuerySpec, options, JsonNode.class)
            .byPage()
            .publishOn(Schedulers.parallel())
            .flatMap(cosmosItemFeedResponse -> {
                CosmosUtils.fillAndProcessResponseDiagnostics(this.responseDiagnosticsProcessor,
                    cosmosItemFeedResponse.getCosmosDiagnostics(), cosmosItemFeedResponse);
                return Mono.justOrEmpty(cosmosItemFeedResponse
                    .getResults()
                    .stream()
                    .map(cosmosItem -> emitOnLoadEventAndConvertToDomainObject(domainType, containerName, cosmosItem))
                    .findFirst());
            })
            .onErrorResume(throwable ->
                CosmosExceptionUtils.findAPIExceptionHandler("Failed to find item", throwable,
                    this.responseDiagnosticsProcessor))
            .blockFirst();
    }
  • The blockFirst Api, without a timeout, forces the Thread to wait indefinitely (Reactor provides also the API that includes the timeout)
  • Schedulers.parallel() should be used for fast execution as for its documentation “Optimized for fast Runnable non-blocking executions”. This code could be configurable.

Expected behavior Any thread should not wait indefinitely, but at least it should raise an exception.

Setup (please complete the following information):

  • OS: Windows 10
  • IDE: IntelliJ
  • Library/Libraries: azure-spring-data-cosmos-3.16.0.jar
  • Java version: 8
  • App Server/Environment: Tomcat Embedded
  • Frameworks: Spring Boot 2.3.6.RELEASE

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 14
  • Comments: 19 (10 by maintainers)

Most upvoted comments

@kushagraThapar, I was giving a look at this page:

https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-java

and I saw that some best practices seem are not followed by Spring Data Cosmos, in particular:

Use Appropriate Scheduler (Avoid stealing Event loop IO Netty threads)

and

End-to-End Timeouts

Please give me feedback about it. I really appreciate any help you can provide.