azure-sdk-for-java: [BUG] - Spring Boot application threads are blocked indefinitely querying in Cosmos using Spring Data
Describe the bug Spring boot application based on Tomcat - Spring Data Cosmos - at some point, all HTTP threads are in WAITING state for Spring Data Cosmos parallel-* threads. When it happens, the application is frozen (CPU 0% and high memory usage), and only restart can fix it.
Exception or Stack Trace No exceptions were explicitly found. The Application thread dump looks like:
HTTP Pool
"http-nio-8080-exec-378" #818 daemon prio=5 os_prio=0 tid=0x00000000247a5000 nid=0x84cc waiting on condition [0x0000000047c7c000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007315cb968> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:179)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at com.cosmos.example.web.rest.PetDelegateImpl.searchPet(PetDelegateImpl.java:66)
at com.cosmos.example.server.Pets.api.PetsApi.searchPet(PetsApi.java:79)
at sun.reflect.GeneratedMethodAccessor109.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
cosmos-parallel-x
"cosmos-parallel-2" #39 daemon prio=5 os_prio=0 tid=0x000000002798d000 nid=0x54c4 waiting on condition [0x000000002e4bf000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c3138198> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
parallel-x
"parallel-3" #61 daemon prio=5 os_prio=0 tid=0x000000001ff2a000 nid=0x8f20 waiting on condition [0x0000000032e3f000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c3649c28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
To Reproduce This issue can be replicated with a simple application stress test using the Apache Jmeter tool. The Cosmos collection had about 600.000 documents. No 429 error code has been found during the tests.
Code Snippet
Analyzing Azure Spring Data Cosmos repository, I potentially found the root cause in the usage of Reactor API in the class com.azure.spring.data.cosmos.core.CosmosTemplate:
public <T> T findById(String containerName, Object id, Class<T> domainType) {
Assert.hasText(containerName, "containerName should not be null, empty or only whitespaces");
Assert.notNull(domainType, "domainType should not be null");
final String query = "select * from root where root.id = @ROOT_ID";
final SqlParameter param = new SqlParameter("@ROOT_ID", CosmosUtils.getStringIDValue(id));
final SqlQuerySpec sqlQuerySpec = new SqlQuerySpec(query, param);
final CosmosQueryRequestOptions options = new CosmosQueryRequestOptions();
options.setQueryMetricsEnabled(this.queryMetricsEnabled);
return cosmosAsyncClient
.getDatabase(this.databaseName)
.getContainer(containerName)
.queryItems(sqlQuerySpec, options, JsonNode.class)
.byPage()
.publishOn(Schedulers.parallel())
.flatMap(cosmosItemFeedResponse -> {
CosmosUtils.fillAndProcessResponseDiagnostics(this.responseDiagnosticsProcessor,
cosmosItemFeedResponse.getCosmosDiagnostics(), cosmosItemFeedResponse);
return Mono.justOrEmpty(cosmosItemFeedResponse
.getResults()
.stream()
.map(cosmosItem -> emitOnLoadEventAndConvertToDomainObject(domainType, containerName, cosmosItem))
.findFirst());
})
.onErrorResume(throwable ->
CosmosExceptionUtils.findAPIExceptionHandler("Failed to find item", throwable,
this.responseDiagnosticsProcessor))
.blockFirst();
}
- The
blockFirstApi, without a timeout, forces the Thread to wait indefinitely (Reactor provides also the API that includes the timeout) Schedulers.parallel()should be used for fast execution as for its documentation “Optimized for fast Runnable non-blocking executions”. This code could be configurable.
Expected behavior Any thread should not wait indefinitely, but at least it should raise an exception.
Setup (please complete the following information):
- OS: Windows 10
- IDE: IntelliJ
- Library/Libraries: azure-spring-data-cosmos-3.16.0.jar
- Java version: 8
- App Server/Environment: Tomcat Embedded
- Frameworks: Spring Boot 2.3.6.RELEASE
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 14
- Comments: 19 (10 by maintainers)
@kushagraThapar, I was giving a look at this page:
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-java
and I saw that some best practices seem are not followed by Spring Data Cosmos, in particular:
and
Please give me feedback about it. I really appreciate any help you can provide.