OpenSearch: [BUG] Search pipeline seen executing on transport_worker thread
Describe the bug
I have a search processor that uses an org.opensearch.client.Client object to execute a TransportAction. This invocation returns a BaseFuture, and the processor calls .get() on it, blocking until the client returns a response. Occasionally, the get() call blocks indefinitely and brings the whole cluster into a bad state.
A thread dump on the node that hung revealed that the search processor was executing on a transport_worker thread.
"opensearch[opensearch-node1][transport_worker][T#2]" #32 daemon prio=5 os_prio=0 cpu=61810.65ms elapsed=43062.56s allocated=2771M defined_classes=251 tid=0x0000fffef4009140 nid=0x13d waiting on condition [0x0000ffff88cda000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.8/Native Method)
- parking to wait for <0x00000000ee75b640> (a org.opensearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.8/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8/AbstractQueuedSynchronizer.java:1047)
at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:272)
at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:74)
at org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:55)
at org.opensearch.searchpipelines.questionanswering.generative.llm.DefaultLlmImpl.doChatCompletion(DefaultLlmImpl.java:84)
at org.opensearch.searchpipelines.questionanswering.generative.GenerativeQAResponseProcessor.processResponse(GenerativeQAResponseProcessor.java:109)
at org.opensearch.search.pipeline.Pipeline.transformResponse(Pipeline.java:177)
at org.opensearch.search.pipeline.PipelinedRequest.transformResponse(PipelinedRequest.java:31)
at org.opensearch.action.search.TransportSearchAction.lambda$executeRequest$0(TransportSearchAction.java:398)
at org.opensearch.action.search.TransportSearchAction$$Lambda$4884/0x0000003001d5d850.accept(Unknown Source)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.core.action.ActionListener$5.onResponse(ActionListener.java:268)
at org.opensearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:671)
at org.opensearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:132)
at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:428)
at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:422)
In a few other thread dumps I captured, I do see the same processor running on a search thread.
I can see that we don’t expect a blocking call to happen on transport threads: there is code in BaseFuture.get specifically to disallow such invocations. However, that check is an assert, so it won’t trigger unless the OpenSearch process is run with the -ea JVM flag.
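To illustrate why that check is silent in production, here is a minimal standalone sketch of an assert-based thread guard. It is not the actual OpenSearch code: the class and method names are hypothetical, and the only detail taken from this issue is the `[transport_worker]` marker in the thread name shown in the dump above.

```java
// Hypothetical sketch of an assert-based guard; not the real BaseFuture code.
final class BlockingGuardSketch {

    // Assumption: transport threads can be recognized by a marker in the thread name,
    // as in "opensearch[opensearch-node1][transport_worker][T#2]" from the dump.
    static boolean isTransportThread(String threadName) {
        return threadName.contains("[transport_worker]");
    }

    // With assertions disabled (the JVM default), this method does nothing at all,
    // so a blocking call on a transport thread goes unnoticed.
    static void assertNotTransportThread(String reason) {
        String name = Thread.currentThread().getName();
        assert !isTransportThread(name)
            : "Expected current thread [" + name + "] to not be a transport thread. Reason: [" + reason + "]";
    }

    // Common idiom to detect whether the JVM was started with -ea:
    // the assignment inside the assert only runs when assertions are enabled.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        assertNotTransportThread("Blocking operation");
    }
}
```

Running this with and without `-ea` shows the difference: the guard only has teeth when assertions are enabled.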
Can we ensure that search processors run on search threads? Or are they really allowed to run on transport threads, which would mean that I should not make any blocking calls in my search processor?
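As a stopgap while that question is open, one option might be to bound the wait so that a stuck call surfaces as a timeout instead of hanging the thread forever. The sketch below uses only java.util.concurrent types; `getOrFallback` and its parameters are hypothetical names for illustration, and it does not make blocking on a transport thread safe, it only limits how long the thread is held.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a bounded wait instead of an unbounded get().
final class BoundedWaitSketch {

    // Waits at most timeoutMillis for the future; returns the fallback otherwise.
    static String getOrFallback(CompletableFuture<String> future, long timeoutMillis, String fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on the pending call rather than parking indefinitely.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can observe it.
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            // The call itself failed; a real processor would likely propagate this.
            return fallback;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        System.out.println(getOrFallback(neverCompletes, 50, "timed out"));
    }
}
```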
About this issue
- State: closed
- Created 9 months ago
- Comments: 16 (12 by maintainers)
I cobbled together an implementation of async response processors: https://github.com/msfroh/OpenSearch/commit/6b4279c720267441299f746bdf9e7baf9ced99db
It’s mostly untested (besides unit tests), but it might let you avoid the blocking get() call.
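For readers who have not looked at the commit, the general shape of an async processor can be sketched as follows. This is not the code from that commit: `ResponseListener`, `chatCompletionAsync`, and `processResponseAsync` are hypothetical stand-ins built on plain Java types, meant only to show how a callback replaces the blocking get().

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical callback type with separate success and failure paths.
interface ResponseListener<T> {
    void onResponse(T result);
    void onFailure(Exception e);
}

final class AsyncProcessorSketch {

    // Hypothetical stand-in for the remote call the processor makes through the client.
    static CompletableFuture<String> chatCompletionAsync(String prompt) {
        return CompletableFuture.supplyAsync(() -> "answer to: " + prompt);
    }

    // Instead of blocking on the future, register a continuation and return immediately.
    // The calling thread (possibly a transport thread) is never parked.
    static void processResponseAsync(String searchResponse, ResponseListener<String> listener) {
        chatCompletionAsync(searchResponse).whenComplete((answer, error) -> {
            if (error != null) {
                listener.onFailure(new RuntimeException(error));
            } else {
                listener.onResponse(searchResponse + " | " + answer);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> done = new CompletableFuture<>();
        processResponseAsync("hits", new ResponseListener<String>() {
            @Override public void onResponse(String result) { done.complete(result); }
            @Override public void onFailure(Exception e) { done.completeExceptionally(e); }
        });
        // Blocking here is only for the demo, on the main thread.
        System.out.println(done.join());
    }
}
```

The trade-off is that the processor's result arrives on whichever thread completes the remote call, so anything downstream has to be safe to run there.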