onnxruntime: [Performance] Slowdown from multiple inference sessions in serial.
Describe the issue
I am running multiple models as part of a pipeline (first head orientation detection, then face detection, and finally face recognition). I create an Ort::Session for each of my models.
If I benchmark each model on its own, I get the following inference latencies:
Average time, head orientation detection: 35.28 ms
Average time, face detection: 6.8 ms
Average time, face recognition: 32.25 ms
However, in a real deployment, the models need to be run in serial, first the head orientation detection model, then the face detection model, and finally the output passed to the face recognition model. This pipeline is repeated for every input.
If I then benchmark the models running in serial in a loop, I get the following total latency:
Average time, head orientation detection + face detection + face recognition: 160.825 ms
I would expect the total latency to be roughly the sum of the three individual latencies, 35.28 + 6.8 + 32.25 ≈ 74 ms (allowing for some increase due to less efficient caching). However, the actual combined latency is more than double that.
Why is the actual combined latency so much higher? I suspect it has to do with threading. I am using the following session options:
m_sessionOptsPtr->SetIntraOpNumThreads(8);
m_sessionOptsPtr->SetInterOpNumThreads(8);
m_sessionOptsPtr->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
m_sessionOptsPtr->AddConfigEntry(kOrtSessionOptionsConfigSetDenormalAsZero, "1"); // See this issue: https://github.com/microsoft/onnxruntime/issues/15630
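For context, each model in the pipeline gets its own session built from these options, roughly along these lines (the function name and model path below are placeholders, not my actual code):

```cpp
#include <onnxruntime_cxx_api.h>

// Each model gets its own Ort::Session. By default, every session created
// this way owns its own intra-op thread pool, even when sessions never run
// concurrently -- which is the setup described in this issue.
Ort::Session makeSession(Ort::Env& env,
                         const Ort::SessionOptions& opts,
                         const char* modelPath) {
  return Ort::Session(env, modelPath, opts);
}
```

Note that SetInterOpNumThreads only has an effect when the session's execution mode is set to parallel; with the default sequential mode, only the intra-op pool is used.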
What I have noticed is that each session uses 8 threads while running inference, as expected. However, I'd expect those threads to be released once inference completes, so they can be used by the next session in the pipeline.
Instead, each session appears to hold on to the threads in its own thread pool even when it is not actually running inference, saturating my entire CPU (even though only one session runs inference at a time). How do I resolve this? Is there some way to get the threads released after inference is complete?
Since I'm running inference in serial, I'd expect CPU usage to stay around 800%, since each session can only use 8 threads and only one session runs inference at a time. The image above shows what is actually happening instead.
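One likely contributor (my assumption, based on ONNX Runtime's documented spin-wait behavior, not confirmed in this thread): idle pool threads busy-spin for a while waiting for more work, which shows up as CPU usage even when a session is not running inference. Spinning can be disabled per session via config entries, trading a small wake-up latency for much lower idle CPU usage:

```cpp
#include <onnxruntime_cxx_api.h>
#include <onnxruntime_session_options_config_keys.h>

// Disable busy-spinning so idle thread-pool threads block on a condition
// variable instead of burning CPU between Run() calls.
void disableSpinning(Ort::SessionOptions& opts) {
  opts.AddConfigEntry(kOrtSessionOptionsConfigAllowIntraOpSpinning, "0");
  opts.AddConfigEntry(kOrtSessionOptionsConfigAllowInterOpSpinning, "0");
}
```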
Benchmarks run on i9 11th gen (16 cores)
To reproduce
Create multiple inference sessions, benchmark each one separately, and then benchmark them running in serial.
Urgency
Somewhat urgent; this is in use as part of a commercial product.
Platform
Linux
OS Version
Ubuntu 20.04.5 LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
v1.15.1
ONNX Runtime API
C++
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (7 by maintainers)
I see, that makes sense. Thank you for the Grade A support, much appreciated.
For anyone who comes across this thread and wonders how to implement a global thread pool:
You can do it in a couple of ways.
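A minimal sketch of one approach, using the C++ API's global thread pool (Ort::ThreadingOptions together with DisablePerSessionThreads(); the model file names below are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

// Share one thread pool across all sessions instead of one pool per session.
int main() {
  Ort::ThreadingOptions tp_options;
  tp_options.SetGlobalIntraOpNumThreads(8);
  tp_options.SetGlobalInterOpNumThreads(8);
  tp_options.SetGlobalSpinControl(0);  // optional: don't busy-spin when idle

  // The Env owns the global pools; it must outlive every session.
  Ort::Env env(tp_options, ORT_LOGGING_LEVEL_WARNING, "pipeline");

  Ort::SessionOptions opts;
  opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  opts.DisablePerSessionThreads();  // use the Env's global pools

  // All three sessions now draw from the same 8 intra-op threads.
  Ort::Session headSession(env, "head_orientation.onnx", opts);
  Ort::Session faceSession(env, "face_detection.onnx", opts);
  Ort::Session recogSession(env, "face_recognition.onnx", opts);
  return 0;
}
```

With this setup, the total CPU footprint is bounded by the single shared pool rather than by the sum of the per-session pools.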