onnxruntime: [Performance] Slowdown from multiple inference sessions in serial.

Describe the issue

I am running multiple models as part of a pipeline (first head orientation detection, then face detection, and finally face recognition). I create an Ort::Session for each of my models.

If I benchmark each of the models alone, then I get the following inference latency:

Average time head orientation detection: 35.28 ms
Average time face detection: 6.8 ms
Average time face recognition: 32.25 ms

However, in a real deployment, the models need to be run in serial, first the head orientation detection model, then the face detection model, and finally the output passed to the face recognition model. This pipeline is repeated for every input.

If I then benchmark the models running in serial in a loop, I get the following total latency:

Average time head orientation detection + face detection + recognition: 160.825 ms

I would expect the total latency to be roughly the sum of the individual latencies, ~77 ms (allowing for some increase due to less efficient caching). However, the actual combined latency is significantly higher than that.

Why is the actual combined latency so much higher? I suspect it has to do with threading. I am using the following session options:

m_sessionOptsPtr->SetIntraOpNumThreads(8);
m_sessionOptsPtr->SetInterOpNumThreads(8);

m_sessionOptsPtr->SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);

m_sessionOptsPtr->AddConfigEntry(kOrtSessionOptionsConfigSetDenormalAsZero, "1"); // See this issue: https://github.com/microsoft/onnxruntime/issues/15630

What I have noticed is that each session, while running inference, uses the 8 threads as expected. However, I'd expect those threads to be released once inference completes, so that they can be used by the next session in the pipeline.

Instead, each session appears to hold on to the threads in its thread pool (even when it is not actually running inference), and this saturates my entire CPU, even though only one session is running inference at a time. How do I resolve this? Is there some way to get the threads released after inference completes?
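For anyone hitting the same symptom: sustained CPU usage from an idle session is typically the intra-op worker threads spin-waiting for more work between Run() calls, not actual computation. A hedged sketch of turning that off via the kOrtSessionOptionsConfigAllowIntraOpSpinning key (defined in onnxruntime_session_options_config_keys.h; verify it exists in your ORT version):

```cpp
// Sketch: disable intra-op spin-waiting so worker threads sleep between
// Run() calls instead of busy-waiting on the CPU.
#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

Ort::SessionOptions makeSessionOptions(int numThreads) {
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(numThreads);
    // "0" = idle worker threads block instead of spinning for new work.
    opts.AddConfigEntry(kOrtSessionOptionsConfigAllowIntraOpSpinning, "0");
    return opts;
}
```

This reduces idle CPU burn at the cost of slightly higher wake-up latency per Run() call.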

[image: CPU usage screenshot] Since I'm running inference in serial, I'd expect CPU usage to stay at 800%, since each session can only use 8 threads and only one session is running inference at a time. The image above shows what is actually happening instead.

Benchmarks were run on an 11th-gen i9 (16 cores)

To reproduce

Create multiple inference sessions, benchmark each model separately, and then benchmark them running in serial.

Urgency

Somewhat urgent; this is in use as part of a commercial product.

Platform

Linux

OS Version

Ubuntu 20.04.5 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

v1.15.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (7 by maintainers)

Most upvoted comments

I see, that makes sense. Thank you for the Grade A support, much appreciated.

For anyone who comes across this thread and wonders how to implement a global thread pool:

#include <iostream>
#include <memory>
#include <chrono>
#include <atomic>
#include <array>
#include <vector>
#include <string>
#include <cstring> // strdup
#include <opencv2/opencv.hpp>
#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

#define USE_GLOBAL_THREADPOOL

// Stopwatch Utility
template <typename Clock = std::chrono::high_resolution_clock>
class Stopwatch
{
    typename Clock::time_point start_point;
public:
    Stopwatch() :start_point(Clock::now()){}

    // Returns elapsed time
    template <typename Rep = typename Clock::duration::rep, typename Units = typename Clock::duration>
    Rep elapsedTime() const {
        std::atomic_thread_fence(std::memory_order_relaxed);
        auto counted_time = std::chrono::duration_cast<Units>(Clock::now() - start_point).count();
        std::atomic_thread_fence(std::memory_order_relaxed);
        return static_cast<Rep>(counted_time);
    }
};

using preciseStopwatch = Stopwatch<>;
using systemStopwatch = Stopwatch<std::chrono::system_clock>;
using monotonicStopwatch = Stopwatch<std::chrono::steady_clock>;

std::vector<float> hwcToChw(const cv::Mat& rgbImage, const std::array<float, 3>& subVals = {0.f, 0.f, 0.f},
                            const std::array<float, 3>& divVals = {1.f, 1.f, 1.f}, bool normalize = true) {
    float normFactor = 255.f;
    if (!normalize) {
        normFactor = 1.f;
    }
    std::vector<float> dataBuffer;
    dataBuffer.resize(rgbImage.cols * rgbImage.rows * rgbImage.channels());
    int insertIdx = 0;
    // hwc to chw conversion
    for (int c = 0; c < 3; ++c) {
        for (int i = 0; i < rgbImage.rows; ++i) {
            for (int j = 0; j < rgbImage.cols; ++j) {
                const auto retrievalIdx = (i * rgbImage.cols + j) * 3 + c;
                const auto transformIdx = retrievalIdx % 3;
                dataBuffer[insertIdx++] =  (static_cast<float>(rgbImage.data[retrievalIdx]) - (subVals[transformIdx] * normFactor)) / (divVals[transformIdx] * normFactor);

            }
        }
    }
    return dataBuffer;
}

class OnnxRunner {
public:
    OnnxRunner(const std::string& modelPath, int numThreads) {
#ifdef USE_GLOBAL_THREADPOOL
        const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
        OrtThreadingOptions* tp_options = nullptr;
        Ort::ThrowOnError(g_ort->CreateThreadingOptions(&tp_options));
        Ort::ThrowOnError(g_ort->SetGlobalIntraOpNumThreads(tp_options, numThreads));
        Ort::ThrowOnError(g_ort->SetGlobalDenormalAsZero(tp_options));
        m_ortEnvPtr = std::make_unique<Ort::Env>(tp_options, ORT_LOGGING_LEVEL_WARNING, "Default");
        g_ort->ReleaseThreadingOptions(tp_options);
#endif
        Ort::SessionOptions session_options;
#ifdef USE_GLOBAL_THREADPOOL
        session_options.DisablePerSessionThreads();
        std::cout << "Using global thread pool" << std::endl;
#else
        std::cout << "NOT using global thread pool" << std::endl;
        session_options.SetIntraOpNumThreads(numThreads);
        session_options.AddConfigEntry(kOrtSessionOptionsConfigSetDenormalAsZero, "1");
        session_options.AddConfigEntry(kOrtSessionOptionsConfigAllowIntraOpSpinning, "0");

        m_ortEnvPtr = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "default");
#endif

        m_sessionPtr = std::make_unique<Ort::Session>(*m_ortEnvPtr, modelPath.c_str(), session_options);

        Ort::AllocatorWithDefaultOptions allocator;
        m_numInputs = m_sessionPtr->GetInputCount();
        m_inputNodeNames.resize(m_numInputs);
        for (int i = 0; i < m_numInputs; ++i) {
            // Note: strdup'd names are never freed (leaked for the process lifetime)
            m_inputNodeNames[i] = strdup(m_sessionPtr->GetInputNameAllocated(i, allocator).get());
            auto typeInfo = m_sessionPtr->GetInputTypeInfo(i);
            auto tensorInfo = typeInfo.GetTensorTypeAndShapeInfo();
            auto inputDims = tensorInfo.GetShape();
            size_t tensorSize = 1;
            for (auto & inputDim : inputDims) {
                if (inputDim == -1) {
                    inputDim = 1;
                }
                tensorSize *= inputDim;
            }

            m_inputNodeDims.push_back(inputDims);
            m_inputTensorSize.push_back(tensorSize);
        }
        m_numOutputs = m_sessionPtr->GetOutputCount();
        m_outputNodeNames.resize(m_numOutputs);
        for (int i = 0; i < m_numOutputs; ++i) {
            m_outputNodeNames[i] = strdup(m_sessionPtr->GetOutputNameAllocated(i, allocator).get());
            auto outputTypeInfo = m_sessionPtr->GetOutputTypeInfo(i);
            auto outputTensorInfo = outputTypeInfo.GetTensorTypeAndShapeInfo();
            auto outputDims = outputTensorInfo.GetShape();
            m_outputNodeDims.push_back(outputDims);
        }
    }

    std::vector<Ort::Value> runInference(const cv::Mat& inputRGB) {
        std::vector<std::vector<float>> inputFloatTensors(1);
        std::vector<Ort::Value> inputTensors;

        auto floatArray = hwcToChw(inputRGB);
        inputFloatTensors[0] = std::move(floatArray);
        inputTensors.emplace_back(Ort::Value::CreateTensor<float>(m_memoryInfoHandler, inputFloatTensors[0].data(),
                                                                  m_inputTensorSize[0], m_inputNodeDims[0].data(), m_inputNodeDims[0].size()));

        if (!inputTensors[0].IsTensor()) {
            throw std::invalid_argument("Unable to create tensor from provided input at idx " + std::to_string(0));
        }

        return m_sessionPtr->Run(Ort::RunOptions{nullptr}, m_inputNodeNames.data(), inputTensors.data(), inputTensors.size(),
                                 m_outputNodeNames.data(), m_numOutputs);
    }
private:
    std::unique_ptr<Ort::Env> m_ortEnvPtr = nullptr;
    std::unique_ptr<Ort::Session> m_sessionPtr = nullptr;

    // Model information
    std::vector<const char*> m_inputNodeNames;
    std::vector<std::vector<int64_t>> m_inputNodeDims;
    std::vector<size_t> m_inputTensorSize;
    Ort::MemoryInfo m_memoryInfoHandler = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<const char*> m_outputNodeNames;
    std::vector<std::vector<int64_t>> m_outputNodeDims;

    int m_numInputs = 1;
    int m_numOutputs = 1;
};

int main() {
    OnnxRunner runner1("../models/yolov8n.onnx", 8);
    OnnxRunner runner2("../models/yolov8n.onnx", 8);

    auto input = cv::imread("/home/cyrus/work/YOLOv8-TensorRT-CPP/images/640_640.jpg");
    cv::resize(input, input, cv::Size(640, 640));
    cv::cvtColor(input, input, cv::COLOR_BGR2RGB);

    int numParallelThreads = 4;

    int numIts = 100;
    {
        preciseStopwatch s;
#pragma omp parallel for num_threads(numParallelThreads)
        for (int i = 0; i < numIts; ++i) {
            auto ret = runner1.runInference(input);
        }

        auto totalTime = s.elapsedTime<float, std::chrono::milliseconds>();
        std::cout << "Runner 1: " << totalTime / numIts << "ms" << std::endl;
    }

    {
        preciseStopwatch s;
#pragma omp parallel for num_threads(numParallelThreads)
        for (int i = 0; i < numIts; ++i) {
            auto ret = runner2.runInference(input);
        }

        auto totalTime = s.elapsedTime<float, std::chrono::milliseconds>();
        std::cout << "Runner 2: " << totalTime / numIts << "ms" << std::endl;
    }

    {
        preciseStopwatch s;
#pragma omp parallel for num_threads(numParallelThreads)
        for (int i = 0; i < numIts; ++i) {
            auto ret = runner1.runInference(input);
            ret = runner2.runInference(input);
        }

        auto totalTime = s.elapsedTime<float, std::chrono::milliseconds>();
        std::cout << "Runner 1 & 2: " << totalTime / numIts << "ms" << std::endl;
    }

    return 0;
}

You can do it in a couple of ways.

  1. Use multiple sessions with one global thread pool shared by all of them. This assumes only one request is being inferenced at any point in time.
  2. If you're pipelining, you can create multiple sessions, each with its own thread pool, but ensure the threads are pinned to disjoint cores (via thread affinity). This requires configuring each session appropriately.
  3. If combining all the models into one big model works for your use case, then you can get by with just one session.
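For option 2, a hedged sketch of pinning each session's intra-op threads to different cores. The kOrtSessionOptionsConfigIntraOpThreadAffinities key and its semicolon-separated string format are assumptions based on recent ORT releases; check onnxruntime_session_options_config_keys.h in your build before relying on them:

```cpp
// Sketch of option 2: each session gets its own intra-op pool pinned to a
// disjoint set of cores, so two pipelined sessions don't contend for CPUs.
// The affinity key and its "a;b;c" format are assumptions -- verify against
// onnxruntime_session_options_config_keys.h for your ORT version.
#include "onnxruntime_cxx_api.h"
#include "onnxruntime_session_options_config_keys.h"

Ort::SessionOptions optionsPinnedTo(const char* affinities, int numThreads) {
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(numThreads);
    // One semicolon-separated entry per intra-op worker thread.
    opts.AddConfigEntry(kOrtSessionOptionsConfigIntraOpThreadAffinities, affinities);
    return opts;
}

// Usage (hypothetical core ids):
//   auto optsA = optionsPinnedTo("1;2;3", 4);  // session A workers on cores 1-3
//   auto optsB = optionsPinnedTo("5;6;7", 4);  // session B workers on cores 5-7
```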