onnxruntime: C++ is 10x slower than Python, CPU only

Describe the bug I have a PyTorch model that I converted to ONNX (no issue here). I then run that model, using the CPU, in both Python and C++ (no issue here). The inputs and outputs are the same in both runs and they are correct. The C++ run is much slower (150ms) than the Python one (17ms).

At the moment I’m assuming it is a simple configuration issue, so I made sure to set everything I could on both runs:

C++

#include <onnxruntime_cxx_api.h>

Ort::Env environment(ORT_LOGGING_LEVEL_WARNING, "onnx_test");
Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
session_options.EnableCpuMemArena();
session_options.EnableMemPattern();
session_options.SetIntraOpNumThreads(0); // 0 = let ORT pick the default
session_options.SetInterOpNumThreads(0); // 0 = let ORT pick the default
Ort::Session session(environment, model_path.c_str(), session_options);
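
For context, the per-inference call on the C++ side is the usual Run pattern. This is only a minimal sketch of it; the input/output names and the shape are placeholders, not my actual model’s:

// Placeholder shape and tensor names, for illustration only.
std::vector<float> input_data(1 * 3 * 224 * 224, 0.f);
std::array<int64_t, 4> shape{1, 3, 224, 224};
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
    memory_info, input_data.data(), input_data.size(), shape.data(), shape.size());
const char* input_names[] = {"input"};
const char* output_names[] = {"output"};
// This Run call is the part to time when comparing C++ against Python.
auto outputs = session.Run(Ort::RunOptions{nullptr},
                           input_names, &input_tensor, 1,
                           output_names, 1);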

Python

import onnxruntime as ort

options = ort.SessionOptions()
options.enable_profiling = True
options.enable_mem_pattern = True
options.enable_cpu_mem_arena = True
options.enable_mem_reuse = False
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
options.inter_op_num_threads = 0
options.intra_op_num_threads = 0
ort_sess = ort.InferenceSession("my_pretty_model.onnx", options, providers=["CPUExecutionProvider"])

Any idea about what could cause such a difference?

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): Source
  • ONNX Runtime version: 1.10
  • Python version: 3.8
  • Visual Studio version (if applicable): 2019

Expected behavior I expected the C++ run to be at least as fast as the Python one.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

I was able to improve the performance significantly in C++ by tweaking the threads, as explained here: https://stackoverflow.com/questions/75241204/why-onnxruntime-runs-2-3x-slower-in-c-than-python#comment132771477_75241204. With the default settings it was about 2x slower on some platforms (Windows PC). I suspect it was using more threads than the number of cores, and the excess communication between the intra-op threads was the bottleneck. I also tried to recompile, as suggested here, but it did not help.
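
In code, the tweak amounts to something like this (a sketch; the thread counts are per-machine assumptions to sweep, not fixed recommendations):

Ort::SessionOptions session_options;
// Size the intra-op pool explicitly instead of letting ORT derive it from
// the number of logical cores; measure a few values on the target machine.
session_options.SetIntraOpNumThreads(4); // assumption: physical core count
session_options.SetInterOpNumThreads(1); // assumption: little branch parallelism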

@Roios Thanks for your help, I have solved it.

It seems that “multithreading” and “graph optimization” are what make the C++ version very slow. At first I initialized the session like this:

Ort::SessionOptions session_options;
session_options.SetIntraOpNumThreads(num_threads); // num_threads = 16
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
session_options.SetLogSeverityLevel(4);

Ort::Session* ort_session = new Ort::Session(ort_env, onnx_path, session_options);

Inference is very slow, about 900ms.

Then I removed all of these options:

Ort::Session* ort_session = new Ort::Session(ort_env, onnx_path, Ort::SessionOptions{ nullptr }); // null options: ORT falls back to its defaults

The speed reached around 160ms, which is comparable to Python. I tried different values of num_threads; all of them slowed down inference to some extent.

I’m not very familiar with ONNX, so this is confusing to me.

I’m hoping to make it even faster with optimizations; 160ms is only what Python can already do.
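
Based on the thread-tweaking comment above, the next thing I want to try is keeping the graph optimizations but pinning the intra-op pool to a single thread (a sketch; the thread count is a guess to sweep, not a known-good value):

Ort::SessionOptions session_options;
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
session_options.SetIntraOpNumThreads(1); // try 1, 2, 4, ... and time each
Ort::Session* ort_session = new Ort::Session(ort_env, onnx_path, session_options);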


My model runs inference on video sequences. There is an LSTM in the model, which means that each inference needs to use the results of the previous one. Is this why multithreading makes it slower?
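
To make that dependency concrete, here is a sketch of the inference loop (reusing ort_session from above; the state names, sizes, and shapes are made-up placeholders, not my model’s real interface). Each Run consumes the state produced by the previous Run, so the calls form a strict chain that extra threads cannot overlap:

// Hypothetical state/frame sizes and tensor names, for illustration only.
std::vector<float> h(256, 0.f), c(256, 0.f);
std::array<int64_t, 2> state_shape{1, 256};
std::vector<float> frame(3 * 224 * 224, 0.f);
std::array<int64_t, 4> frame_shape{1, 3, 224, 224};
const char* in_names[] = {"frame", "h_in", "c_in"};
const char* out_names[] = {"pred", "h_out", "c_out"};
auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

for (int t = 0; t < 100; ++t) { // e.g. 100 video frames
    Ort::Value inputs[] = {
        Ort::Value::CreateTensor<float>(mem, frame.data(), frame.size(),
                                        frame_shape.data(), frame_shape.size()),
        Ort::Value::CreateTensor<float>(mem, h.data(), h.size(),
                                        state_shape.data(), state_shape.size()),
        Ort::Value::CreateTensor<float>(mem, c.data(), c.size(),
                                        state_shape.data(), state_shape.size())};
    auto outputs = ort_session->Run(Ort::RunOptions{nullptr},
                                    in_names, inputs, 3, out_names, 3);
    // Frame t+1 cannot start until the state from frame t is copied back.
    float* h_out = outputs[1].GetTensorMutableData<float>();
    float* c_out = outputs[2].GetTensorMutableData<float>();
    h.assign(h_out, h_out + h.size());
    c.assign(c_out, c_out + c.size());
}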

But I used the same configuration in Python; why didn’t it slow down inference there? The Python version looks like this:

import onnxruntime as ort

options = ort.SessionOptions()
options.inter_op_num_threads = 16
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
options.log_severity_level = 4
sess1 = ort.InferenceSession(f'./{weight}', options)
sess2 = ort.InferenceSession(f'./{weight}')
# sess1 and sess2 are almost the same speed

I’ve tried the release packages and the NuGet packages, and they give the same result. I’m now trying to build from source to see if that gives better results.