onnxruntime: ONNX Runtime much slower than PyTorch (2-3x slower)
Describe the bug We built a UNet3D image segmentation model in PyTorch (based on this repo) and want to start distributing it. ONNX seemed like a good option, as it allows us to compress our models and the dependencies needed to run them. Our models are large and slow, so we need to run them on GPU.
We were able to convert these models to ONNX, but noticed a significant slow-down of inference (2-3x). The timing is quite critical for us, and our models are already relatively slow, so we can’t afford any further slow-down.
I’m running my comparison tests following what was done in this issue.
I could use your help to better understand where the issue is coming from and whether it is resolvable at all. What tests, settings, etc. can I try to see where the issue might be?
Urgency This is quite an urgent issue: we need to deliver our models to our clients in the coming month and will need to resort to other solutions if we can’t fix the ONNX path soon.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- ONNX Runtime installed from (source or binary): source
- ONNX Runtime version: 1.12
- Python version: 3.8.13
- Visual Studio version (if applicable):
- CUDA/cuDNN version: 11.1 / 8.0.5
- GPU model and memory: NVIDIA GeForce RTX 3060, 12GB
To Reproduce The code for the model will be quite hard to extract, so I’ll first try to describe the issue and what I’ve tested. I’m currently generating my model using:
```python
with torch.no_grad():
    torch.onnx.export(
        torchModel,
        dummyInput,
        outPath,
        export_params=True,
        opset_version=14,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={
            "input": [0],
            "output": [0],
        },
        verbose=True,
    )
```
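The ONNX Runtime side of the timing comparison looks roughly like this (a sketch; the file name and the input shape/dtype are placeholders, not the exact benchmark code):

```python
import time

import numpy as np
import onnxruntime as ort

# Load the exported model on the GPU, falling back to CPU if needed.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Placeholder 3D volume; fp16 to match the Half tensors in the exported graph.
x = np.random.randn(1, 1, 96, 96, 96).astype(np.float16)

# Warm-up run so cuDNN algorithm selection is not counted in the timing.
sess.run(["output"], {"input": x})

start = time.perf_counter()
for _ in range(10):
    sess.run(["output"], {"input": x})
print(f"ONNX Runtime: {(time.perf_counter() - start) / 10 * 1000:.1f} ms per run")
```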
The model that we are using uses the following operations:
- 3D conv
- Group normalisation
- Max pooling
- Nearest neighbor interpolation & concatenation
When converting to ONNX, I could see some weird things in the graph (see the first screenshot):
- For some operations the device is `cpu` instead of `cuda:0` like for all other operations; what does this mean? Will ONNX Runtime run these operations on the CPU? See below for the partial output of the conversion with `verbose=True`:
```
%154 : Half(*, 512, *, *, *, strides=[884736, 1728, 144, 12, 1], requires_grad=0, device=cuda:0) = onnx::Relu(%153) # /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:1297:0
%155 : Long(3, strides=[1], device=cpu) = onnx::Constant[value= 0 8 -1 [ CPULongType{3} ]]()
%156 : Half(0, 8, *, device=cpu) = onnx::Reshape[allowzero=0](%154, %155)
%157 : Half(8, strides=[1], device=cpu) = onnx::Constant[value= 1 1 1 1 1 1 1 1 [ CPUHalfType{8} ]]()
%158 : Half(8, strides=[1], device=cpu) = onnx::Constant[value= 0 0 0 0 0 0 0 0 [ CPUHalfType{8} ]]()
%159 : Half(0, 8, *, device=cpu) = onnx::InstanceNormalization[epsilon=1.0000000000000001e-05](%156, %157, %158)
%160 : Long(5, strides=[1], device=cpu) = onnx::Shape(%154)
%161 : Half(*, *, *, *, *, device=cpu) = onnx::Reshape[allowzero=0](%159, %160)
%164 : Half(*, *, *, *, *, device=cpu) = onnx::Mul(%161, %309)
%167 : Half(*, *, *, *, *, strides=[884736, 1728, 144, 12, 1], requires_grad=0, device=cuda:0) = onnx::Add(%164, %310) # /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2360:0
%169 : Long(5, strides=[1], device=cpu) = onnx::Shape(%167)
%170 : Long(1, strides=[1], device=cpu) = onnx::Constant[value={0}]()
%171 : Long(1, strides=[1], device=cpu) = onnx::Constant[value={0}]()
%172 : Long(1, strides=[1], device=cpu) = onnx::Constant[value={2}]()
%173 : Long(2, strides=[1], device=cpu) = onnx::Slice(%169, %171, %172, %170)
%175 : Long(5, strides=[1], device=cpu) = onnx::Concat[axis=0](%173, %311)
%176 : Tensor? = prim::Constant()
%177 : Tensor? = prim::Constant()
```
- The graph is rather “ugly” when compared to the one generated for ResNet (regarding all the `Mul`, `Add`, `Reshape`, etc. operations). Could this be the reason for the slowdown?
I saw that group normalisation isn’t directly supported by ONNX and thought that this might be the cause of the slow-down. I therefore tried an alternative model with the group norm removed, which led to a nicer graph (see 2nd screenshot) and to less of a slow-down (from 3x slower to 2x slower). The slow-down is still significant though, and the `Slice`, `Concat`, etc. operations still say that they occur on the `cpu`; are these then the issue?
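For reference, that group-norm pattern can be reproduced in isolation with something like the following minimal sketch (the 8 groups, 512 channels and 12³ feature map are read off the verbose output above; everything else is a placeholder):

```python
import torch

# A single GroupNorm layer is enough to reproduce the Reshape /
# InstanceNormalization / Mul / Add pattern seen in the full graph.
gn = torch.nn.GroupNorm(num_groups=8, num_channels=512).eval()
x = torch.randn(1, 512, 12, 12, 12)  # placeholder 3D feature map

torch.onnx.export(
    gn,
    x,
    "groupnorm_only.onnx",
    opset_version=14,
    input_names=["input"],
    output_names=["output"],
    verbose=True,  # prints the decomposed ops, as in the snippet above
)
```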
Overall it would be great to get some guidance on where the problem could be located: should we adapt our model architecture, the way we export to ONNX, etc.? Is it even possible at all with a model like UNet3D?
Thanks for the help!
Screenshots
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 21 (12 by maintainers)
My understanding is that cuDNN only caches the results when the input shape is static. I was able to confirm this same behavior with a Torch model having dynamic input shapes exported to ONNX.
Benchmark mode in PyTorch is what ONNX calls EXHAUSTIVE, and EXHAUSTIVE is the default ONNX setting per the documentation. PyTorch defaults to using `cudnnGetConvolutionForwardAlgorithm_v7`, which is much faster. So in this case with dynamic inputs, it leads to the Torch model appearing to run faster.
I wrote an article with detailed steps on this comparison: https://medium.com/neuml/debug-onnx-gpu-performance-c9290fe07459
This link also has a related discussion: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936/3
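If you want the PyTorch side of the comparison to use the same exhaustive search, a minimal sketch (with a placeholder model and input) is:

```python
import torch

# Make PyTorch search convolution algorithms exhaustively, which is roughly what
# ONNX Runtime's EXHAUSTIVE cudnn_conv_algo_search does. With dynamic input
# shapes, the search is re-triggered every time the shape changes.
torch.backends.cudnn.benchmark = True

# Placeholder model and input, just to illustrate where the flag applies.
model = torch.nn.Conv3d(1, 8, kernel_size=3).cuda().eval()
x = torch.randn(1, 1, 64, 64, 64, device="cuda")

with torch.no_grad():
    out = model(x)
```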
I ran into a similar issue where an ONNX model was much slower than its PyTorch counterpart on the GPU. I tried all the suggestions here, including io_binding, but nothing worked.
To solve the issue, profiling was enabled via the following code:
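A minimal version of that setup, with placeholder model path and provider, is:

```python
import onnxruntime as ort

# Turn on ONNX Runtime's built-in profiler.
so = ort.SessionOptions()
so.enable_profiling = True

session = ort.InferenceSession(
    "model.onnx",                       # placeholder path
    so,
    providers=["CUDAExecutionProvider"],
)

# ... run inference as usual ...

# Writes an onnxruntime_profile_*.json file and returns its path.
print(session.end_profiling())
```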
Once the program exited, a profiling JSON file was generated. I took a look at that to find the longest running nodes.
Skipping nodes for the full model run and session initialization, I was seeing nodes like this:
`feed_forward/w_1/Conv_kernel_time`. Reading the documentation, the following setting stood out: `cudnn_conv_algo_search`. The program was re-run with that setting changed (to either HEURISTIC or DEFAULT).
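Concretely, that setting is passed as a CUDA execution provider option, roughly like this (model path is a placeholder):

```python
import onnxruntime as ort

# Switch the cuDNN convolution algorithm search away from the EXHAUSTIVE default.
providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "HEURISTIC"}),  # or "DEFAULT"
    "CPUExecutionProvider",
]

session = ort.InferenceSession("model.onnx", providers=providers)
```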
This time the performance was equal to or even slightly better than the PyTorch model on the GPU.
I’m not sure why ONNX defaults to an EXHAUSTIVE search. Reading similar code in PyTorch, it doesn’t appear that PyTorch does (it looks like it defaults to what ONNX calls HEURISTIC), and that is the performance difference in my case.
Hope this helps anyone running into performance issues one way or another. Looking at the original post, there were a lot of Conv operations, so it’s worth a try.
The `Slice`s and `Concat`s that are being forced down to CPU are part of shape subgraphs - if you look at what they are doing, they slice out one int and concatenate 2 ints, and so on. There is no need for these ops to be hardware accelerated (in fact it is detrimental), so ORT has logic to force them to CPU to save device bandwidth for ops that actually require hardware acceleration. So I don’t believe this is the cause of the poor perf.
Have you tried using nvprof and checking which kernel takes up the most time? That is the best way to move forward with this.
First, `pip uninstall onnxruntime-training`? Then `pip install onnxruntime`?
How many inputs/outputs/operators do you have? If the number of inputs/outputs is at the same scale as the number of operators, IOBinding is super slow.
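For reference, the io_binding path being discussed looks roughly like this (input/output names follow the export snippet above; shape and dtype are placeholders):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

x = np.random.randn(1, 1, 96, 96, 96).astype(np.float16)  # placeholder input

# Bind the input and output explicitly so the host<->device copies are managed
# up front, instead of implicitly on every run() call.
binding = session.io_binding()
binding.bind_cpu_input("input", x)
binding.bind_output("output", "cuda")

session.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]
```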
Any difference is dependent on the model and the EPs that are enabled. If there are no internal ORT operators with CUDA implementations that apply to nodes the CUDA EP is taking, there won’t be a difference between ‘basic’ and ‘extended’/‘all’.
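For completeness, those optimization levels are selected through SessionOptions, e.g. (a minimal sketch with a placeholder model path):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# ORT_ENABLE_BASIC / ORT_ENABLE_EXTENDED / ORT_ENABLE_ALL control which internal
# fusions and rewrites are applied before execution.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", so, providers=["CUDAExecutionProvider"])
```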