onnxruntime: onnxruntime inference is around 5 times slower than pytorch when using GPU
Describe the bug
Inference with onnxruntime is about 5x slower than the PyTorch model on GPU, but about 2.5x faster on CPU.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- ONNX Runtime installed from (source or binary): Source
- ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
- Python version: 3.8.12
- CUDA/cuDNN version: CUDA 11.5, cuDNN 8.2
- GPU model and memory: Quadro M2000M, 4 GB
To Reproduce
import os
import time

import onnx
import onnxruntime
import torch
from torchvision import models

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def convert_to_onnx(resnet):
    # Export the model with a dynamic batch axis so other batch sizes work too.
    resnet.eval()
    dummy_input = torch.randn(batch_size, 3, 224, 224, device=device)
    input_names = ['input']
    output_names = ['output']
    torch.onnx.export(resnet,
                      dummy_input,
                      "resnet18.onnx",
                      verbose=True,
                      opset_version=13,
                      input_names=input_names,
                      output_names=output_names,
                      export_params=True,
                      do_constant_folding=True,
                      dynamic_axes={
                          'input': {0: 'batch_size'},   # variable-length axes
                          'output': {0: 'batch_size'}})

def infer_pytorch(resnet):
    print('Pytorch Inference')
    print('==========================')
    print()
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    latency = []
    for i in range(total_samples):
        t0 = time.time()
        resnet.eval()
        with torch.no_grad():
            out = resnet(x)
        latency.append(time.time() - t0)
    print('Number of runs:', len(latency))
    print("Average PyTorch {} Inference time = {} ms".format(
        device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
    print('Onnxruntime Inference')
    print('==========================')
    print()
    onnx_model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(onnx_model)

    # Input
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    x = to_numpy(x)

    so = onnxruntime.SessionOptions()
    so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

    exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    model_onnx_path = os.path.join(".", "resnet18.onnx")
    ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

    options = ort_session.get_provider_options()
    cuda_options = options['CUDAExecutionProvider']
    cuda_options['cudnn_conv_use_max_workspace'] = '1'
    ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

    # IO binding: keep the output on the GPU to avoid a device-to-host copy per run
    input_names = ort_session.get_inputs()[0].name
    output_names = ort_session.get_outputs()[0].name
    io_binding = ort_session.io_binding()
    io_binding.bind_cpu_input(input_names, x)
    io_binding.bind_output(output_names, device.type)  # bind_output expects a device-type string such as 'cuda'

    # Warm-up run
    ort_session.run_with_iobinding(io_binding)
    ort_outs = io_binding.copy_outputs_to_cpu()

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        ort_session.run_with_iobinding(io_binding)
        latency.append(time.time() - t0)
    ort_outs = io_binding.copy_outputs_to_cpu()
    print('Number of runs:', len(latency))
    print("Average onnxruntime {} Inference time = {} ms".format(
        device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

if __name__ == '__main__':
    torch.cuda.empty_cache()
    resnet = models.resnet18(pretrained=True).to(device=device)
    convert_to_onnx(resnet)
    infer_onnxruntime()
    infer_pytorch(resnet)
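A caveat about the timing loops above: CUDA kernels launch asynchronously, so wrapping resnet(x) in time.time() alone can measure kernel launch rather than kernel completion and understate PyTorch's GPU latency, while run_with_iobinding blocks until the result is ready. A minimal sketch of a synchronized PyTorch timing loop (not part of the original report; it assumes the resnet, x, and total_samples defined above):

import time
import torch

def infer_pytorch_synchronized(resnet, x, total_samples):
    # Time GPU inference with explicit synchronization so the measured
    # interval covers kernel completion, not just kernel launch.
    resnet.eval()
    latency = []
    with torch.no_grad():
        for _ in range(total_samples):
            torch.cuda.synchronize()
            t0 = time.time()
            _ = resnet(x)
            torch.cuda.synchronize()
            latency.append(time.time() - t0)
    return sum(latency) * 1000 / len(latency)  # average ms per run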
Current behavior
If run on CPU,
Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms
but, if run on GPU, I see
Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms
If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it's still slower than PyTorch.
I had read about similar issues here and ensured that I use IO binding so that the inputs are on the GPU (see the GPU-binding sketch just below).
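Note that bind_cpu_input, as used in the script above, still leaves the source buffer on the host. To rule out a host-to-device copy on every run, the input can instead be bound as a GPU-resident torch tensor. A sketch, assuming the ort_session, batch_size, and exported names from the script above:

import numpy as np
import torch

# Bind a CUDA tensor's memory directly so no host-to-device copy is needed per run.
x_gpu = torch.randn(batch_size, 3, 224, 224, device='cuda').contiguous()
io_binding = ort_session.io_binding()
io_binding.bind_input(
    name='input',                 # input name from the export above
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=tuple(x_gpu.shape),
    buffer_ptr=x_gpu.data_ptr())  # raw device pointer into the tensor
io_binding.bind_output('output', 'cuda')  # keep the output on the GPU as well
ort_session.run_with_iobinding(io_binding)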
When converting the ResNet to ONNX, I see traces like
%193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0),
so the model's parameters are on the GPU.
Further, during the processing for onnxruntime, I print device usage stats and I see this -
Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached: 0.1 GB
So, the GPU device is being used. Further, I have used the resnet18.onnx model from the Model Zoo to rule out a conversion issue, but I get the same results.
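One more sanity check worth doing: ONNX Runtime silently falls back to the CPU provider if the CUDA provider fails to load (for example, due to a CUDA/cuDNN version mismatch), so it is worth confirming which providers the session actually resolved to. A one-line sketch, assuming the ort_session from the script above:

# If 'CUDAExecutionProvider' is missing from this list, ORT has silently
# fallen back to the CPU provider despite it being requested.
print(ort_session.get_providers())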
So I cannot figure this out any further, and I have been stuck for quite a few days now. Could somebody please point out what the issue might be here?
Thanks
About this issue
- State: open
- Created 2 years ago
- Comments: 25 (7 by maintainers)
@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, onnxruntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96. Here is the output:
It seems that ORT is much faster than PyTorch on the V100.
Here is the script:
Seeing similar problems as well – we saw inference 2.3x slower on ORT + CUDA EP compared to native PyTorch. In our case, the model is a large, complicated diffusion model. I have a hunch that it's caused by the exported ONNX graph having too many glue operators (one way to check this is sketched below). Could people from Microsoft step in and confirm whether that's the case? Thanks
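One way to test the glue-operator hunch is to count the op types in the exported graph; a disproportionate number of Cast/Reshape/Transpose/Gather nodes relative to the math ops points at exporter-inserted glue. A minimal sketch, assuming the exported file is named model.onnx:

from collections import Counter

import onnx

# Count op types in the exported graph; many Cast/Reshape/Transpose/Gather
# nodes relative to Conv/MatMul suggests exporter-inserted "glue" operators.
model = onnx.load("model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)
for op, n in op_counts.most_common(15):
    print(f"{op:20s} {n}")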
We’re having the same issues with our models – seeing a ~2x slow-down between running our models on GPU with PyTorch versus ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production… Any help / update on this issue would be greatly appreciated! I’m happy to assist in the debugging if it can help, thanks!
FYI, the above example does seem to work for us, though – we see similar speeds between the ONNX and PyTorch models there. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if that can help.
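To locate where the time actually goes, ONNX Runtime's built-in profiler can attribute latency to individual nodes. A minimal sketch, assuming a model file named model.onnx with the 'input' name and shape from the export above:

import numpy as np
import onnxruntime

# Enable ORT's built-in profiler; it writes a JSON trace with per-node
# timings that can be inspected in chrome://tracing.
so = onnxruntime.SessionOptions()
so.enable_profiling = True
sess = onnxruntime.InferenceSession("model.onnx", so,
                                    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
x = np.random.randn(1, 3, 224, 224).astype(np.float32)  # shape assumed from the export above
for _ in range(10):
    sess.run(None, {'input': x})
profile_file = sess.end_profiling()
print("Profile written to", profile_file)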
@tianleiwu Hi, thanks for the script. However, when the batch size is large, like 128, ONNX Runtime is still much slower than PyTorch.
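Since the export used a dynamic batch axis, the gap can be measured across batch sizes to see where it opens up. A small sketch, assuming the session sess and input name 'input' from the profiling snippet above:

import time

import numpy as np

# Sweep batch sizes over the dynamic 'batch_size' axis and report ms/run.
for bs in (1, 8, 32, 128):
    x = np.random.randn(bs, 3, 224, 224).astype(np.float32)
    sess.run(None, {'input': x})  # warm-up for this shape
    t0 = time.time()
    for _ in range(20):
        sess.run(None, {'input': x})
    print(f"batch {bs:4d}: {(time.time() - t0) * 1000 / 20:.2f} ms/run")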