onnxruntime: BERT performance slower than default pytorch on CPU
Describe the bug: I have exported a BERT model from Hugging Face's transformers library. CPU inference through ONNX Runtime is slower than running the same model in PyTorch:
- Batch size 1, sequence length 256: PyTorch 0.149689 s, ONNX Runtime 0.281283 s
- Batch size 8, sequence length 256: PyTorch 0.761311 s, ONNX Runtime 2.792252 s
https://github.com/huggingface/transformers/blob/master/examples/benchmarks.py#L366
Urgency: January 2020
System information
- OS Platform and Distribution: Linux Ubuntu 16.04
- ONNX Runtime installed from: binary (pip install onnxruntime)
- ONNX Runtime version: 1.1.0
- Python version: 3.7.4
- Visual Studio version (if applicable):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: CPU only
- GPU model and memory:
To Reproduce
- ONNX model conversion (adapted from https://github.com/huggingface/transformers/blob/master/examples/benchmarks.py#L334):
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name, config=config)
# ..... (input preparation elided)
torch.onnx.export(model, sequence, "bert_" + str(slice_size) + ".onnx",
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
                  verbose=True)
- ONNX model execution:
sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
dummy_input = np.random.randn(batch_size, slice_size).astype(np.int64)  # random token ids cast to int64
runtimes = timeit.repeat(lambda: sess.run(None, {'input': dummy_input}),
                         repeat=average_over, number=3)
Expected behavior: The ONNX-converted version should be faster.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 20 (16 by maintainers)
FYI, this tutorial was recently published: https://github.com/onnx/tutorials/blob/master/tutorials/Inference-PyTorch-Bert-Model-for-High-Performance-in-ONNX-Runtime.ipynb
Currently, some BERT optimizations are not enabled by default. Please enable optimization like the following and try again:
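The exact snippet from this comment is not preserved here; below is a rough sketch of enabling session-level graph optimizations through SessionOptions. The optimization level, thread count, and model file name are assumptions to adapt to your setup.

import onnxruntime as ort

# Sketch: turn on all graph optimizations and pin the intra-op thread count
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4  # assumption: tune to your physical core count
sess = ort.InferenceSession("bert_256.onnx", sess_options)  # "bert_256.onnx" is a placeholder name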
@DomHudson, @JustinMBrown,
Here are the latest Jupyter notebooks:
- BERT model for SQuAD (CPU inference)
- BERT model for SQuAD (GPU inference)
You could try them on your machine and let me know the result. Note that ONNX Runtime currently needs one run to warm up, so you should measure many runs instead of looking only at the first run.
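As a minimal sketch of that measurement pattern (the model file name, input shape, and run counts are placeholders, not taken from the notebooks):

import numpy as np
import onnxruntime as ort
import timeit

sess = ort.InferenceSession("bert_256.onnx")  # placeholder file name from the repro above
feed = {'input': np.random.randn(1, 256).astype(np.int64)}

sess.run(None, feed)  # warm-up run so one-time setup cost is excluded from timings
times = timeit.repeat(lambda: sess.run(None, feed), repeat=10, number=1)
print("best of 10 timed runs: %.4f s" % min(times))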
The model is floating point, and Nuphar works for both fp32 and int8. The speed-up mainly comes from fusing ops automatically and running element-wise ops like Erf in parallel. Quantization to int8 might give you further speed-ups.
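Int8 quantization is only mentioned in passing here; a generic offline approach (not the Nuphar-specific path) is dynamic weight quantization via onnxruntime's quantization utilities, sketched below with placeholder file names.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Sketch: quantize weights to int8 offline, then load the quantized model as usual
quantize_dynamic("bert_256.onnx", "bert_256.int8.onnx", weight_type=QuantType.QInt8)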
Another thing you may try is the Nuphar execution provider, which compiles the model for optimized inference on CPU. You may follow its tutorial on how to run the BERT model. To try it out, you can build from source or use a Docker image with prebuilt Nuphar.
You may add the following lines in Python with a Nuphar-enabled build:
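The original lines are not preserved in this archive; a sketch of selecting the provider explicitly is below. Provider availability depends on how onnxruntime was built, and on older versions sess.set_providers([...]) may be needed instead of the providers argument.

import onnxruntime as ort

# Sketch: prefer the Nuphar EP if present, falling back to the default CPU EP
sess = ort.InferenceSession(
    "bert_256.onnx",  # placeholder file name
    providers=['NupharExecutionProvider', 'CPUExecutionProvider'])
print(sess.get_providers())  # confirm which providers are actually active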
I measured a ~35% speed-up at batch size 8 on a Xeon E5-2690 v4 (dual socket, 14 cores / 28 hyper-threads per socket).