onnxruntime: BERT performance slower than default pytorch on CPU

Describe the bug
I have exported a BERT model from Hugging Face's transformers library. Running it with ONNX Runtime on CPU is slower than the PyTorch baseline:

Batch size: 1, sequence length: 256. PyTorch: 0.149689 seconds; ONNX Runtime: 0.281283 seconds

Batch size: 8, sequence length: 256. PyTorch: 0.761311 seconds; ONNX Runtime: 2.792252 seconds

Benchmark script for reference: https://github.com/huggingface/transformers/blob/master/examples/benchmarks.py#L366

Urgency: January 2020

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • ONNX Runtime installed from: binary (pip install onnxruntime)
  • ONNX Runtime version: 1.1.0
  • Python version: 3.7.4
  • CUDA/cuDNN version: CPU only

To Reproduce

import timeit
import numpy as np
import torch
import onnxruntime as ort
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name, config=config)
.....

# Export with a dynamic batch dimension; slice_size is the fixed sequence length.
torch.onnx.export(model, sequence, "bert_" + str(slice_size) + ".onnx",
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}}, verbose=True)

  • ONNX model execution

sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
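
For comparison, the PyTorch baseline can be timed the same way. This is only a sketch: it assumes the model, config, batch_size, slice_size, and average_over variables from above and uses random token ids as input; the numbers reported earlier came from the linked transformers benchmark script.

with torch.no_grad():
    # Random token ids within the model's vocabulary range.
    input_ids = torch.randint(0, config.vocab_size, (batch_size, slice_size))
    pt_runtimes = timeit.repeat(lambda: model(input_ids), repeat=average_over, number=3)
pt_average_time = sum(pt_runtimes) / float(len(pt_runtimes)) / 3.0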

Expected behavior
The ONNX-converted model should be faster than the PyTorch version.


About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 20 (16 by maintainers)

Most upvoted comments

Please enable optimization like the following and try again:

import onnxruntime as ort

so = ort.SessionOptions()
# Enable all graph optimizations (not all are on by default in this version).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)  # model is the path to the exported .onnx file

Currently, some BERT optimizations are not enabled by default.
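
With that session in place, the timing loop from the issue can be re-run unchanged; a sketch reusing the reporter's slice_size, batch_size, and average_over variables (timeit and numpy imported as before):

runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
ort_average_time = sum(runtimes) / float(len(runtimes)) / 3.0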

@DomHudson, @JustinMBrown,

Here are the latest Jupyter notebooks:

Bert model for SQuAD (CPU inference)

Bert model for SQuAD (GPU inference)

You could try them on your machine and let me know the results. Note that ONNX Runtime currently needs one run to warm up, so you should measure many runs instead of looking only at the first one.
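
For example, something along these lines separates the first (warm-up) run from the steady-state average; a sketch that assumes sess, batch_size, and slice_size from the snippets above:

import time
import numpy as np

feed = {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}

# The first run pays one-time initialization cost.
start = time.time()
sess.run([], feed)
first_run = time.time() - start

# Average many subsequent runs to get the steady-state latency.
n_runs = 100
start = time.time()
for _ in range(n_runs):
    sess.run([], feed)
steady_state = (time.time() - start) / n_runs

print("first run: %.4f s, steady state: %.4f s" % (first_run, steady_state))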

The model is floating point, and Nuphar works for both fp32 and int8. The speed-up is mainly from fusing ops automatically, and running element-wise ops like Erf in parallel. Quantization to int8 might give you more speed-ups.
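
To experiment with int8, ONNX Runtime ships dynamic-quantization tooling that produces a quantized copy of the model. This is only a sketch: the output file name is illustrative and the quantization API has moved between releases, so check the docs for your installed version.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize weights to int8; activations are quantized dynamically at runtime.
quantize_dynamic("bert_" + str(slice_size) + ".onnx",
                 "bert_" + str(slice_size) + ".int8.onnx",
                 weight_type=QuantType.QInt8)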

Another thing you may try is the Nuphar execution provider, which compiles the model for optimized inference on CPU. You can follow its tutorial on how to run a BERT model. To try it out, either build from source or use a Docker image with Nuphar prebuilt:

docker pull mcr.microsoft.com/azureml/onnxruntime:latest-nuphar

With a Nuphar-enabled build, you may add the following lines in Python:

import timeit
import numpy as np
import onnxruntime as ort
from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference

# Run symbolic shape inference on the model (overwriting it in place).
SymbolicShapeInference.infer_shapes("bert_" + str(slice_size) + ".onnx",
                                    "bert_" + str(slice_size) + ".onnx",
                                    auto_merge=True)

sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
ort_average_time = sum(runtimes) / float(len(runtimes)) / 3.0

I measured a ~35% speed-up at batch size 8, on a Xeon E5-2690v4 (dual socket, 14 cores / 28 hyper-threads per socket).