onnxruntime: BERT performance slower than default pytorch on CPU

Describe the bug
I have exported a BERT model from Hugging Face's transformers library. Running it with ONNX Runtime on CPU is slower than the PyTorch baseline:

Batch size: 1, sequence length: 256. PyTorch: 0.149689 seconds; ONNX Runtime: 0.281283 seconds

Batch size: 8, sequence length: 256. PyTorch: 0.761311 seconds; ONNX Runtime: 2.792252 seconds

Benchmark script for reference: https://github.com/huggingface/transformers/blob/master/examples/benchmarks.py#L366

Urgency: January 2020

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • ONNX Runtime installed from: binary (pip install onnxruntime)
  • ONNX Runtime version: 1.1.0
  • Python version: 3.7.4
  • CUDA/cuDNN version: CPU only

To Reproduce

import timeit
import numpy as np
import torch
import onnxruntime as ort
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name, config=config)
.....

# Export with a dynamic batch dimension; slice_size is the fixed sequence length.
torch.onnx.export(model, sequence, "bert_" + str(slice_size) + ".onnx",
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}}, verbose=True)

  • ONNX model execution

sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
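
For comparison, the PyTorch baseline can be timed the same way. This is only a sketch: it assumes the model, config, batch_size, slice_size, and average_over variables from above and uses random token ids as input; the numbers reported earlier came from the linked transformers benchmark script.

with torch.no_grad():
    # Random token ids within the model's vocabulary range.
    input_ids = torch.randint(0, config.vocab_size, (batch_size, slice_size))
    pt_runtimes = timeit.repeat(lambda: model(input_ids), repeat=average_over, number=3)
pt_average_time = sum(pt_runtimes) / float(len(pt_runtimes)) / 3.0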

Expected behavior
The ONNX-converted model should be faster than the PyTorch version.


About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 20 (16 by maintainers)

Most upvoted comments

Please enable optimization like the following and try again:

import onnxruntime as ort

so = ort.SessionOptions()
# Enable all graph optimizations (not all are on by default in this version).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession(model, so)  # model is the path to the exported .onnx file

Currently, some BERT optimizations are not enabled by default.
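
With that session in place, the timing loop from the issue can be re-run unchanged; a sketch reusing the reporter's slice_size, batch_size, and average_over variables (timeit and numpy imported as before):

runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
ort_average_time = sum(runtimes) / float(len(runtimes)) / 3.0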

@DomHudson, @JustinMBrown,

Here are the latest Jupyter notebooks:

Bert model for SQuAD (CPU inference)

Bert model for SQuAD (GPU inference)

You could try them on your machine and let me know the results. Note that ONNX Runtime currently needs one run to warm up, so you should measure many runs instead of looking only at the first one.
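
For example, something along these lines separates the first (warm-up) run from the steady-state average; a sketch that assumes sess, batch_size, and slice_size from the snippets above:

import time
import numpy as np

feed = {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}

# The first run pays one-time initialization cost.
start = time.time()
sess.run([], feed)
first_run = time.time() - start

# Average many subsequent runs to get the steady-state latency.
n_runs = 100
start = time.time()
for _ in range(n_runs):
    sess.run([], feed)
steady_state = (time.time() - start) / n_runs

print("first run: %.4f s, steady state: %.4f s" % (first_run, steady_state))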

The model is floating point, and Nuphar works for both fp32 and int8. The speed-up is mainly from fusing ops automatically, and running element-wise ops like Erf in parallel. Quantization to int8 might give you more speed-ups.
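
To experiment with int8, ONNX Runtime ships dynamic-quantization tooling that produces a quantized copy of the model. This is only a sketch: the output file name is illustrative and the quantization API has moved between releases, so check the docs for your installed version.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize weights to int8; activations are quantized dynamically at runtime.
quantize_dynamic("bert_" + str(slice_size) + ".onnx",
                 "bert_" + str(slice_size) + ".int8.onnx",
                 weight_type=QuantType.QInt8)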

Another thing you may try is the Nuphar execution provider, which compiles the model for optimized inference on CPU. You can follow its tutorial on how to run a BERT model. To try it out, either build from source or use a Docker image with Nuphar prebuilt:

docker pull mcr.microsoft.com/azureml/onnxruntime:latest-nuphar

With a Nuphar-enabled build, you may add the following lines in Python:

import timeit
import numpy as np
import onnxruntime as ort
from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference

# Run symbolic shape inference on the model (overwriting it in place).
SymbolicShapeInference.infer_shapes("bert_" + str(slice_size) + ".onnx",
                                    "bert_" + str(slice_size) + ".onnx",
                                    auto_merge=True)

sess = ort.InferenceSession("bert_" + str(slice_size) + ".onnx")
runtimes = timeit.repeat(
    lambda: sess.run([], {'input': np.random.randn(batch_size, slice_size).astype(np.longlong)}),
    repeat=average_over, number=3)
ort_average_time = sum(runtimes) / float(len(runtimes)) / 3.0

I measured a ~35% speed-up at batch size 8, on a Xeon E5-2690v4 (dual socket, 14 cores / 28 hyper-threads per socket).