optimum: ORT whisper on CUDAExecutionProvider is slower than PyTorch

System Info

optimum 1.7.1

Who can help?

@lewtun , @michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The exported Whisper ONNX decoder model has the encoder key/values (encoder.value) among its outputs. These tensors are constant during the decoding stage, so copying them out through Identity nodes at every step is very heavy and makes performance much worse.
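The extra outputs are easy to see on the exported graph. A minimal sketch, assuming the model was exported to ./whisper_onnx and that the outputs follow the usual present.*.encoder.* naming of optimum's seq2seq export:

# Sketch: list the decoder graph outputs to confirm the constant encoder key/values are there.
import onnx

decoder = onnx.load("whisper_onnx/decoder_model.onnx", load_external_data=False)
for out in decoder.graph.output:
    print(out.name)
# Outputs named like present.0.encoder.key / present.0.encoder.value are the encoder
# key/values being copied out (via Identity nodes) at every decoding step.

The script used to reproduce the slowdown: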

import onnxruntime
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

whisper_model_name = "openai/whisper-small"  # e.g. the model benchmarked later in this thread
processor = AutoProcessor.from_pretrained(whisper_model_name)

# Export the PyTorch model to ONNX on the fly and enable ORT profiling
session_options = onnxruntime.SessionOptions()
session_options.enable_profiling = True
model_ort = ORTModelForSpeechSeq2Seq.from_pretrained(
    whisper_model_name,
    from_transformers=True,
    use_io_binding=True,
    session_options=session_options,
)
generator_ort = pipeline(
    task="automatic-speech-recognition",
    model=model_ort,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    # batch_size=2,
    # device=0,
    chunk_length_s=30,
    stride_length_s=(1, 1),  # required when chunk_length_s is set
    generate_kwargs={"max_new_tokens": 1024},
)
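For reference, a hedged example of how the pipeline above can be exercised and timed ("sample.wav" is a placeholder for any local audio file, not something from this issue):

# Hypothetical usage of the pipeline defined above.
import time

start = time.perf_counter()
result = generator_ort("sample.wav")  # placeholder path to a local audio file
print(result["text"])
print(f"inference time: {time.perf_counter() - start:.3f} s")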

Expected behavior

Do not output the encoder key/values from the decoder model.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 2
  • Comments: 15 (2 by maintainers)

Most upvoted comments

@fxmarty, we are working on fusion for the Whisper model. Our internal benchmarks show good performance with fusion. Will keep you posted.

I think this might be connected to an issue I am currently having with onnxruntime-web: https://discuss.huggingface.co/t/when-exporting-seq2seq-models-with-onnx-why-do-we-need-both-decoder-with-past-model-onnx-and-decoder-model-onnx/33354

For some reason, if I pass an empty tensor of the correct shape (e.g., [batch, heads, 0, dim]) - which is completely valid for the PyTorch implementation - the ONNX decoder returns an empty tensor for the present key values, whereas the PyTorch implementation produces the correct output.

To preemptively answer the question of why I would pass empty tensors to the decoder: it would allow bypassing decoder_model.onnx entirely, meaning one would not need to export both decoder_model.onnx and decoder_with_past_model.onnx (which would make my Transformers.js library much more efficient!)
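A minimal PyTorch-side sketch of the trick described above, assuming openai/whisper-tiny and a transformers version (tuple-style cache) that accepts empty past key values as the comment says; everything besides the model id is illustrative:

# Pass zero-length past key/values of shape [batch, heads, 0, dim] to the PyTorch
# Whisper decoder; it treats them as "no past yet" and still returns correct,
# non-empty present key/values.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
cfg = model.config
heads = cfg.decoder_attention_heads
dim = cfg.d_model // heads

# One (self key, self value, cross key, cross value) tuple per decoder layer, all empty.
empty_past = tuple(
    tuple(torch.zeros(1, heads, 0, dim) for _ in range(4))
    for _ in range(cfg.decoder_layers)
)

with torch.no_grad():
    encoder_out = model.model.encoder(torch.zeros(1, cfg.num_mel_bins, 3000))
    out = model.model.decoder(
        input_ids=torch.tensor([[cfg.decoder_start_token_id]]),
        encoder_hidden_states=encoder_out.last_hidden_state,
        past_key_values=empty_past,
        use_cache=True,
    )
print([t.shape for t in out.past_key_values[0]])  # non-empty present key/values

The exported ONNX decoder, in contrast, reportedly returns empty present key values for such inputs.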

Hi @yufenglee, #872 should fix the issue. I would recommend using the CLI optimum-cli export onnx so the model is not re-exported at every run.
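For reference, a sketch of that workflow (the output directory name is illustrative): export once from the shell, then point from_pretrained at the exported folder so nothing is converted at load time.

# Export once (shell):
#   optimum-cli export onnx --model openai/whisper-small whisper_small_onnx
# Then load the already-exported folder instead of passing from_transformers=True;
# provider, session_options and use_io_binding can be passed here as in the script above.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_ort = ORTModelForSpeechSeq2Seq.from_pretrained("whisper_small_onnx")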

Thank you @hannan72, this is helpful to investigate.

Results for an input of shape (1, 80, 3000), on openai/whisper-small:

Framework                                        Inference time (s)
PyTorch 1.13.1 (eager), cuda                     0.321
ORT + CUDAExecutionProvider + IOBinding (new)    0.388
ORT + CUDAExecutionProvider + IOBinding (old)    0.455

Framework                                        Inference time (s)
PyTorch 1.13.1 (eager), cpu                      2.405
ORT + CPUExecutionProvider (new)                 2.133
ORT + CPUExecutionProvider (old)                 3.287

GPU: GeForce RTX 3060 Mobile; CPU: i7-1280P

So it seems we are still slower than PyTorch on CUDAExecutionProvider. @yufenglee, don't hesitate to chime in if you have any suggestions for that.
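For anyone trying to reproduce these numbers, a rough sketch of the kind of benchmark implied above (my assumption of the setup; the fp32 export directory name is illustrative):

# Time generate() on a dummy (1, 80, 3000) input, PyTorch eager vs. ORT + CUDAExecutionProvider.
import time
import torch
from transformers import WhisperForConditionalGeneration
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

features = torch.randn(1, 80, 3000)

pt_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_small_onnx", provider="CUDAExecutionProvider", use_io_binding=True
)

for name, model in [("pytorch cuda", pt_model), ("ort cuda", ort_model)]:
    model.generate(input_features=features.to("cuda"))  # warmup
    start = time.perf_counter()
    model.generate(input_features=features.to("cuda"))
    print(name, f"{time.perf_counter() - start:.3f} s")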

Using the fp16 export (optimum-cli export onnx --device cuda --fp16 --model openai/whisper-small whisper_small_new), I get:

Framework                                              Inference time (s)
PyTorch 1.13.1 (eager), cuda, fp16                     0.204
ORT + CUDAExecutionProvider + IOBinding (new, fp16)    0.282
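Presumably the fp16 export is then loaded along these lines (my assumption, not from the thread; the kwargs mirror the earlier script):

# Load the fp16 export produced by the optimum-cli command above.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_fp16 = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_small_new", provider="CUDAExecutionProvider", use_io_binding=True
)
# The PyTorch fp16 baseline would be the eager model cast to half precision on cuda.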