optimum: ORT whisper on CUDAExecutionProvider is slower than PyTorch

System Info

optimum 1.7.1

Who can help?

@lewtun , @michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

The exported Whisper ONNX decoder model has the encoder key/values (encoder.value) among its outputs. These tensors are constant during the decoding stage, so copying them out through Identity nodes at every step is very heavy and makes performance much worse.
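The extra outputs are easy to see on the exported graph. A minimal sketch, assuming the model was exported to ./whisper_onnx and that the outputs follow the usual present.*.encoder.* naming of optimum's seq2seq export:

# Sketch: list the decoder graph outputs to confirm the constant encoder key/values are there.
import onnx

decoder = onnx.load("whisper_onnx/decoder_model.onnx", load_external_data=False)
for out in decoder.graph.output:
    print(out.name)
# Outputs named like present.0.encoder.key / present.0.encoder.value are the encoder
# key/values being copied out (via Identity nodes) at every decoding step.

The script used to reproduce the slowdown: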

import onnxruntime
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

whisper_model_name = "openai/whisper-small"  # e.g. the model benchmarked later in this thread
processor = AutoProcessor.from_pretrained(whisper_model_name)

# Export the PyTorch model to ONNX on the fly and enable ORT profiling
session_options = onnxruntime.SessionOptions()
session_options.enable_profiling = True
model_ort = ORTModelForSpeechSeq2Seq.from_pretrained(
    whisper_model_name,
    from_transformers=True,
    use_io_binding=True,
    session_options=session_options,
)
generator_ort = pipeline(
    task="automatic-speech-recognition",
    model=model_ort,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    # batch_size=2,
    # device=0,
    chunk_length_s=30,
    stride_length_s=(1, 1),  # required when chunk_length_s is set
    generate_kwargs={"max_new_tokens": 1024},
)
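For reference, a hedged example of how the pipeline above can be exercised and timed ("sample.wav" is a placeholder for any local audio file, not something from this issue):

# Hypothetical usage of the pipeline defined above.
import time

start = time.perf_counter()
result = generator_ort("sample.wav")  # placeholder path to a local audio file
print(result["text"])
print(f"inference time: {time.perf_counter() - start:.3f} s")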

Expected behavior

Do not output the encoder key/values from the decoder model.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 2
  • Comments: 15 (2 by maintainers)

Most upvoted comments

@fxmarty, we are working on fusion for the Whisper model. Our internal benchmarks show good performance with fusion. Will keep you posted.

I think this might be connected to an issue I am currently having with onnxruntime-web: https://discuss.huggingface.co/t/when-exporting-seq2seq-models-with-onnx-why-do-we-need-both-decoder-with-past-model-onnx-and-decoder-model-onnx/33354

For some reason, if I pass an empty tensor of the correct shape (e.g., [batch, heads, 0, dim]) - which is completely valid for the PyTorch implementation - the ONNX decoder returns an empty tensor for the present key values, whereas the PyTorch implementation produces the correct output.

To preemptively answer the question of why I would pass empty tensors to the decoder: it would allow bypassing decoder_model.onnx entirely, meaning one would not need to export both decoder_model.onnx and decoder_with_past_model.onnx (which would make my Transformers.js library much more efficient!)
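A minimal PyTorch-side sketch of the trick described above, assuming openai/whisper-tiny and a transformers version (tuple-style cache) that accepts empty past key values as the comment says; everything besides the model id is illustrative:

# Pass zero-length past key/values of shape [batch, heads, 0, dim] to the PyTorch
# Whisper decoder; it treats them as "no past yet" and still returns correct,
# non-empty present key/values.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
cfg = model.config
heads = cfg.decoder_attention_heads
dim = cfg.d_model // heads

# One (self key, self value, cross key, cross value) tuple per decoder layer, all empty.
empty_past = tuple(
    tuple(torch.zeros(1, heads, 0, dim) for _ in range(4))
    for _ in range(cfg.decoder_layers)
)

with torch.no_grad():
    encoder_out = model.model.encoder(torch.zeros(1, cfg.num_mel_bins, 3000))
    out = model.model.decoder(
        input_ids=torch.tensor([[cfg.decoder_start_token_id]]),
        encoder_hidden_states=encoder_out.last_hidden_state,
        past_key_values=empty_past,
        use_cache=True,
    )
print([t.shape for t in out.past_key_values[0]])  # non-empty present key/values

The exported ONNX decoder, in contrast, reportedly returns empty present key values for such inputs.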

Hi @yufenglee, #872 should fix the issue. I would recommend using the CLI optimum-cli export onnx so the model is not re-exported at every run.
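For reference, a sketch of that workflow (the output directory name is illustrative): export once from the shell, then point from_pretrained at the exported folder so nothing is converted at load time.

# Export once (shell):
#   optimum-cli export onnx --model openai/whisper-small whisper_small_onnx
# Then load the already-exported folder instead of passing from_transformers=True;
# provider, session_options and use_io_binding can be passed here as in the script above.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_ort = ORTModelForSpeechSeq2Seq.from_pretrained("whisper_small_onnx")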

Thank you @hannan72, this is helpful to investigate.

Results for an input of shape (1, 80, 3000), on openai/whisper-small:

Framework                                        Inference time (s)
PyTorch 1.13.1 (eager), cuda                     0.321
ORT + CUDAExecutionProvider + IOBinding (new)    0.388
ORT + CUDAExecutionProvider + IOBinding (old)    0.455

Framework                                        Inference time (s)
PyTorch 1.13.1 (eager), cpu                      2.405
ORT + CPUExecutionProvider (new)                 2.133
ORT + CPUExecutionProvider (old)                 3.287

GPU: GeForce RTX 3060 Mobile; CPU: i7-1280P

So it seems we are still slower than PyTorch on CUDAExecutionProvider. @yufenglee, don't hesitate to chime in if you have any suggestions for that.
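For anyone trying to reproduce these numbers, a rough sketch of the kind of benchmark implied above (my assumption of the setup; the fp32 export directory name is illustrative):

# Time generate() on a dummy (1, 80, 3000) input, PyTorch eager vs. ORT + CUDAExecutionProvider.
import time
import torch
from transformers import WhisperForConditionalGeneration
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

features = torch.randn(1, 80, 3000)

pt_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_small_onnx", provider="CUDAExecutionProvider", use_io_binding=True
)

for name, model in [("pytorch cuda", pt_model), ("ort cuda", ort_model)]:
    model.generate(input_features=features.to("cuda"))  # warmup
    start = time.perf_counter()
    model.generate(input_features=features.to("cuda"))
    print(name, f"{time.perf_counter() - start:.3f} s")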

Using the fp16 export (optimum-cli export onnx --device cuda --fp16 --model openai/whisper-small whisper_small_new), I get:

Framework                                              Inference time (s)
PyTorch 1.13.1 (eager), cuda, fp16                     0.204
ORT + CUDAExecutionProvider + IOBinding (new, fp16)    0.282
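Presumably the fp16 export is then loaded along these lines (my assumption, not from the thread; the kwargs mirror the earlier script):

# Load the fp16 export produced by the optimum-cli command above.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_fp16 = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_small_new", provider="CUDAExecutionProvider", use_io_binding=True
)
# The PyTorch fp16 baseline would be the eager model cast to half precision on cuda.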