optimum: ORT whisper on CUDAExecutionProvider is slower than PyTorch
System Info
optimum 1.7.1
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The exported Whisper ONNX decoder model has the encoder key/values (encoder.value) among its outputs. These tensors are constant during the decoding stage, so copying them back out through Identity nodes at every step is very heavy and makes the performance much worse.
import onnxruntime
from transformers import WhisperProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

whisper_model_name = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(whisper_model_name)

# Enable profiling to see where ONNX Runtime spends its time.
session_options = onnxruntime.SessionOptions()
session_options.enable_profiling = True

model_ort = ORTModelForSpeechSeq2Seq.from_pretrained(
    whisper_model_name,
    from_transformers=True,
    use_io_binding=True,
    session_options=session_options,
)
generator_ort = pipeline(
    task="automatic-speech-recognition",
    model=model_ort,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    # batch_size=2,
    # device=0,
    chunk_length_s=30,
    stride_length_s=(1, 1),  # required when chunk_length_s is set
    generate_kwargs={"max_new_tokens": 1024},
)
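To confirm what ends up in the exported decoder graph, here is a minimal inspection sketch. It assumes the model has already been exported to a local directory (hypothetically named whisper_onnx) and that optimum's present.*.encoder.* output naming convention applies:

import onnx

# Hypothetical path to the exported decoder; adjust to your export directory.
decoder = onnx.load("whisper_onnx/decoder_model.onnx")

# List every graph output; the cross-attention caches show up as outputs
# with names like present.0.encoder.key / present.0.encoder.value.
for output in decoder.graph.output:
    print(output.name)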
Expected behavior
Do not output the encoder key/values from the decoder model; they are constant during decoding and only need to be produced once.
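As a rough illustration only (not the actual fix tracked in #872, and not a drop-in workaround, since ORTModelForSpeechSeq2Seq would still expect those outputs), the desired graph is one where the constant encoder caches are simply not graph outputs. The path and the present.*.encoder.* naming below are assumptions:

import onnx

# Hypothetical path; output names assumed to follow optimum's conventions.
model = onnx.load("whisper_onnx/decoder_with_past_model.onnx")

# Drop the cross-attention (encoder) cache outputs, which never change during decoding.
for i in reversed(range(len(model.graph.output))):
    if ".encoder." in model.graph.output[i].name:
        del model.graph.output[i]

onnx.save(model, "whisper_onnx/decoder_with_past_model_pruned.onnx")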
About this issue
- State: open
- Created a year ago
- Reactions: 2
- Comments: 15 (2 by maintainers)
@fxmarty, we are working on the fusion of the Whisper model. Our internal benchmark shows we get good performance with the fusion. Will keep you posted.
I think this might be connected to an issue I am currently having with onnxruntime-web: https://discuss.huggingface.co/t/when-exporting-seq2seq-models-with-onnx-why-do-we-need-both-decoder-with-past-model-onnx-and-decoder-model-onnx/33354
For some reason, if I use an empty tensor of the correct shape (e.g., [batch, heads, 0, dim]) - which is completely valid for the PyTorch implementation - it returns an empty tensor for the present key values (whereas the PyTorch implementation produces the correct output).
To preemptively answer the question as to why I would pass empty tensors to the decoder, it should bypass the decoder_model.onnx, meaning one would not need to export both decoder_model.onnx and decoder_model_with_past.onnx (which would make my Transformers.js library much more efficient!)
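For context, a minimal sketch of the PyTorch call being described here, with zero-length past key/values of shape [batch, heads, 0, dim]. It assumes a transformers version whose Whisper attention falls back to recomputing the cross-attention keys/values when the cached ones do not match the encoder sequence length, which is what this comment relies on:

import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()
cfg = model.config
heads = cfg.decoder_attention_heads
dim = cfg.d_model // heads

# Zero-length past for every decoder layer: (self_key, self_value, cross_key, cross_value).
empty = torch.zeros(1, heads, 0, dim)
past = tuple((empty, empty, empty, empty) for _ in range(cfg.decoder_layers))

with torch.no_grad():
    encoder_out = model.model.encoder(torch.zeros(1, 80, 3000))
    decoder_out = model.model.decoder(
        input_ids=torch.tensor([[cfg.decoder_start_token_id]]),
        encoder_hidden_states=encoder_out.last_hidden_state,
        past_key_values=past,
        use_cache=True,
    )

# In eager PyTorch the returned present key/values are non-empty.
print([t.shape for t in decoder_out.past_key_values[0]])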
Hi @yufenglee , #872 should fix the issue. I would recommend using the CLI
optimum-cli export onnx
to avoid exporting at every run (see the loading sketch below).
Thank you @hannan72, this is helpful to investigate.
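As a concrete example of the export-once-then-load workflow recommended above (sketch; the output directory name whisper_onnx is a placeholder):

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# One-time export on the command line, e.g.:
#   optimum-cli export onnx --model openai/whisper-small whisper_onnx
# then load the exported files directly, so nothing is re-exported at run time.
model_ort = ORTModelForSpeechSeq2Seq.from_pretrained("whisper_onnx", use_io_binding=True)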
Results for an input of shape (1, 80, 3000), on openai/whisper-small:
GPU: GeForce RTX 3060 Mobile
CPU: i7-1280P
So it seems we are still slower than PyTorch on CUDAExecutionProvider. @yufenglee, don't hesitate if you have any suggestions for that.
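For anyone who wants to reproduce this kind of comparison, here is a rough timing sketch (not the exact benchmark above; the dummy zero input should be replaced with real log-mel features for meaningful numbers):

import time

import torch
from transformers import WhisperForConditionalGeneration
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_name = "openai/whisper-small"
features = torch.zeros(1, 80, 3000)  # same input shape as reported above

pt_model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda").eval()
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    from_transformers=True,
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)

def bench(model, inputs, n=10):
    # One warmup call, then average the end-to-end generate() latency over n runs.
    model.generate(inputs, max_new_tokens=128)
    start = time.perf_counter()
    for _ in range(n):
        model.generate(inputs, max_new_tokens=128)
    return (time.perf_counter() - start) / n

print("PyTorch CUDA:", bench(pt_model, features.to("cuda")))
print("ORT CUDA:", bench(ort_model, features.to("cuda")))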
Using the fp16 export (
optimum-cli export onnx --device cuda --fp16 --model openai/whisper-small whisper_small_new
) I get: