transformers: [Bug] Whisper pipeline inference bug on transformers master branch

System Info

OS: Ubuntu 20.04

transformers version: master branch, installed with pip install git+https://github.com/huggingface/transformers

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Run the following code:

import transformers
from packaging.version import Version
import pathlib


def whisper_pipeline():
    task = "automatic-speech-recognition"
    architecture = "openai/whisper-tiny"
    model = transformers.WhisperForConditionalGeneration.from_pretrained(architecture)
    tokenizer = transformers.WhisperTokenizer.from_pretrained(architecture)
    feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
    if Version(transformers.__version__) > Version("4.30.2"):
        # Newer transformers versions need alignment heads set on the
        # generation config to produce word-level timestamps.
        model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
    return transformers.pipeline(
        task=task, model=model, tokenizer=tokenizer, feature_extractor=feature_extractor
    )

def raw_audio_file():
    # The dataset file comes from https://github.com/mlflow/mlflow/blob/master/tests/datasets/apollo11_launch.wav
    datasets_path = "/path/to/apollo11_launch.wav"
    return pathlib.Path(datasets_path).read_bytes()


inference_config = {
    "return_timestamps": "word",
    "chunk_length_s": 60,
    "batch_size": 16,
}
whisper = whisper_pipeline()
raw_audio_file_data = raw_audio_file()
prediction = whisper(raw_audio_file_data, **inference_config)

The last line raises an error like:

>>> prediction = whisper(raw_audio_file_data, return_timestamps="word", chunk_length_s=60, batch_size=16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 356, in __call__
    return super().__call__(inputs, **kwargs)
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1132, in __call__
    return next(
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/weichen.xu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 551, in _forward
    generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length
TypeError: unsupported operand type(s) for //: 'tuple' and 'int'

Note that this error only happens on the transformers GitHub master branch. With the released version, the above code works fine.
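
For what it's worth, here is a minimal sketch of the failure mode. The stride format and the concrete values are my assumptions read off the traceback, not taken from the pipeline source: with batching, stride appears to arrive as a list of per-chunk tuples, so stride[0] is a tuple rather than an int.

# Minimal sketch of the failure, assuming the batched pipeline passes `stride`
# as a list of per-chunk (chunk_len, stride_left, stride_right) tuples.
# hop_length=160 is the WhisperFeatureExtractor default; other values are illustrative.
hop_length = 160

# Unbatched: stride is a single tuple, so stride[0] is an int and // works.
stride = (480000, 0, 8000)
num_frames = stride[0] // hop_length  # 3000

# Batched: stride is a list of tuples, so stride[0] is itself a tuple.
stride = [(480000, 0, 8000), (480000, 8000, 8000)]
num_frames = stride[0] // hop_length  # TypeError: unsupported operand type(s) for //: 'tuple' and 'int'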

Expected behavior

My example code should not raise an error.

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 3
  • Comments: 17 (4 by maintainers)

Most upvoted comments

Also having the same issue; any update on this, @sanchit-gandhi?

@josebruzzoni

I've had the same issue. @WeichenXu123's replies were very helpful, thanks man!

First, try setting the batch size to 1 if that's not a problem for you.
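
For example, re-running the reproduction call with batch_size reduced to 1 and everything else unchanged:

prediction = whisper(raw_audio_file_data, return_timestamps="word", chunk_length_s=60, batch_size=1)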

Second, you can try opening the file that the error message points to, three rows from the end of the traceback. For me it says "/home/nofreewill/.local/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 552, in _forward. So I opened it, went to line 552, and changed the line according to @WeichenXu123's suggestion, from

generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length

to

generate_kwargs["num_frames"] = stride[0][0] // self.feature_extractor.hop_length

And now it works with batch size > 1 as well.
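
Before editing site-packages by hand, you can check whether your installed copy still contains the affected expression. A small diagnostic sketch; the substring check is my own heuristic, not an official API:

import inspect
from transformers.pipelines.automatic_speech_recognition import AutomaticSpeechRecognitionPipeline

# Look for the buggy expression in the installed pipeline source.
src = inspect.getsource(AutomaticSpeechRecognitionPipeline._forward)
if "stride[0] // self.feature_extractor.hop_length" in src:
    print("installed version still contains the affected line")
else:
    print("affected line not found; the install may already be fixed")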

Thanks for the ping. My hunch is that this is due to batch_size being larger than 1. Just to confirm, does the same thing happen if you remove that argument?

Yes, it only happens when batch_size > 1.