transformers: Whisper breaks on poor quality speech audio

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.19.0-32-generic-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Reproduction

I don’t know whether this is a bug, but it is definitely not the behaviour I expected. I also saw a thread describing the same behaviour, where @Narsil said that “the model is repeating itself”, but I can’t find it right now; I’ll update the issue when I do.

To transcribe an audio file, I’m using a script that I found in one of the threads here on GitHub (link):

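from transformers import WhisperProcessor, WhisperForConditionalGeneration
import audio2numpy

# Load the pretrained processor and model, and force the "transcribe" task.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(task="transcribe")

# `file` is the path to the audio file; the returned rate `sr` is not used below.
input_speech, sr = audio2numpy.audio_from_file(file)

# Features are computed assuming 16 kHz audio; repetition_penalty=1 applies no penalty.
input_features = processor(input_speech, return_tensors="pt", sampling_rate=16000).input_features
predicted_ids = model.generate(input_features, max_length=model.config.max_length, repetition_penalty=1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)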
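
The script passes repetition_penalty=1, which leaves the scores untouched. As a minimal sketch of why, here is the usual form of a repetition penalty written in plain Python: scores of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. The helper name apply_repetition_penalty and the toy values are hypothetical and only illustrate the idea; this is not the library's own code.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], generated_ids: Sequence[int], penalty: float
) -> List[float]:
    """Return a copy of `logits` with already-generated token ids penalized."""
    out = list(logits)
    for token_id in set(generated_ids):
        score = out[token_id]
        # Positive scores are divided and negative ones multiplied, so a
        # penalty above 1 always makes the token less likely.
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out


logits = [2.0, -1.0, 0.5]  # toy scores for a 3-token vocabulary
history = [0, 1]           # token ids 0 and 1 were already generated

print(apply_repetition_penalty(logits, history, 1.0))  # penalty 1: no change
print(apply_repetition_penalty(logits, history, 2.0))  # tokens 0 and 1 pushed down
```

With a penalty of 1 both branches return the score unchanged, so a value above 1 would be needed for the setting to discourage repeats at all.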

It works fine when the speech is clear and the utterance is detected well, but when the speaker talks fast or indistinctly, or when there is little silence in the audio, the transcription degenerates into a repeated word. It looks like: “Привіт, хороша погода але але але але але але але але але але але але але але але але” (roughly: “Hi, nice weather but but but …”). So far I have only used Ukrainian files, so I don’t know whether it happens in other languages.
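
To find affected files in a larger batch, a simple check on the decoded text may be enough. The sketch below counts how many times the last word repeats at the end of a transcription; the helper names and the threshold of 4 are assumptions chosen for illustration, and the check only catches single-word loops like the one shown above.

```python
def trailing_repeats(text: str) -> int:
    """Count how many times the final whitespace-separated word repeats at the end."""
    words = text.split()
    if not words:
        return 0
    count = 0
    for word in reversed(words):
        if word != words[-1]:
            break
        count += 1
    return count


def looks_degenerate(text: str, threshold: int = 4) -> bool:
    """Flag a transcription whose last word repeats at least `threshold` times."""
    return trailing_repeats(text) >= threshold


sample = "Привіт, хороша погода але але але але"
print(trailing_repeats(sample), looks_degenerate(sample))
```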

Expected behavior

The whole audio file is transcribed without the output breaking down.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (9 by maintainers)

Most upvoted comments

Without being able to reproduce it, it’s really hard to debug. Could you dive down to the level of logits and look for any potential differences?

I’m pretty sure it should come down to a configuration difference in the end, but if we can’t reproduce, it’d be hard to understand.
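
One way to compare runs at the level of logits is to dump the per-step scores from each environment and look for the first step where they disagree. The sketch below assumes the scores have already been converted to nested lists of floats (for example from the `scores` returned by `generate(..., output_scores=True, return_dict_in_generate=True)`, via `.tolist()`); the function name first_divergent_step and the tolerance are hypothetical.

```python
from typing import Optional, Sequence


def first_divergent_step(
    scores_a: Sequence[Sequence[float]],
    scores_b: Sequence[Sequence[float]],
    atol: float = 1e-4,
) -> Optional[int]:
    """Return the first decoding step whose scores differ by more than `atol`."""
    for step, (row_a, row_b) in enumerate(zip(scores_a, scores_b)):
        if len(row_a) != len(row_b):
            return step  # vocabulary sizes differ at this step
        if any(abs(a - b) > atol for a, b in zip(row_a, row_b)):
            return step
    # All shared steps match; report a length mismatch if one run is longer.
    if len(scores_a) != len(scores_b):
        return min(len(scores_a), len(scores_b))
    return None


run_a = [[0.1, 0.9], [0.3, 0.7]]  # toy scores: 2 steps, 2-token vocabulary
run_b = [[0.1, 0.9], [0.3, 0.2]]
print(first_divergent_step(run_a, run_b))
```

If the two runs already diverge at step 0, the difference is more likely in the input features or the configuration than in the decoding loop.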

Okay, thank you for the suggestions. I’ll try to look into it, and I’ll also try to find an audio file I can share with you.