transformers: Whisper breaks on poor quality speech audio
System Info
- transformers version: 4.26.1
- Platform: Linux-5.19.0-32-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1+cu117 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I don’t know if it’s a bug, but it’s definitely not the behaviour I expected. I also saw a thread describing this behaviour where @Narsil said that “the model is repeating itself”, but I can’t find it right now; I’ll update the issue when I do.
To transcribe an audio file I’m using a script that I found in one of the threads here on GitHub (link):
import audio2numpy
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(task="transcribe")
# `file` is the path to the audio file; the audio is assumed to be sampled at 16 kHz
input_speech, sr = audio2numpy.audio_from_file(file)
input_features = processor(input_speech, return_tensors="pt", sampling_rate=16000).input_features
# repetition_penalty=1 means no penalty is applied
predicted_ids = model.generate(input_features, max_length=model.config.max_length, repetition_penalty=1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
It works fine when the speech is clear and the utterance is detected well, but when the speaker talks fast or not clearly enough, or when there is little silence in the audio, the transcription degenerates into one word repeated over and over. It looks like: “Привіт, хороша погода але але але але але але але але але але але але але але але але” (roughly “Hi, nice weather but but but but …”). So far I’ve only used Ukrainian files, so I don’t know whether it happens in other languages.
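To make the failure mode above easier to spot across many files, a small post-processing check can flag transcriptions that end in a repeated word. This is only a minimal sketch: the helper names `trailing_repeat_count`, `looks_degenerate`, and `collapse_trailing_repeats` and the threshold of 4 are hypothetical, not part of transformers, and the check only catches single-word loops like the one quoted above, not repeated phrases.

```python
def trailing_repeat_count(text: str) -> int:
    """Count how many times the last word repeats consecutively at the end of `text`."""
    words = text.split()
    if not words:
        return 0
    last = words[-1]
    count = 0
    # Walk backwards until a different word is found.
    for word in reversed(words):
        if word != last:
            break
        count += 1
    return count


def looks_degenerate(text: str, threshold: int = 4) -> bool:
    """Flag a transcription whose final word repeats at least `threshold` times (assumed cutoff)."""
    return trailing_repeat_count(text) >= threshold


def collapse_trailing_repeats(text: str) -> str:
    """Keep a single copy of the repeated final word; whitespace is normalized to single spaces."""
    words = text.split()
    while len(words) >= 2 and words[-1] == words[-2]:
        words.pop()
    return " ".join(words)
```

Collapsing the repeats only hides the symptom; the audio after the point where the loop started is still not transcribed, so flagged files would need to be decoded again with different settings.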
Expected behavior
The text is recognized across the whole audio file without breaking down into repetitions.
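One thing worth ruling out when expecting a transcription of the whole file: by default the Whisper feature extractor pads or truncates its input to a 30-second window, so a single `generate` call on a longer file only sees the first window. A rough sketch of splitting the samples into consecutive windows is below. The helper name `split_into_windows` is hypothetical, and it assumes mono samples at the given sampling rate. Hard cuts like this can split a word in half, so this illustrates the idea rather than a recommended chunking strategy.

```python
def split_into_windows(samples, sampling_rate=16000, window_seconds=30):
    """Split a 1-D sequence of samples into consecutive windows of at most `window_seconds`."""
    window = sampling_rate * window_seconds
    # The last window may be shorter than the others.
    return [samples[i:i + window] for i in range(0, len(samples), window)]
```

Each window could then be passed through the processor and `generate` separately, and the decoded pieces joined in order.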
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 27 (9 by maintainers)
Okay, thank you for the suggestions. I’ll look into them, and I’ll also try to find an audio file I can share with you.