transformers: Seamless M4T-v2 Inference bug when using chunk_length_s parameter
System Info
- Ubuntu 22
- Python 3.12
- Latest Transformers
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I know the chunking parameter is experimental, but it completely falls apart in the case of M4T-v2, while it works far better with Whisper. Would it be possible to disable it or print a big fat warning when it is used? To reproduce:
- Use the Seamless M4T-v2 model in a transformers pipeline.
- Use the transcribe task (ASR) for any language you like.
- Use an audio file longer than 30 seconds.
- Set the `chunk_length_s` parameter to any value, e.g. 30.
- Compare the score to a transcription without setting `chunk_length_s`.
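The steps above can be sketched roughly like this. This is my own hypothetical helper (the `transcribe` function name and its arguments are not from the issue); it assumes the `facebook/seamless-m4t-v2-large` checkpoint and that the ASR pipeline accepts a `tgt_lang` generation argument for this model:

```python
# Hypothetical reproduction sketch; function name, argument names, and the
# checkpoint id are assumptions, not taken verbatim from the issue.
import sys


def transcribe(audio_path, chunk_length_s=None, tgt_lang="eng"):
    """Run ASR with Seamless M4T-v2; pass chunk_length_s to reproduce the bug."""
    # Imported lazily so the sketch can be loaded without the model installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="facebook/seamless-m4t-v2-large",
        chunk_length_s=chunk_length_s,           # e.g. 30 -> degraded output
        generate_kwargs={"tgt_lang": tgt_lang},  # M4T needs a target language
    )
    return asr(audio_path)["text"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Compare chunked vs. unchunked transcription of an audio file > 30 s.
    print("chunked:  ", transcribe(sys.argv[1], chunk_length_s=30))
    print("unchunked:", transcribe(sys.argv[1]))
```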
The WER scores are usually 4-5 times worse than you would expect. It is basically unusable at the moment, which makes M4T-v2 unusable for ASR unless your files are very short.
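For reference, WER (word error rate) is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch (the example strings are illustrative, not from the issue):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # d[j] is the old row's value, d[j-1] the new row's, prev the old diagonal.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)


# One substitution out of four reference words -> WER 0.25
print(wer("a b c d", "a x c d"))  # 0.25
```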
Expected behavior
Just chunk the audio and get approximately the same score, if possible. I don't know why it fails so badly compared to Whisper, which works pretty well.
About this issue
- Original URL
- State: closed
- Created 6 months ago
- Comments: 15 (3 by maintainers)
Thanks for investigating, @Narsil! So this is the result after fixing the bug you found?
It is. The model's principal usage is translation; using it in an ASR setting is likely to produce big hallucinations. Moreover, from my own usage, it is not really good with short audios.