transformers: Seamless M4T-v2 Inference bug when using chunk_length_s parameter

System Info

Ubuntu 22, Python 3.12, latest Transformers

Who can help?

@Narsil @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I know the chunk_length_s parameter is experimental, but it completely falls apart in the case of M4T-v2, while it works FAR better with Whisper. Would it be possible to disable it, or print a big fat warning when it is used? To reproduce:

  1. Use the Seamless M4T-v2 model in a transformers pipeline.
  2. Use the automatic-speech-recognition task with any language you like.
  3. Use an audio file that is longer than 30 seconds.
  4. Set the chunk_length_s parameter to any value, e.g. 30.
  5. Compare the score to a transcription produced without setting chunk_length_s.

The scores in terms of WER are usually 4-5 times worse than you would expect. Chunking is basically unusable at the moment, which makes M4T-v2 unusable for ASR unless your files are very short.
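To quantify the degradation in step 5, the two transcriptions can be scored against a reference with a small word-level edit-distance helper. A sketch in plain Python (an evaluation library such as jiwer would normally be used instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming (Levenshtein) edit distance over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = row[j]
            row[j] = min(row[j] + 1,        # deletion
                         row[j - 1] + 1,    # insertion
                         prev + (r != h))   # substitution (0 cost on match)
            prev = cur
    return row[-1] / len(ref)


# A perfect transcription scores 0.0; one inserted word against a
# three-word reference scores 1/3.
print(wer("the cat sat", "the cat sat"))        # → 0.0
print(wer("the cat sat", "the cat sat down"))   # → 0.3333...
```

Comparing `wer(reference, chunked)` against `wer(reference, unchunked)` on the same long file makes the 4-5x gap concrete.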

Expected behavior

Just chunk the audio and get approximately the same score, if possible. I don't know why it fails so hard in comparison to Whisper, which works pretty well.

About this issue

  • Original URL
  • State: closed
  • Created 6 months ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

Thanks for investigating, @Narsil! So is this the result after fixing the bug you found?

@ylacombe is this model known for hallucinating so much?

It is; the model's principal usage is translation. Using it in an ASR setting is likely to produce big hallucinations. Moreover, from my own usage, it is not really good with short audio either.