transformers: Bug in whisper finetuning tutorial? "Multiple languages detected when trying to predict..."

System Info

Transformers version: 4.38.0.dev0
Python version: Python 3.10 venv (local)
Platform: macOS Ventura 13.5

Who can help?

@sanchit-gandhi

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

Thank you for the amazing whisper finetuning tutorial at: https://huggingface.co/blog/fine-tune-whisper

When I download the ipynb and run it locally it runs fine.

However, when I change a single line (the last line) from:

trainer.train()

to:

eval_results = trainer.evaluate()

I get the following error:

ValueError: Multiple languages detected when trying to predict the most likely target language for transcription.

Full error log:

{
	"name": "ValueError",
	"message": "Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 eval_results = trainer.evaluate()
      2 print(eval_results)

File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:166, in Seq2SeqTrainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, **gen_kwargs)
    164 self.gather_function = self.accelerator.gather
    165 self._gen_kwargs = gen_kwargs
--> 166 return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)

File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer.py:3136, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3133 start_time = time.time()
   3135 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3136 output = eval_loop(
   3137     eval_dataloader,
   3138     description=\"Evaluation\",
   3139     # No point gathering the predictions if there are no metrics, otherwise we defer to
   3140     # self.args.prediction_loss_only
   3141     prediction_loss_only=True if self.compute_metrics is None else None,
   3142     ignore_keys=ignore_keys,
   3143     metric_key_prefix=metric_key_prefix,
   3144 )
   3146 total_batch_size = self.args.eval_batch_size * self.args.world_size
   3147 if f\"{metric_key_prefix}_jit_compilation_time\" in output.metrics:

File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer.py:3325, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3322         batch_size = observed_batch_size
   3324 # Prediction step
-> 3325 loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
   3326 main_input_name = getattr(self.model, \"main_input_name\", \"input_ids\")
   3327 inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:296, in Seq2SeqTrainer.prediction_step(self, model, inputs, prediction_loss_only, ignore_keys, **gen_kwargs)
    288 if (
    289     \"labels\" in generation_inputs
    290     and \"decoder_input_ids\" in generation_inputs
    291     and generation_inputs[\"labels\"].shape == generation_inputs[\"decoder_input_ids\"].shape
    292 ):
    293     generation_inputs = {
    294         k: v for k, v in inputs.items() if k not in (\"decoder_input_ids\", \"decoder_attention_mask\")
    295     }
--> 296 generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
    298 # Temporary hack to ensure the generation config is not initialized for each iteration of the evaluation loop
    299 # TODO: remove this hack when the legacy code that initializes generation_config from a model config is
    300 # removed in https://github.com/huggingface/transformers/blob/98d88b23f54e5a23e741833f1e973fdf600cc2c5/src/transformers/generation/utils.py#L1183
    301 if self.model.generation_config._from_model_config:

File ~some_path/venv/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py:533, in WhisperGenerationMixin.generate(self, input_features, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, return_timestamps, task, language, is_multilingual, prompt_ids, prompt_condition_type, condition_on_prev_tokens, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, num_segment_frames, attention_mask, time_precision, return_token_timestamps, return_segments, return_dict_in_generate, **kwargs)
    527 self._set_prompt_condition_type(
    528     generation_config=generation_config,
    529     prompt_condition_type=prompt_condition_type,
    530 )
    532 # pass self.config for backward compatibility
--> 533 init_tokens = self._retrieve_init_tokens(
    534     input_features,
    535     generation_config=generation_config,
    536     config=self.config,
    537     num_segment_frames=num_segment_frames,
    538     kwargs=kwargs,
    539 )
    540 # TODO(Sanchit) - passing `decoder_input_ids` is deprecated. One should use `prompt_ids` instead
    541 # This function should be be removed in v4.39
    542 self._check_decoder_input_ids(
    543     prompt_ids=prompt_ids, init_tokens=init_tokens, is_shortform=is_shortform, kwargs=kwargs
    544 )

File ~some_path/venv/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py:1166, in WhisperGenerationMixin._retrieve_init_tokens(self, input_features, generation_config, config, num_segment_frames, kwargs)
   1158 lang_ids = self.detect_language(
   1159     input_features=input_features,
   1160     encoder_outputs=kwargs.get(\"encoder_outputs\", None),
   1161     generation_config=generation_config,
   1162     num_segment_frames=num_segment_frames,
   1163 )
   1165 if torch.unique(lang_ids).shape[0] > 1:
-> 1166     raise ValueError(
   1167         \"Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.\"
   1168     )
   1170 lang_id = lang_ids[0].item()
   1172 # append or replace lang_id to init_tokens

ValueError: Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language."
}

Is this expected behaviour? Thank you kindly in advance.

Expected behavior

A normal evaluation run to evaluate the performance of the model on the language before starting to train it.

About this issue

State: closed
Created 5 months ago
Comments: 15 (4 by maintainers)

Most upvoted comments

Hey @rishabhjain16,

Ah yes, indeed: the training loop runs the evaluation loop internally and sadly doesn't let the user pass any generation keyword arguments such as "language". You can, however, fix this easily by replacing the following cell in the notebook:

with:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "hi"  # define your language of choice here

and the training should work!
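
To illustrate why a default stored on the generation config helps when the caller cannot pass keyword arguments, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in (`StubWhisper`, `evaluation_inside_training`, the `"auto-detect"` return value); it is not the transformers implementation, only the fallback pattern described above.

```python
from types import SimpleNamespace


class StubWhisper:
    """Hypothetical stand-in for the model; NOT the transformers class."""

    def __init__(self):
        # Assumption: the default language lives on a config object.
        self.generation_config = SimpleNamespace(language=None)

    def generate(self, input_features, language=None):
        # A per-call argument wins; otherwise fall back to the stored default.
        chosen = language if language is not None else self.generation_config.language
        if chosen is None:
            return "auto-detect"  # placeholder for language detection
        return chosen


def evaluation_inside_training(model, batches):
    # Mimics a loop that calls generate() without any language argument.
    return [model.generate(batch) for batch in batches]


model = StubWhisper()
before = evaluation_inside_training(model, ["batch0", "batch1"])

model.generation_config.language = "hi"  # same idea as the fix above
after = evaluation_inside_training(model, ["batch0", "batch1"])

print(before, after)
```

In this sketch the loop itself never changes; only the default it falls back to does.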

+10

patrickvonplaten on Feb 12, 2024

This was basically my question too. I did not understand how to pass these now-required language arguments to the trainer rather than to evaluate. What I ended up doing was this:

model = WhisperForConditionalGeneration.from_pretrained(....)

def custom_generate(self, *args, **kwargs):
    kwargs["language"] = your_language # 'en', 'nl'

    return WhisperForConditionalGeneration.generate(self, *args, **kwargs)

model.generate = custom_generate.__get__(model, WhisperForConditionalGeneration)

I am pretty sure a better solution will come along soon, but this works!
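
For readers unfamiliar with the `__get__` trick, here is a self-contained sketch of the same binding pattern on a stand-in class. `StubModel` and `make_forced_language_generate` are hypothetical names used only for illustration; the stub merely echoes the keyword arguments it receives and does not involve Whisper.

```python
class StubModel:
    """Hypothetical stand-in; only echoes the keyword arguments it receives."""

    def generate(self, *args, **kwargs):
        return kwargs


def make_forced_language_generate(language):
    def custom_generate(self, *args, **kwargs):
        kwargs["language"] = language  # overrides whatever the caller passed
        return StubModel.generate(self, *args, **kwargs)

    return custom_generate


patched = StubModel()
untouched = StubModel()

# function.__get__(instance, cls) returns a method bound to that instance,
# so the override shadows the class method on `patched` only.
patched.generate = make_forced_language_generate("nl").__get__(patched, StubModel)

patched_result = patched.generate(max_new_tokens=8)
untouched_result = untouched.generate(max_new_tokens=8)
print(patched_result, untouched_result)
```

Because the bound method is stored on the instance, other instances of the class keep the original behaviour.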

SethvdAxe on Mar 19, 2024

Sorry for being a bit late here. Yes, this error is expected: we recently changed the default behavior to language detection when the language to be evaluated is not specified.

If you train your model on Hindi as shown in the notebook, can you make sure to pass:

- eval_results = trainer.evaluate()
+ eval_results = trainer.evaluate(language="hi")

so that the model doesn’t try to detect the language it has to transcribe?
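
To illustrate why forcing a language avoids the error, here is a minimal sketch of the check visible at the bottom of the traceback (`torch.unique(lang_ids).shape[0] > 1`). It is a simplification in plain Python: `resolve_language` and its arguments are hypothetical names, and the per-sample detection results are passed in as a list instead of being predicted by the model.

```python
def resolve_language(detected_per_sample, forced=None):
    """Return the one language to decode with, or raise as in the traceback.

    detected_per_sample: hypothetical list of per-sample detection results.
    forced: an explicitly requested language, or None to rely on detection.
    """
    if forced is not None:
        # An explicit language means detection results are never consulted.
        return forced
    if len(set(detected_per_sample)) > 1:
        # Mirrors the check that raises when a batch yields several languages.
        raise ValueError("Multiple languages detected (message shortened)")
    return detected_per_sample[0]


print(resolve_language(["hi", "hi"]))               # detection agrees
print(resolve_language(["hi", "ur"], forced="hi"))  # forced language wins
```

Since detection is a per-sample prediction, a batch drawn from a single-language dataset could presumably still raise if the guesses disagree, which would be consistent with the reports in this thread.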

patrickvonplaten on Feb 9, 2024

cc @patrickvonplaten as well

ArthurZucker on Feb 1, 2024

There have been a lot of updates to make the API a lot better for the user. The model card available here mentions the generate_kwargs, which should help you.

I am going to close this issue, as both @patrickvonplaten's comments and mine should have addressed your inquiries.

ArthurZucker on Mar 4, 2024

I am getting a similar error during training. Any help is appreciated.

rishabhjain16 on Feb 12, 2024

I too have the same error. I verified my dataset; it contains only one language.

chicodespons on Feb 2, 2024

Ok, can confirm that on 4.37.2 this bug does not appear. Something to do with https://github.com/huggingface/transformers/pull/28687 I guess?

SethvdAxe on Feb 1, 2024