transformers: Bug in whisper finetuning tutorial? "Multiple languages detected when trying to predict..."
System Info
- Transformers version: 4.38.0.dev0
- Python version: 3.10 (local venv)
- Platform: macOS Ventura 13.5
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Thank you for the amazing whisper finetuning tutorial at: https://huggingface.co/blog/fine-tune-whisper
When I download the ipynb and run it locally, it runs fine.
However, when I change a single line (the last line) from:
trainer.train()
to:
eval_results = trainer.evaluate()
I get the following error:
ValueError: Multiple languages detected when trying to predict the most likely target language for transcription.
Full error log:
{
"name": "ValueError",
"message": "Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[21], line 1
----> 1 eval_results = trainer.evaluate()
2 print(eval_results)
File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:166, in Seq2SeqTrainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, **gen_kwargs)
164 self.gather_function = self.accelerator.gather
165 self._gen_kwargs = gen_kwargs
--> 166 return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer.py:3136, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
3133 start_time = time.time()
3135 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3136 output = eval_loop(
3137 eval_dataloader,
3138 description=\"Evaluation\",
3139 # No point gathering the predictions if there are no metrics, otherwise we defer to
3140 # self.args.prediction_loss_only
3141 prediction_loss_only=True if self.compute_metrics is None else None,
3142 ignore_keys=ignore_keys,
3143 metric_key_prefix=metric_key_prefix,
3144 )
3146 total_batch_size = self.args.eval_batch_size * self.args.world_size
3147 if f\"{metric_key_prefix}_jit_compilation_time\" in output.metrics:
File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer.py:3325, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
3322 batch_size = observed_batch_size
3324 # Prediction step
-> 3325 loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
3326 main_input_name = getattr(self.model, \"main_input_name\", \"input_ids\")
3327 inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None
File ~some_path/venv/lib/python3.10/site-packages/transformers/trainer_seq2seq.py:296, in Seq2SeqTrainer.prediction_step(self, model, inputs, prediction_loss_only, ignore_keys, **gen_kwargs)
288 if (
289 \"labels\" in generation_inputs
290 and \"decoder_input_ids\" in generation_inputs
291 and generation_inputs[\"labels\"].shape == generation_inputs[\"decoder_input_ids\"].shape
292 ):
293 generation_inputs = {
294 k: v for k, v in inputs.items() if k not in (\"decoder_input_ids\", \"decoder_attention_mask\")
295 }
--> 296 generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
298 # Temporary hack to ensure the generation config is not initialized for each iteration of the evaluation loop
299 # TODO: remove this hack when the legacy code that initializes generation_config from a model config is
300 # removed in https://github.com/huggingface/transformers/blob/98d88b23f54e5a23e741833f1e973fdf600cc2c5/src/transformers/generation/utils.py#L1183
301 if self.model.generation_config._from_model_config:
File ~some_path/venv/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py:533, in WhisperGenerationMixin.generate(self, input_features, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, return_timestamps, task, language, is_multilingual, prompt_ids, prompt_condition_type, condition_on_prev_tokens, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, num_segment_frames, attention_mask, time_precision, return_token_timestamps, return_segments, return_dict_in_generate, **kwargs)
527 self._set_prompt_condition_type(
528 generation_config=generation_config,
529 prompt_condition_type=prompt_condition_type,
530 )
532 # pass self.config for backward compatibility
--> 533 init_tokens = self._retrieve_init_tokens(
534 input_features,
535 generation_config=generation_config,
536 config=self.config,
537 num_segment_frames=num_segment_frames,
538 kwargs=kwargs,
539 )
540 # TODO(Sanchit) - passing `decoder_input_ids` is deprecated. One should use `prompt_ids` instead
541 # This function should be be removed in v4.39
542 self._check_decoder_input_ids(
543 prompt_ids=prompt_ids, init_tokens=init_tokens, is_shortform=is_shortform, kwargs=kwargs
544 )
File ~some_path/venv/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py:1166, in WhisperGenerationMixin._retrieve_init_tokens(self, input_features, generation_config, config, num_segment_frames, kwargs)
1158 lang_ids = self.detect_language(
1159 input_features=input_features,
1160 encoder_outputs=kwargs.get(\"encoder_outputs\", None),
1161 generation_config=generation_config,
1162 num_segment_frames=num_segment_frames,
1163 )
1165 if torch.unique(lang_ids).shape[0] > 1:
-> 1166 raise ValueError(
1167 \"Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language.\"
1168 )
1170 lang_id = lang_ids[0].item()
1172 # append or replace lang_id to init_tokens
ValueError: Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing `language='...'` or make sure all input audio is of the same language."
}
Is this expected behaviour? Thank you kindly in advance.
Expected behavior
A normal evaluation run that measures the model's performance on the target language before any training starts.
Hey @rishabhjain16,
Ah yes, indeed the training loop runs the evaluation loop inside and sadly doesn't let the user pass any generation keyword params such as "language". You can, however, fix this easily by replacing the relevant cell in the notebook so that the language is set explicitly, and the training should work!
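For reference, a rough sketch of what such a replacement cell could look like, assuming the Hindi setup from the blog post: it pins the language and task on the model's generation config, so every call to generate, including the one inside Seq2SeqTrainer, skips language detection.

# Sketch of a possible replacement cell (assuming the Hindi setup from the blog post):
# pin the target language and task so Whisper's generate() never runs
# per-batch language detection during evaluation.
model.generation_config.language = "hindi"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None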
This was basically my question too. I could not work out how to pass these now-required language arguments to the trainer itself rather than to evaluate. What I ended up doing was along these lines:
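One possible shape of such a workaround (a sketch, assuming the checkpoint and arguments from the notebook) is to build a GenerationConfig with the language pinned and hand it to the trainer via Seq2SeqTrainingArguments, so the in-training evaluation loop uses it as well:

from transformers import GenerationConfig, Seq2SeqTrainingArguments

# Sketch of passing the language to the trainer itself; the checkpoint name
# and output_dir follow the notebook and are assumptions here.
generation_config = GenerationConfig.from_pretrained("openai/whisper-small")
generation_config.language = "hindi"
generation_config.task = "transcribe"

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",
    predict_with_generate=True,
    generation_config=generation_config,  # Seq2SeqTrainer applies this to the model
    # ... keep the remaining arguments from the notebook unchanged ...
)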
I am pretty sure a better solution will come along soon, but this works!
Sorry for being a bit late here. Yes, this error is expected; we've recently changed the default behavior to language detection when the language to be evaluated is not specified.
If you train your model on Hindi as shown in the notebook, can you make sure to pass the language explicitly so that the model doesn't try to detect the language it has to transcribe?
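For the standalone evaluation call from the report, one way to do this is sketched below; Seq2SeqTrainer.evaluate forwards extra keyword arguments straight to generate, and "hi" assumes the Hindi dataset from the notebook.

# Sketch: forward language/task as generation kwargs through evaluate(), so
# Whisper's generate() skips per-batch language detection.
eval_results = trainer.evaluate(language="hi", task="transcribe")
print(eval_results)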
cc @patrickvonplaten as well
There have been a lot of updates to make the API a lot better for the user. The model card available here mentions generate_kwargs, which should help you.
I am going to close this issue, as both @patrickvonplaten's and my comments should have addressed your inquiries.
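For reference, a sketch of the generate_kwargs usage being referred to, at inference time through the ASR pipeline; the checkpoint name and audio path below are placeholders.

from transformers import pipeline

# Sketch: force the language/task via generate_kwargs when transcribing with
# a fine-tuned checkpoint (repo id and file name are placeholders).
asr = pipeline("automatic-speech-recognition", model="your-username/whisper-small-hi")
result = asr("sample.wav", generate_kwargs={"language": "hindi", "task": "transcribe"})
print(result["text"])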
I am getting a similar error during training. Any help is appreciated.
I too have the same error. I verified my dataset; it contains only one language.
OK, I can confirm that this bug does not appear on 4.37.2. Something to do with https://github.com/huggingface/transformers/pull/28687, I guess?