transformers: WhisperTimeStampLogitsProcessor error while using Whisper pipelines. Was WhisperTimeStampLogitsProcessor used?
System Info
Hello,
When I ran this notebook, https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=Ca4YYdtATxzo, I encountered the following error: "There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?" It happens especially with audio longer than 30 seconds; for audio shorter than 30 seconds, the timestamps are returned correctly.
How can I fix it?
Specs:
transformers==4.27.0.dev0
from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)
results = pipe(
    speech_file, return_timestamps=True, chunk_length_s=30,
    stride_length_s=[6, 0], batch_size=32,
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)
Who can help?
@ArthurZucker @sanchit-gandhi @Narsil
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
task="automatic-speech-recognition",
model=MODEL_NAME,
device='cuda:0',
generate_kwargs = {"language":"<|tr|>","task": "transcribe"})
results = pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6,0], batch_size=32, generate_kwargs = {"language":"<|tr|>","task": "transcribe"})
Expected behavior
results = {
    'text': 'Some Turkish results.',
    'chunks': [
        {'text': ' Some Turkish results.', 'timestamp': (0.0, 4.4)},
        {'text': ' Some Turkish results.', 'timestamp': (4.4, 28.32)},
        {'text': ' Some Turkish results.', 'timestamp': (28.32, 45.6)},
    ],
}
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 36 (13 by maintainers)
Proposed changes:
- https://huggingface.co/openai/whisper-base/discussions/12
- https://huggingface.co/openai/whisper-large/discussions/29
- https://huggingface.co/openai/whisper-medium/discussions/12
- https://huggingface.co/openai/whisper-large-v2/discussions/30
- https://huggingface.co/openai/whisper-small/discussions/19
- https://huggingface.co/openai/whisper-tiny/discussions/9
Hey @upskyy - in my experience, fine-tuning with LoRA / QLoRA is a fantastic way to prevent this ‘catastrophic forgetting’ effect where Whisper forgets how to predict timestamps after fine-tuning. For this, you can check out the following repo: https://github.com/Vaibhavs10/fast-whisper-finetuning
And @ArthurZucker - cool that the latest tokenizer has the 1500 special tokens already added! This should make our lives a lot easier for encoding with timestamps, since the tokenizer is now able to map the timestamp strings to tokens.
All we really need to do then is have a small amount of data in our train set that has timestamps in the Whisper format, e.g.
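something like the following made-up, single-segment string (timestamp tokens at 0.02 s resolution; the text is just the placeholder from the expected output above):

labels = "<|0.00|> Some Turkish results.<|4.40|>"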
Generally, you only need between 1-5% of your data to be timestamped to ensure you retain Whisper’s timestamp prediction abilities. The easiest way of getting this data is to use the pre-trained Whisper model to re-annotate 1% of your training data with timestamps. You can then merge this data into your full training corpus to train on both non-timestamped (99%) and timestamped (1%) data.
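A rough sketch of that re-annotation step, using the same pipeline as in this issue (the annotate_with_timestamps helper and the chunk-joining logic are my own illustration, not an official recipe):

from transformers import pipeline

# pre-trained checkpoint used purely as a timestamp annotator
annotator = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device="cuda:0",
)

def annotate_with_timestamps(audio):
    # run with return_timestamps=True and stitch the chunks back into
    # a Whisper-style timestamped string to use as the training label
    out = annotator(audio, return_timestamps=True, chunk_length_s=30)
    parts = []
    for chunk in out["chunks"]:
        start, end = chunk["timestamp"]
        parts.append(f"<|{start:.2f}|>{chunk['text']}<|{end:.2f}|>")
    return "".join(parts)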
What we then want to do is enable/disable timestamps when we encode the labels, depending on whether the labels have timestamps or not:
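Something along these lines (just a sketch, assuming the updated tokenizer files that include the timestamp tokens; has_timestamps is a hypothetical per-example flag, not a field from any official script):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-large-v2", language="turkish", task="transcribe"
)

def encode_labels(text, has_timestamps):
    # keep or drop the <|notimestamps|> prefix token depending on whether
    # the target transcription contains timestamp tokens
    tokenizer.set_prefix_tokens(predict_timestamps=has_timestamps)
    return tokenizer(text).input_ids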
Hey! For fine-tuning with timestamps, you should either use the latest tokenizer (which by default should add the 1500 special tokens, not more) or use the previous one, which also supported them, but not for encoding. Pinging @sanchit-gandhi, as he has been working on Distil-Whisper and might have a training script that adds timestamps. Also, this kind of question would be better suited for the forum.
@devxpy I have reproduced with your example. It seems this model never outputs timestamps.
I am guessing it was fine-tuned without timestamps, so the error is somewhat expected. However, it led me to reduce the hard error to a soft error. The results are still nonsensical (check out the test).
I spent some time trying to find a better fix by fixing the logits processor itself, but to no avail. There's just no way to fix models that refuse to output timestamp tokens. Note that Whisper models are never even forced to output increasing timestamp tokens, so there's already a lot of room for error there. A soft error is better.