transformers: WhisperTimeStampLogitsProcessor error while using Whisper pipelines. Was WhisperTimeStampLogitsProcessor used?

System Info

Hello,

When I ran this notebook, https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=Ca4YYdtATxzo, I encountered the following error: "There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?" This happens in particular with audio longer than 30 seconds; for audio shorter than 30 seconds, timestamps are returned correctly. How can I fix it?

Specs: transformers==4.27.0.dev0

from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

results = pipe(
    speech_file,
    return_timestamps=True,
    chunk_length_s=30,
    stride_length_s=[6, 0],
    batch_size=32,
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

Who can help?

@ArthurZucker @sanchit-gandhi @Narsil

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

results = pipe(
    speech_file,
    return_timestamps=True,
    chunk_length_s=30,
    stride_length_s=[6, 0],
    batch_size=32,
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

Expected behavior

results = {
    "text": "Some Turkish results.",
    "chunks": [
        {"text": " Some Turkish results.", "timestamp": (0.0, 4.4)},
        {"text": " Some Turkish results.", "timestamp": (4.4, 28.32)},
        {"text": " Some Turkish results.", "timestamp": (28.32, 45.6)},
    ],
}

About this issue

  • State: closed
  • Created a year ago
  • Comments: 36 (13 by maintainers)

Most upvoted comments

Hey @upskyy - in my experience, fine-tuning with LoRA / QLoRA is a fantastic way to prevent this ‘catastrophic forgetting’ effect, where Whisper forgets how to predict timestamps after fine-tuning. For this, you can check out the following repo: https://github.com/Vaibhavs10/fast-whisper-finetuning
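For reference, a minimal LoRA sketch using the peft library. The rank, alpha, and target modules below are illustrative defaults rather than the linked repo's exact configuration, so treat this as a starting point:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# attach low-rank adapters to the attention projections; the base weights stay frozen
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable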

And @ArthurZucker - cool that the latest tokenizer has the 1500 special tokens already added! This should make our lives a lot easier for encoding with timestamps, since the tokenizer is now able to map the timestamp strings to tokens.

All we really need to do then is have a small amount of data in our train set that has timestamps in the Whisper format, e.g.

"<|0.00|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and<|6.24|><|6.24|> can discover in it but little of rocky Ithaca.<|9.44|>"

Generally, you only need between 1% and 5% of your data to be timestamped to ensure you retain Whisper’s timestamp prediction ability. The easiest way to get this data is to use the pre-trained Whisper model to re-annotate 1% of your training data with timestamps. You can then merge this data back into your full training corpus so you train on both non-timestamped (99%) and timestamped (1%) data.
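A rough sketch of that re-annotation step, using the pipeline to transcribe a ~1% subset with timestamps and reformat the chunks into the Whisper format (raw_dataset, to_whisper_format, and the column names here are illustrative, not from this thread):

from transformers import pipeline

annotator = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device="cuda:0",
)

def to_whisper_format(chunks):
    # convert pipeline chunks into "<|start|> text<|end|>" strings
    parts = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        parts.append(f"<|{start:.2f}|>{chunk['text']}<|{end:.2f}|>")
    return "".join(parts)

def annotate(batch):
    out = annotator(
        {"array": batch["audio"]["array"], "sampling_rate": batch["audio"]["sampling_rate"]},
        return_timestamps=True,
    )
    batch["sentence"] = to_whisper_format(out["chunks"])
    batch["predict_timestamps"] = True
    return batch

# re-annotate roughly 1% of the training data
subset = raw_dataset.select(range(int(0.01 * len(raw_dataset))))
subset = subset.map(annotate)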

What we then want to do is enable/disable timestamps when we encode the labels, depending on whether the labels have timestamps or not:

def prepare_dataset(batch):
    # load and resample the audio data from 48kHz to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # set the tokenizer prefix tokens depending on whether we have timestamps or not;
    # "predict_timestamps" is a boolean column you add to your dataset to indicate
    # whether this example's labels contain timestamps
    predict_timestamps = batch["predict_timestamps"]
    tokenizer.set_prefix_tokens(language=language, task="transcribe", predict_timestamps=predict_timestamps)

    # encode the target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
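You would then map this over your data as usual (hypothetical usage, assuming a datasets Dataset with "audio", "sentence", and the added "predict_timestamps" columns):

dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)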

Hey! For fine-tuning with timestamps, you should either use the latest tokenizer (which by default should add the 1500 special tokens, not more) or the previous one, which also supported them, but not for encoding. Pinging @sanchit-gandhi, as he has been working on Distil-Whisper and might have a training script that adds timestamps. Also, this kind of question would be better suited to the forum.

@devxpy I have reproduced with your example. It seems this model never outputs timestamps.

I am guessing it was fine-tuned without timestamps, so the error is somewhat expected. However, it led me to reduce the hard error to a soft error. The results are still nonsensical (check out the test).

I spent some time trying to find a better fix in the logits processor itself, but to no avail. There’s just no way to fix models that refuse to output timestamp tokens. Note that Whisper models are never even forced to output increasing timestamp tokens, so there is already a lot of slack there. A soft error is better.
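If you want to check whether a given checkpoint emits timestamp tokens at all, a diagnostic along these lines can help (a sketch, assuming a transformers version where Whisper's generate accepts return_timestamps; audio_array is a placeholder for your 16kHz waveform):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)
processor = WhisperProcessor.from_pretrained(MODEL_NAME)

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
ids = model.generate(inputs.input_features, return_timestamps=True)

# timestamp token ids start immediately after <|notimestamps|> in the vocabulary
timestamp_begin = processor.tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1
print("emits timestamps:", bool((ids >= timestamp_begin).any()))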