transformers: Logits size does not match vocabulary size when using pyctcdecode for a fine-tuned wav2vec 2.0 model

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.4.0-1063-azure-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.10.2+cu102 (True)
  • Tensorflow version (GPU?): 2.6.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (device = ‘cuda’)
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten, @anton-l

Dataset used

https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0

Information

Model I am using: https://huggingface.co/Iskaj/xlsr300m_cv_7.0_nl_lm (dataset: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0). The model is a fine-tuned version of https://huggingface.co/facebook/wav2vec2-xls-r-300m

The problem arises when using:

import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F

model_id = "Iskaj/xlsr300m_cv_7.0_nl_lm"

# stream a single test sample from Common Voice 8.0 (Dutch)
sample_iter = iter(load_dataset("mozilla-foundation/common_voice_8_0", "nl", split="test", streaming=True, use_auth_token=True))
sample = next(sample_iter)

# resample from 48 kHz to the 16 kHz expected by wav2vec 2.0
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

# batch_decode on a Wav2Vec2ProcessorWithLM runs pyctcdecode beam search over the logits
transcription = processor.batch_decode(logits.numpy()).text
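
For context, the ValueError shown further down is raised inside pyctcdecode when the width of the logit matrix does not match the size of the decoder's alphabet; plain greedy (argmax) decoding through the tokenizer performs no such check. A minimal workaround sketch (not a fix), reusing the logits and processor from the snippet above:

pred_ids = torch.argmax(logits, dim=-1)
# CTC greedy decoding without the language model; this path does not touch pyctcdecode
greedy_transcription = processor.tokenizer.batch_decode(pred_ids)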

To reproduce

Steps to reproduce the behavior:

  1. Install packages:
     !pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
     !pip install git+https://github.com/huggingface/transformers.git
     !pip install git+https://github.com/huggingface/datasets.git
     !pip install torchaudio soundfile librosa Levenshtein telwoord wandb jiwer
  2. Run the snippet shown above under "The problem arises when using:".
  3. Observe the error: `ValueError: Input logits of size 48, but vocabulary is size 50`

Expected behavior

I would expect pyctcdecode to work correctly and give me a transcription. I suspect it has something to do with <s> and </s>. I've been struggling a bit with the length of the logits not matching up with the length of the vocabulary when using pyctcdecode. For example, in this repo that uses the LM, the vocab.json includes <s> and </s>: https://huggingface.co/patrickvonplaten/wav2vec2-base-100h-with-lm/blob/main/vocab.json, but in this repo it doesn't: https://huggingface.co/hf-test/xls-r-300m-sv/blob/main/vocab.json. Maybe that helps.
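
A quick way to see the difference described above is to download each repo's vocab.json and check for the two tokens. A small sketch (the use of hf_hub_download here is an assumption for illustration; the repo and file names come from the links above):

import json
from huggingface_hub import hf_hub_download

for repo in ["patrickvonplaten/wav2vec2-base-100h-with-lm", "hf-test/xls-r-300m-sv"]:
    path = hf_hub_download(repo, filename="vocab.json")
    with open(path) as f:
        vocab = json.load(f)
    # per the comparison above: expected True True for the first repo, False False for the second
    print(repo, "<s>" in vocab, "</s>" in vocab)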

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

I just took a more detailed look into it, and it's actually not really a bug in Wav2Vec2, but somewhat of an edge-case scenario coupled with unintuitive design in Transformers.

Let's explain. Both your vocabulary file and your alphabet file have been constructed correctly.

However, in Transformers, tokenizers can also have additional tokens that, when not added to the vocabulary, don't necessarily have a corresponding output character / logit id. This was the case here. The EOS and BOS tokens were added as "additional tokens" even though they have no corresponding logit id - see: https://huggingface.co/Iskaj/xlsr300m_cv_7.0_nl_lm/blob/e0290ad21fd43cf69d9f4de754067c02f9d6641e/added_tokens.json. This means they belong to the vocabulary of the tokenizer, which in turn would force them to also be part of the alphabet. That doesn't make sense, though, since the alphabet has to correspond 1-to-1 to the logit ids, and EOS and BOS have no logit id. => So to make your model work, we actually need to remove those special tokens, which I have done in the last three commits.
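
A minimal diagnostic sketch of the mismatch (assuming the repository state before those commits): compare the number of logit ids the model emits with the tokenizer vocabulary size including added tokens.

from transformers import AutoModelForCTC, AutoProcessor

model_id = "Iskaj/xlsr300m_cv_7.0_nl_lm"
model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

print(model.config.vocab_size)   # 48: one logit id per entry in vocab.json
print(len(processor.tokenizer))  # 50: the 48 entries plus the added <s> and </s>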

So the above command now works as expected.

Finally, there is one more difficulty to consider. By default, Wav2Vec2CTCTokenizer sets the eos_token and the bos_token to non-None values ("</s>" and "<s>") - see: https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/wav2vec2#transformers.Wav2Vec2CTCTokenizer

=> This means we actually need to overwrite these defaults with None (or null in JSON). So to remove EOS and BOS, one would have to do:

from transformers import Wav2Vec2CTCTokenizer

tok = Wav2Vec2CTCTokenizer.from_pretrained("<path/to/vocab/file/without/eos/and/bos>", eos_token=None, bos_token=None)
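
As a follow-up usage sketch (an assumption about the workflow, not part of the original comment): after stripping EOS and BOS, the tokenizer can be saved back into a local clone of the model repo from which added_tokens.json has been removed (as in the commits above), and the processor reloaded to check that the sizes line up again.

# save the EOS/BOS-free tokenizer into a local clone of the repo (without added_tokens.json)
tok.save_pretrained("./xlsr300m_cv_7.0_nl_lm")

from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("./xlsr300m_cv_7.0_nl_lm")
print(len(processor.tokenizer))  # should now equal the model's logit dimension (48)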

The original Wav2Vec2 tokenizers all had EOS and BOS defined in the vocab with a corresponding logit id, which is why this is the default. However, this doesn't always have to be the case.

The blog post: https://huggingface.co/blog/fine-tune-xlsr-wav2vec2 should be the most helpful one

I did indeed try removing the 2 tokens after encountering the error, but the error message made me think it was a fault on my end in terms of modelling. Thanks for the quick response!