transformers: Logits size does not match vocabulary size when using pyctcdecode for a fine-tuned wav2vec 2.0 model
Environment info
- `transformers` version: 4.17.0.dev0
- Platform: Linux-5.4.0-1063-azure-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.10.2+cu102 (True)
- Tensorflow version (GPU?): 2.6.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes (device = 'cuda')
- Using distributed or parallel set-up in script?: No
Who can help
Dataset used
https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0
Information
Model I am using: https://huggingface.co/Iskaj/xlsr300m_cv_7.0_nl_lm (dataset: https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0). The model is a fine-tuned version of https://huggingface.co/facebook/wav2vec2-xls-r-300m.
The problem arises when using:
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F

model_id = "Iskaj/xlsr300m_cv_7.0_nl_lm"

sample_iter = iter(load_dataset("mozilla-foundation/common_voice_8_0", "nl", split="test", streaming=True, use_auth_token=True))
sample = next(sample_iter)
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

transcription = processor.batch_decode(logits.numpy()).text
```
To reproduce
Steps to reproduce the behavior:
- install packages using:
```
!pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/datasets.git
!pip install torchaudio soundfile librosa Levenshtein telwoord wandb jiwer
```
- run:
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F

model_id = "Iskaj/xlsr300m_cv_7.0_nl_lm"

sample_iter = iter(load_dataset("mozilla-foundation/common_voice_8_0", "nl", split="test", streaming=True, use_auth_token=True))
sample = next(sample_iter)
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

transcription = processor.batch_decode(logits.numpy()).text
```
- Observe the error: `ValueError: Input logits of size 48, but vocabulary is size 50`
Expected behavior
I would expect pyctcdecode to work correctly and give me a transcription.

I suspect it has something to do with `<s>` and `</s>`. I've been struggling a bit with the length of the logits not matching the length of the vocabulary when using pyctcdecode. For example, in this repo that uses an LM, the vocab.json includes `<s>` and `</s>`: https://huggingface.co/patrickvonplaten/wav2vec2-base-100h-with-lm/blob/main/vocab.json, but in this repo it doesn't: https://huggingface.co/hf-test/xls-r-300m-sv/blob/main/vocab.json. Maybe that helps.
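For reference, a quick way to see where the 48 vs. 50 mismatch comes from (a minimal sketch reusing `model`, `processor` and `logits` from the snippet above; only standard transformers attributes are used):

```python
# Compare the model's output dimension with the tokenizer's vocabulary sizes.
# Here the logits have 48 entries per frame, while the tokenizer counts 50
# tokens once the added <s>/</s> entries from added_tokens.json are included.
print("logits dim:           ", logits.shape[-1])
print("model vocab_size:     ", model.config.vocab_size)
print("tokenizer vocab_size: ", processor.tokenizer.vocab_size)    # vocab.json only
print("len(tokenizer):       ", len(processor.tokenizer))          # incl. added tokens
print("added tokens:         ", processor.tokenizer.get_added_vocab())
```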
I just took a more detailed look into it and it's actually not really a bug in Wav2Vec2, but somewhat of an edge-case scenario coupled with unintuitive design in Transformers.
Let’s explain. Both your vocabulary file and your alphabet file have been correctly constructed:
However, in Transformers, tokenizers can also have additional tokens that, when not added to the vocabulary, don't necessarily have a corresponding output character / logit id. This was the case here. The EOS and BOS tokens were added as "additional tokens" even though they have no corresponding logit id - see: https://huggingface.co/Iskaj/xlsr300m_cv_7.0_nl_lm/blob/e0290ad21fd43cf69d9f4de754067c02f9d6641e/added_tokens.json. This means they are part of the tokenizer's vocabulary, which in turn would force them to also be part of the alphabet. That doesn't make sense, though, since the alphabet has to correspond 1-to-1 to the logit ids, and EOS and BOS have no logit id. => So to make your model work, we actually need to remove those special tokens, which I have done in the last three commits:
So the above command now works as expected.
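For completeness, a quick sanity check along those lines (just a sketch; it assumes the updated repo files have been re-downloaded, e.g. with a fresh cache):

```python
from transformers import AutoModelForCTC, AutoProcessor

# After the fix, the tokenizer no longer carries the extra <s>/</s> tokens,
# so its length should match the model's output dimension again.
model = AutoModelForCTC.from_pretrained("Iskaj/xlsr300m_cv_7.0_nl_lm")
processor = AutoProcessor.from_pretrained("Iskaj/xlsr300m_cv_7.0_nl_lm")

assert len(processor.tokenizer) == model.config.vocab_size
```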
Finally, there is one more difficulty to consider. By default, Wav2Vec2CTCTokenizer sets the `eos_token` and the `bos_token` to a value, being `<eos>` and `<bos>` - see: https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/wav2vec2#transformers.Wav2Vec2CTCTokenizer => this means we actually need to overwrite this with `None` (or `null` in json). So to remove EOS and BOS, one would have to overwrite those two fields (a sketch of this is shown after this paragraph).

The original Wav2Vec2 tokenizers all had EOS and BOS defined in the vocab and a corresponding logit id, which is why it is the default. However, this doesn't always have to be the case.
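A minimal sketch of what that could look like at the file level (an illustration of the idea, assuming a local clone of the model repo; the actual fix was done in the three commits mentioned above):

```python
import json

# Sketch of the file-level change: set eos_token/bos_token to null in
# tokenizer_config.json and drop <s>/</s> from added_tokens.json, so the
# tokenizer no longer expects tokens that have no logit id.
with open("tokenizer_config.json") as f:
    tok_cfg = json.load(f)
tok_cfg["eos_token"] = None
tok_cfg["bos_token"] = None
with open("tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)

with open("added_tokens.json") as f:
    added = json.load(f)
added = {tok: idx for tok, idx in added.items() if tok not in ("<s>", "</s>")}
with open("added_tokens.json", "w") as f:
    json.dump(added, f, indent=2)
```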
The blog post: https://huggingface.co/blog/fine-tune-xlsr-wav2vec2 should be the most helpful one
I had indeed tried removing the two tokens but still encountered the error; because of the error message, I thought it was a fault on my end in terms of modelling. Thanks for the quick response!