transformers: Inconsistent vocab size between pretrained T5Tokenizer and T5ForConditionalGeneration
❓ Questions & Help
The pretrained `T5Tokenizer` has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of `T5ForConditionalGeneration` has shape (32128, 768). I also checked the google-research implementation of T5 and found that it uses a vocab size of 32100 as well.
Where do the extra 28 embeddings come from, and how can we map them to the tokenizer?
To reproduce
```python
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
)

tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')

len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape
```
Output:
```
(32100, torch.Size([32128, 768]))
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 13
- Comments: 15 (5 by maintainers)
Hey @cstorm125,
I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has length 32128 simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (32100 = 4 * 8025). That means the GPU is probably more efficient when it can deal directly with dimensions that are multiples of a large power of two.
Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
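A quick check of the divisibility claim (plain Python, nothing from transformers):

```python
# 32128 is a multiple of 128 (a power of two), while 32100 is not.
print(32128 % 128, 32128 // 128)  # 0 251
print(32100 % 128)                # 100 -> not a multiple of 128
```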
Temporary solution:
`model.resize_token_embeddings(len(tokenizer))`
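A minimal sketch of how that temporary fix fits together, assuming the same `t5-base` tokenizer and model as in the reproduction snippet above:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Shrink the shared embedding from 32128 down to len(tokenizer) == 32100,
# so the model can no longer point at an id the tokenizer does not know.
# (t5-base ties input and output embeddings, so both are resized together.)
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])
```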
I just found that the generate function sometimes produces input ids > 32100, especially when I evaluate a fine-tuned model at a very early training step. Thanks, @Darshan2104! `model.resize_token_embeddings(len(tokenizer))` temporarily resolves my issue.
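If you would rather not resize the model, a hedged sketch of a decode-time guard I use as a workaround (not a library feature; `tokenizer` and `model` are just the `t5-base` objects from the snippets above):

```python
import torch

inputs = tokenizer("translate English to German: Hello world.", return_tensors="pt")
generated = model.generate(**inputs, max_length=20)

# Ids in the range 32100..32127 have an embedding row but no entry in the
# tokenizer, so replace them with the pad token before decoding.
vocab_size = len(tokenizer)
pad = torch.full_like(generated, tokenizer.pad_token_id)
safe_ids = torch.where(generated < vocab_size, generated, pad)
print(tokenizer.batch_decode(safe_ids, skip_special_tokens=True))
```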
This is wrong; it shouldn't be this way. If the model predicts a wrong index and you then calculate the loss, it will cause serious issues. It's hard to believe no one cares about this.
I found this mismatch recently and I think it may result in many bugs. I wish someone would fix it.