transformers: Inconsistent vocab size between pretrained T5Tokenizer and T5ForConditionalGeneration
❓ Questions & Help
The pretrained `T5Tokenizer` has a vocab size of 32100 (32000 tokens plus 100 extra_ids), but the shared embedding layer of `T5ForConditionalGeneration` has shape (32128, 768). I also checked the google-research implementation of T5 and found that it uses a vocab size of 32100 as well.
Where do the extra 28 embeddings come from, and how can we map them to the tokenizer?
To reproduce
```python
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
)

tokenizer_pretrained = T5Tokenizer.from_pretrained('t5-base')
model_pretrained = T5ForConditionalGeneration.from_pretrained('t5-base')

len(tokenizer_pretrained.get_vocab()), model_pretrained.state_dict()['shared.weight'].shape
```
Output:
```
(32100, torch.Size([32128, 768]))
```
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 13
- Comments: 15 (5 by maintainers)
Hey @cstorm125,
I think those 28 leftover embeddings are simply not used. As far as I know, the embedding matrix has length 32128 simply because 32128 is a more GPU-friendly number (32128 = 128 * 251) than 32100 (32100 = 4 * 8025). That means the GPU is probably more efficient when it can deal directly with dimensions that are multiples of a large power of two.
Also see: https://www.quora.com/Why-should-I-choose-a-mini-batch-size-of-32-64-128-256-etc-i-e-a-power-of-two-and-not-a-size-of-50-100-500-1000-Is-there-any-benefit-of-choosing-power-of-two-mini-batch-sizes
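A quick check of the divisibility claim (plain Python, nothing from transformers):

```python
# 32128 is a multiple of 128 (a power of two), while 32100 is not.
print(32128 % 128, 32128 // 128)  # 0 251
print(32100 % 128)                # 100 -> not a multiple of 128
```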
Temporary solution:
`model.resize_token_embeddings(len(tokenizer))`
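A minimal sketch of how that temporary fix fits together, assuming the same `t5-base` tokenizer and model as in the reproduction snippet above:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Shrink the shared embedding from 32128 down to len(tokenizer) == 32100,
# so the model can no longer point at an id the tokenizer does not know.
# (t5-base ties input and output embeddings, so both are resized together.)
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # torch.Size([32100, 768])
```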
I just found that the generate function sometimes produces input ids > 32100, especially when I evaluate a fine-tuned model at a very early training step. Thanks, @Darshan2104! `model.resize_token_embeddings(len(tokenizer))` temporarily resolves my issue.
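If you would rather not resize the model, a hedged sketch of a decode-time guard I use as a workaround (not a library feature; `tokenizer` and `model` are just the `t5-base` objects from the snippets above):

```python
import torch

inputs = tokenizer("translate English to German: Hello world.", return_tensors="pt")
generated = model.generate(**inputs, max_length=20)

# Ids in the range 32100..32127 have an embedding row but no entry in the
# tokenizer, so replace them with the pad token before decoding.
vocab_size = len(tokenizer)
pad = torch.full_like(generated, tokenizer.pad_token_id)
safe_ids = torch.where(generated < vocab_size, generated, pad)
print(tokenizer.batch_decode(safe_ids, skip_special_tokens=True))
```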
This is wrong; it shouldn't be this way. If the model predicts a wrong index and you then calculate the loss, it will cause serious issues. It's hard to believe no one cares about this.
I found this mismatch recently and I think it may result in many bugs. I wish someone would fix it.