transformers: Token embedding resizing does not work for TFGPT2Model

System Info

  • transformers version: 4.25.1
  • Platform: Linux-5.15.0-57-generic-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): 2.11.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@gante and @Rocketknight1

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

After calling add_special_tokens on the tokenizer and resize_token_embeddings on TFGPT2Model, evaluating the model raises an error indicating that the embeddings were not resized as expected.

Please see the example code and the execution output below:

from transformers import GPT2Tokenizer, TFGPT2Model

SPECIAL_TOKENS_MAPPING = {
    'bos_token': '<bos>',
    'eos_token': '<eos>',
    'pad_token': '<pad>',
    'additional_special_tokens': ['<speaker1>', '<speaker2>']
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2Model.from_pretrained("gpt2")

print("Evaluating TFGPT2Model BEFORE extending the tokenizer and model with additional tokens ...")

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
print(f"inputs = \n{inputs}\n")

outputs = model(inputs)
print("DONE!")

print("Adding tokens...")
orig_num_tokens = len(tokenizer.get_vocab())
num_special_tokens = tokenizer.add_special_tokens(SPECIAL_TOKENS_MAPPING)
print(f"orig_num_tokens = {orig_num_tokens}, num_special_tokens={num_special_tokens}")

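# Resize the token embeddings so the newly added special-token IDs are valid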
model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_special_tokens)

print("Evaluating TFGPT2Model AFTER extending the tokenizer and model with additional tokens ...")

inputs = tokenizer("<speaker1>Hello, my dog is cute<speaker2>I agree!", return_tensors="tf")
print(f"inputs = \n{inputs}\n")

outputs = model(inputs)
print("DONE!")
Execution output:

Evaluating TFGPT2Model BEFORE extending the tokenizer and model with additional tokens ...
inputs = 
{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[15496,    11,   616,  3290,   318, 13779]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1]], dtype=int32)>}

DONE!

Adding tokens...
orig_num_tokens = 50257, num_special_tokens=5

Evaluating TFGPT2Model AFTER extending the tokenizer and model with additional tokens ...
inputs = 
{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=
array([[50260, 15496,    11,   616,  3290,   318, 13779, 50261,    40,
         4236,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

Traceback (most recent call last):
  File "/home/freddy/workspace/Nuhame/mlpug/examples/chatbot/tensorflow/test_tf_resize_token_size.py", line 33, in <module>
    outputs = model(inputs)
  File "/home/freddy/.virtualenvs/mlpug-tf/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/freddy/.virtualenvs/mlpug-tf/lib/python3.9/site-packages/transformers/modeling_tf_utils.py", line 432, in run_call_with_unpacked_inputs
    return func(self, **unpacked_inputs)
  File "/home/freddy/.virtualenvs/mlpug-tf/lib/python3.9/site-packages/transformers/models/gpt2/modeling_tf_gpt2.py", line 773, in call
    outputs = self.transformer(
  File "/home/freddy/.virtualenvs/mlpug-tf/lib/python3.9/site-packages/transformers/modeling_tf_utils.py", line 432, in run_call_with_unpacked_inputs
    return func(self, **unpacked_inputs)
  File "/home/freddy/.virtualenvs/mlpug-tf/lib/python3.9/site-packages/transformers/models/gpt2/modeling_tf_gpt2.py", line 447, in call
    tf.debugging.assert_less(
tensorflow.python.framework.errors_impl.InvalidArgumentError: Exception encountered when calling layer 'transformer' (type TFGPT2MainLayer).

input_ids must be smaller than the embedding layer's input dimension (got 50261 >= 50257)
Condition x < y did not hold.
First 3 elements of x:
[50260 15496    11]
First 1 elements of y:
[50257]

Call arguments received by layer 'transformer' (type TFGPT2MainLayer):
  • input_ids=tf.Tensor(shape=(1, 11), dtype=int32)
  • past_key_values=None
  • attention_mask=tf.Tensor(shape=(1, 11), dtype=int32)
  • token_type_ids=None
  • position_ids=None
  • head_mask=None
  • inputs_embeds=None
  • encoder_hidden_states=None
  • encoder_attention_mask=None
  • use_cache=True
  • output_attentions=False
  • output_hidden_states=False
  • return_dict=True
  • training=False

Expected behavior

After resizing, the model should have 50257 + 5 = 50262 embeddings, so an input ID of 50261 is valid and the code above should run without errors.
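As a quick sanity check (a minimal sketch; get_input_embeddings() is the public accessor, and the .weight attribute layout is an assumption about this transformers version), the config and the embedding matrix can be inspected right after the resize call:

# Run immediately after model.resize_token_embeddings(...):
print(model.config.vocab_size)                    # expected: 50262
print(model.get_input_embeddings().weight.shape)  # expected: (50262, 768)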

Most upvoted comments

Fixed on all models, thanks to @susnato 🧡

Hey @tqye2000 – using the best possible reference, the code itself, you can see that you don’t need to shift the inputs. In other words, labels = inputs, all shifting happens inside the model. I hope this helps 🤗
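For example, a minimal sketch of this with TFGPT2LMHeadModel (illustrative code, not taken from the thread):

from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")

# The labels are the unshifted input_ids; the model shifts them internally
# before computing the causal language-modeling loss.
outputs = model(inputs["input_ids"], labels=inputs["input_ids"])
print(outputs.loss)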

Hi @gante, may I ask another question? For fine-tuning the GPT-2 model, should I pass labels that are exactly the same as the inputs, or should I shift the inputs by one token to create the labels? I get mixed information on the internet: some say the labels should be a copy of the inputs, while some examples show the labels shifted by one token. I apologise if this is not the right place for such questions! Many thanks!

Thank you very much, @gante! After upgrading to the current source version, resize_token_embeddings() seems to be working now. However, I get “Allocation of 740033280 exceeds 10% of free system memory” messages. I guess this is my PC’s issue.

Hey @tqye2000 👋 You can upgrade your transformers installation to match the current source version with pip install --upgrade git+https://github.com/huggingface/transformers.git

@visionscaper thank you for raising the issue! It is a generalized problem with this check, which should only rely on the config’s vocab size (which is the only reliable source of the actual vocabulary size at any given moment).

@susnato opened a fix for GPT2, but other models will need a fix as well.
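For illustration, the kind of guard described above might look like the sketch below. This is not the actual patch; check_input_ids is a hypothetical helper, and it assumes that resize_token_embeddings keeps config.vocab_size up to date:

import tensorflow as tf

def check_input_ids(input_ids: tf.Tensor, vocab_size: int) -> None:
    # Validate against the config's vocab size, which resize_token_embeddings
    # updates, rather than against a possibly stale embedding input dimension.
    tf.debugging.assert_less(
        input_ids,
        tf.cast(vocab_size, dtype=input_ids.dtype),
        message="input_ids must be smaller than the model's vocabulary size",
    )

# e.g.: check_input_ids(inputs["input_ids"], model.config.vocab_size)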