transformers: LLaMA FastTokenizer does not add `eos_token_id` at the end.

System Info

  • transformers version: 4.29.0.dev0
  • Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.7
  • Huggingface_hub version: 0.13.3
  • Safetensors version: 0.3.0
  • PyTorch version (GPU?): 2.1.0.dev20230411+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens with the fast version (use_fast=True).

Steps to reproduce the behavior:

  1. Load the LLaMA tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
  2. Tokenize something
simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
simple_sentence_ids = tokenizer(
    simple_sentence, add_special_tokens=True
).input_ids
  3. Print the input_ids to check if the eos_token_id (2) is added at the end.
print(simple_sentence_ids)
  4. Output:
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]

Expected behavior

Expected output

[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 55 (1 by maintainers)

Most upvoted comments

@avacaondata - I have noticed this same issue, where the model is not learning to predict the EOS token. After doing some digging through several examples and source code, I’ve noticed something a bit strange particularly related to the DataCollatorForLanguageModeling. A very typical pattern that I have seen suggested is the following:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

However, the problem I see with this approach is that when the DataCollator overrides OR generates the labels field for the batch it sets all tokens == pad_token to be -100.

labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels

Since the CrossEntropy loss ignores tokens labeled -100, even if the tokenizer we are using properly adds the eos_token, the loss function will actually ignore it.
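
For example, here is a minimal sketch of that masking behavior when pad_token has been set to the eos_token (the toy input_ids and the test checkpoint name are just for illustration):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # pad_token_id == eos_token_id == 2

# A toy example whose input_ids end with the eos_token_id (2)
features = [{"input_ids": [1, 910, 338, 2]}]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator(features)
print(batch["labels"])  # the trailing eos position is now -100, so no loss is computed on it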

Ways that I have worked around this issue are either (1) to ensure that the eos_token_id != pad_token_id and make sure that the tokenizer includes the eos_token when tokenizing (some automatically do this such as the T5 tokenizer) OR (2) create the labels column myself when tokenizing - by cloning input_ids - and then using the DataCollatorForSeq2Seq. I actually really like the DataCollatorForSeq2Seq because it automatically pads the inputs and labels, but does not mess with tokens in unexpected ways, such as the eos_token.

Hope this is helpful!

Finally found the correct way to do this here: https://georgesung.github.io/ai/qlora-ift/

You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token

And you need to add the tokenizer.eos_token at the end of EACH training example.
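
A minimal sketch of that setup, assuming a causal LM checkpoint (the path is the one used elsewhere in this thread; the example text is a placeholder). Note that the new [PAD] token is not in the original vocab, so the embedding matrix has to be resized as well:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Dedicated [PAD] token instead of reusing the eos_token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Grow the embedding matrix to cover the newly added token
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# And eos at the end of EACH training example
example_text = "instruction and response go here"  # placeholder training example
text_with_eos = example_text + tokenizer.eos_token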

I guess they would set the pad_token_id using the eos_token_id? model.config.pad_token_id = model.config.eos_token_id

I believe that if you just set pad_token = eos_token, the model still does not learn to predict the eos_token, because the corresponding attn_mask does not include the token and the labels ignore it - i.e. no loss is computed for it. Not 100% sure about this, but that was what it seemed like from some self-exploration.

Yes! Quick fix, use the slow tokenizer. Otherwise I’ll open a PR to add template processing! Thanks for reporting!

Adding the eos_token at the end of each training example can be activated using

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token = True)

Or simply:

>>> tokenizer.add_eos_token = True

@jonathangomesselman thanks a lot!

I was also running into this issue where the model was unable to output the eos_token after fine-tuning. I also followed examples where they set tokenizer.pad_token = tokenizer.eos_token. From your earlier comment, I made sure tokenizer.pad_token != tokenizer.eos_token by setting tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and using DataCollatorForLanguageModeling as before, e.g.

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Now the model finally outputs the eos_token as intended!

@avacaondata - You’re welcome!

I have generally followed this practice as well - just fine-tuning over the model outputs, since generally I don’t need the model to directly learn the statistical distribution over human instructions, but rather just how to “react” to them.

Continuing from above, to use the DataCollatorForSeq2Seq for decoder-only models we need to manually create the labels field when tokenizing our data - i.e. ensuring we have the fields input_ids, attention_mask, and labels. Since we create the labels ourselves we have control over what tokens we explicitly train over vs. which we want to ignore (using -100 as a label). Here is the skeleton of some code you could use to tokenize the inputs:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
# By default the bos_token is added and not the eos_token. For instruction tuning I often ignore bos_token.
tokenizer.add_bos_token = False
tokenizer.add_eos_token = True

def create_instruction_tuned_format(data_row):
  return f"""<User Instruction>:{data_row["instruct"]}
<Agent Response>: {data_row['response']}
""".strip()

def tokenize(data_row):
  """Format and tokenize instruction tuning data

  1) Combine the user input (instruction) and agent response
  2) Create `labels` - ensuring we only fine tune over the 
  desired agent response
  """
  model_input_text = create_instruction_tuned_format(data_row)
  # Tokenize the full model input
  model_input = tokenizer(
        model_input_text, 
        truncation=True,
        padding=False,
        return_tensors=None
  )

  # Create `labels` - ignoring user input (instructions)
  agent_response = tokenizer(data_row['response']).input_ids
  num_tokens_ignore = len(model_input['input_ids']) - len(agent_response)
  ignored_tokens = [-100] * (num_tokens_ignore)
  # Copy over the ids for the desired agent response
  model_input['labels'] = ignored_tokens \
                            + model_input['input_ids'][-len(agent_response):]
  
  # Just to demonstrate length equality
  assert len(model_input['labels']) == len(model_input['input_ids'])

  return model_input

tokenized_ds = ds.map(tokenize, remove_columns=ds.column_names)

A couple of things to note/highlight:

  1. We combine the user instruction and agent response using a very simple format. In the LIMA paper for example they introduce a new EOT (end-of-turn) token to separate the instruction and the response.
  2. We tokenize the response to figure out the number of fine-tuning tokens at the end of the full token sequence.

Now that we have our data tokenized and formatted we can use the DataCollatorForSeq2Seq as follows:

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForSeq2Seq(
    tokenizer, return_tensors="pt", padding=True
)

batch_size = 8
train_dataloader = DataLoader(
    tokenized_ds, shuffle=True, collate_fn=data_collator, batch_size=batch_size, pin_memory=True
)

Note that the LLaMA tokenizer by default does not have a pad_token, so we have to set one. Because we are using the DataCollatorForSeq2Seq, it is okay to set the padding token to the eos_token: the collator does not create the labels tensor but rather just pads our existing labels tensor with -100, i.e. the eos_token will not be ignored/replaced.

This may not be the most standard approach for doing this - but this is an example of what I have found to work / have seen some repos roughly follow. The main idea being that by creating the labels ourselves we are able to set -100 for tokens that we don’t want to fine-tune over + ensure that we learn to generate the eos_token.
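
A quick way to see that last point, with toy features (the ids and label values are made up for illustration): the collator pads labels with -100 rather than rewriting them, so an eos label survives.

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.pad_token = tokenizer.eos_token

features = [
    {"input_ids": [1, 910, 2], "attention_mask": [1, 1, 1], "labels": [-100, 910, 2]},
    {"input_ids": [1, 2],      "attention_mask": [1, 1],    "labels": [-100, 2]},
]
collator = DataCollatorForSeq2Seq(tokenizer, return_tensors="pt", padding=True)
batch = collator(features)
print(batch["labels"])  # the shorter row is padded with -100; the eos labels (2) are kept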

Actually I was talking about Falcon, not Llama, because I’m facing an issue similar to the ones people are reporting with Llama. In fact I upgraded my transformers version to the latest version on the main branch, and the problem persists… The model never generates an EOS token, so it never stops generating… I have tried to explicitly add the string “<|endoftext|>” at the end of the inputs for fine-tuning, but it still doesn’t work.

What can I do to make Falcon generate an eos token?

But it shouldn’t add an eos token right? The LM is not trained to generate a token after the eos I believe.

By default, but if specified with add_eos_token=True it should. You can always fine-tune the model to make the model learn when to stop.

@dtthanh1971 Your issue may be because len(tokenizer) != model.vocab_size, i.e. len(tokenizer) == model.vocab_size + 1. That was my experience. See Kumar Saurabh’s answer here: https://stackoverflow.com/questions/76633368/how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoi

@avacaondata you’re welcome! I had very similar questions to what you asked and was a bit surprised not to find many good resources. Thankfully the HuggingFace code repos are actually quite readable, especially in separating the complex model logic of the base pre-trained transformer models (encoder-decoder + decoder-only) vs. adding the “language modeling” head (see sub-classes with ...ConditionalGeneration, ...CausalLM, ...LMHeadModel).

If you’re curious yourself, I would definitely recommend looking at the code to learn more. Each model has a slightly different naming convention but you will see that the logic is nearly identical. Some to check out are:

Have fun exploring!

That it doesn’t generate <|endoftext|> (token id 11) when calling generate, therefore it never stops generating. I have tried setting eos_token_id to 193, which corresponds to \n, but I don’t think that’s a clean fix. I have noticed that when tokenizing the inputs with the Falcon-40b tokenizer, it’s not adding eos_token_id at the end of the input ids.

I guess they would set the pad_token_id using the eos_token_id? model.config.pad_token_id = model.config.eos_token_id

Of course they are not, if the size of the matrix changed / these tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before and thus were never seen.

  1. You can force the usage of slow tokenizer by setting use_fast = False when loading with AutoTokenizer
  2. The outputs (adding special tokens) should be the same. If they are not, then this is an issue for us. However, on main, if you use tokenizer("text", add_special_tokens=True) for both fast and slow you should have the same results (see the comparison sketch below). If not, and you are on main, feel free to open a new issue 😉
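
For point 2, a small comparison sketch (reusing the checkpoint name and example sentence from earlier in the thread):

from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True, use_fast=False)
fast = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True, use_fast=True)

text = "This is a sentence to test if the tokenizer adds eos token."
print(slow(text, add_special_tokens=True).input_ids)
print(fast(text, add_special_tokens=True).input_ids)  # on current main, both should end with eos_token_id (2)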

I solved it by changing AutoTokenizer to LlamaTokenizer to force the use of the slow tokenizer instead of the fast tokenizer that is imported automatically. I lost some functionality, but it works.

@robertheessels your answer solved my problem. You saved my life. Thank you so much!!!

You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token

And you need to add the tokenizer.eos_token at the end of EACH training example.

As a temporary fix I was able to get inference (for a Falcon 7B fine-tune) to stop correctly like this:

  • In each row of my training data, at the end I added “*****” (without the quotes), which encoded into one token: 39735
  • Then I do the normal training ( just using tokenizer.pad_token = tokenizer.eos_token)
  • And in the inference run I set eos_token_id=39735

This makes the inference generate token ***** at the end of the answer (because it is in all the training examples), at which point it will stop because it is set as the ending token.

    output_tokens = model.generate(
        input_ids = batch.input_ids, 
        max_new_tokens=100,
        temperature=0.001,
        top_p=0.7,
        num_return_sequences=1,
        pad_token_id=39735, # *****
        eos_token_id=39735, # *****
    )

The same is happening with Falcon…