transformers: LLaMA FastTokenizer does not add `eos_token_id` at the end.
System Info
- `transformers` version: 4.29.0.dev0
- Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.7
- Huggingface_hub version: 0.13.3
- Safetensors version: 0.3.0
- PyTorch version (GPU?): 2.1.0.dev20230411+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
As mentioned in the title, the LLaMA tokenizer does not add the `eos_token` at the end of the inputs. This only happens with the fast version (`use_fast=True`).
Steps to reproduce the behaviour:
- Load the LLaMA tokenizer:
  `tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)`
- Tokenize something:
  `simple_sentence = "This is a sentence to test if the tokenizer adds eos token."`
  `simple_sentence_ids = tokenizer(simple_sentence, add_special_tokens=True).input_ids`
- Print the `input_ids` to check whether the `eos_token_id` (`2`) is added at the end:
  `print(simple_sentence_ids)`
- Output:
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]
Expected behavior
Expected output
[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
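For comparison, here is a small sketch that shows the discrepancy side by side (reusing the `LLAMA_PATH` placeholder from above); with this bug, only the slow tokenizer's output ends in `2`:

```python
from transformers import AutoTokenizer

sentence = "This is a sentence to test if the tokenizer adds eos token."
for use_fast in (True, False):
    tok = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=use_fast)
    ids = tok(sentence, add_special_tokens=True).input_ids
    print(f"use_fast={use_fast}: last token id = {ids[-1]}")
```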
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 55 (1 by maintainers)
@avacaondata - I have noticed this same issue, where the model is not learning to predict the EOS token. After doing some digging through several examples and source code, I've noticed something a bit strange, particularly related to the `DataCollatorForLanguageModeling`. A very typical pattern that I have seen suggested is the one sketched just below.
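A minimal sketch of that commonly suggested pattern, reusing the `LLAMA_PATH` placeholder from the reproduction above:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH)

# Reuse EOS as the padding token (the pattern being questioned here)
tokenizer.pad_token = tokenizer.eos_token

# mlm=False -> causal LM: labels are a copy of input_ids,
# and every position equal to pad_token_id is set to -100
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```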
However, the problem I see with this approach is that when the DataCollator overrides OR generates the `labels` field for the batch, it sets all tokens `== pad_token` to be `-100`. Since the `CrossEntropy` loss ignores tokens with `-100`, even if the tokenizer we are using properly adds the `eos_token`, the loss function will actually ignore this token.

Ways that I have worked around this issue are either (1) to ensure that `eos_token_id != pad_token_id` and make sure that the tokenizer includes the `eos_token` when tokenizing (some tokenizers, such as the T5 tokenizer, do this automatically), OR (2) to create the labels column myself when tokenizing - by cloning `input_ids` - and then use the `DataCollatorForSeq2Seq`. I actually really like the `DataCollatorForSeq2Seq` because it automatically pads the inputs and labels, but does not mess with tokens in unexpected ways, such as the `eos_token`. Hope this is helpful!
Finally found the correct way to do this here: https://georgesung.github.io/ai/qlora-ift/

You need to do `tokenizer.add_special_tokens({'pad_token': '[PAD]'})` instead of `tokenizer.pad_token = tokenizer.eos_token`. And you need to add the `tokenizer.eos_token` at the end of EACH training example.

I believe if you just set `pad_token = eos_token`, the model still is not learning to predict the `eos_token`, because the corresponding `attn_mask` does not include the token and the `labels` ignore that token - i.e. no loss is computed for it. Not 100% sure about this, but that was what it seemed like from some self-exploration.

Yes! Quick fix: use the slow tokenizer. Otherwise I'll open a PR to add template processing! Thanks for reporting!
Adding the `eos_token` at the end of each training example can be activated via the tokenizer's `add_eos_token` option, or simply by toggling the attribute afterwards (see the sketch below).
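A minimal sketch of both ways to turn this on, assuming the slow LLaMA tokenizer and the `LLAMA_PATH` placeholder from earlier (whether the attribute toggle also affects the fast tokenizer depends on the installed version):

```python
from transformers import AutoTokenizer

# Option 1: request the EOS token at load time
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=False)

# Option 2: toggle it on an already-loaded slow LLaMA tokenizer
tokenizer.add_eos_token = True

print(tokenizer("hello", add_special_tokens=True).input_ids)  # should end with eos_token_id (2)
```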
@jonathangomesselman thanks a lot!
I was also running into this issue where the model was unable to output the eos_token after fine-tuning. I also followed examples where they set `tokenizer.pad_token = tokenizer.eos_token`. From your earlier comment, I made sure `tokenizer.pad_token != tokenizer.eos_token` by setting `tokenizer.add_special_tokens({'pad_token': '[PAD]'})` and using `DataCollatorForLanguageModeling` as before, e.g. the setup sketched below. Now the model finally outputs the eos_token as intended!
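A sketch of that setup, reusing the `LLAMA_PATH` placeholder (the embedding resize is an assumption added here because `[PAD]` is a brand-new token and the vocabulary grows by one):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # pad_token != eos_token

model = AutoModelForCausalLM.from_pretrained(LLAMA_PATH)
model.resize_token_embeddings(len(tokenizer))  # account for the new [PAD] token

# Padding positions become -100 in the labels, but EOS is kept and trained on
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```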
@avacaondata - You’re welcome!
I have generally followed this practice as well - just fine-tuning over the model outputs, since generally I don't need the model to directly learn the statistical distribution over human instructions, but rather just how to "react" to them.

Continuing from above, to use the `DataCollatorForSeq2Seq` for decoder-only models we need to manually create the `labels` field when tokenizing our data - i.e. ensuring we have the fields `input_ids`, `attention_mask`, and `labels`. Since we create the `labels` ourselves, we have control over which tokens we explicitly train over vs. which we want to ignore (using `-100` as a label). Here is the skeleton of some code you could use to tokenize the inputs; a couple of things to note/highlight are called out in the comments of the sketch below.
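A hedged sketch of such a tokenization function; the `prompt`/`response` field names and the length limit are illustrative assumptions, not the commenter's exact code:

```python
def tokenize_example(example, tokenizer, max_length=512):
    # Tokenize prompt and response separately so we know where each begins
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False).input_ids
    response_ids = tokenizer(example["response"], add_special_tokens=False).input_ids

    input_ids = [tokenizer.bos_token_id] + prompt_ids + response_ids + [tokenizer.eos_token_id]
    attention_mask = [1] * len(input_ids)

    # Ignore BOS + prompt in the loss (-100); train on the response and the EOS token
    labels = [-100] * (1 + len(prompt_ids)) + response_ids + [tokenizer.eos_token_id]

    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }
```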
Now that we have our data tokenized and formatted, we can use the `DataCollatorForSeq2Seq` roughly as in the sketch below.
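A sketch of the collator + `Trainer` wiring, assuming `model`, `tokenizer`, and a `tokenized_dataset` produced by a function like the one above; the output directory and batch size are placeholders:

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

tokenizer.pad_token = tokenizer.eos_token  # safe here, see the note below

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-sft", per_device_train_batch_size=4),
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()
```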
Note that the LLaMA tokenizer by default does not have a `pad_token`, so we have to set it. Because we are using the `DataCollatorForSeq2Seq`, it is okay for us to set the padding token to the `eos_token`, as the collator does not create the labels tensor but rather just pads our existing labels tensor with `-100` - i.e. the `eos_token` will not be ignored/replaced.

This may not be the most standard approach, but it is an example of what I have found to work and have seen some repos roughly follow. The main idea is that by creating the `labels` ourselves we are able to set `-100` for tokens that we don't want to fine-tune over and ensure that we learn to generate the `eos_token`.
Actually I was talking about Falcon, not LLaMA, because I'm facing an issue similar to the ones people are reporting with LLaMA. In fact I upgraded my transformers version to the latest on the `main` branch, and the problem persists… The model never generates an EOS token, so it never stops generating… I have tried to explicitly add the string "<|endoftext|>" at the end of the inputs for fine-tuning, but it still doesn't work. What can I do to make Falcon generate an EOS token?
Not by default, but if specified with `add_eos_token=True` it should. You can always fine-tune the model to make it learn when to stop.

@dtthanh1971 Your issue may be because `len(tokenizer) != model.vocab_size`, i.e. `len(tokenizer) == model.vocab_size + 1`. That was my experience. See Kumar Saurabh's answer here: https://stackoverflow.com/questions/76633368/how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoi
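A sketch of the fix implied by that answer, assuming `model` and `tokenizer` are already loaded: keep the embedding matrix and pad token id in sync after adding the new special token.

```python
tokenizer.add_special_tokens({'pad_token': '[PAD]'})   # len(tokenizer) grows by 1
model.resize_token_embeddings(len(tokenizer))          # so the embedding matrix matches again
model.config.pad_token_id = tokenizer.pad_token_id
```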
@avacaondata you're welcome! I had very similar questions to what you asked and found myself a bit surprised not to find many good resources. Thankfully the HuggingFace code repos are actually quite readable, especially in separating the complex model logic of the base pre-trained transformer models (encoder-decoder + decoder-only) from the "language modeling" head (see the subclasses ending in `...ConditionalGeneration`, `...CausalLM`, and `...LMHeadModel`).

If you're curious yourself, I would definitely recommend looking at the code to learn more. Each model has a slightly different naming convention, but you will see that the logic is nearly identical. Some to check out are the `...CausalLM`, `...LMHeadModel`, and `...ConditionalGeneration` implementations mentioned above. Have fun exploring!
That it doesn't generate <|endoftext|> (token id 11) when calling generate, therefore it never stops generating. I have tried setting `eos_token_id` to 193, which corresponds to `\n`, but I don't think that's a clean fix. I have also noticed that when tokenizing the inputs with the Falcon-40b tokenizer, it does not add the `eos_token_id` at the end of the input ids.

I guess they would set the `pad_token_id` using the `eos_token_id`? `model.config.pad_token_id = model.config.eos_token_id`
Of course they are not if the size of the matrix changed / the tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before and thus were not seen.
Pass `use_fast=False` when loading with `AutoTokenizer`. With `tokenizer("text", add_special_tokens=True)` you should then get the same results for both fast and slow. If not, and you are on main, feel free to open a new issue 😉

I solved it by changing `AutoTokenizer` to `LlamaTokenizer` to force the slow tokenizer instead of the fast tokenizer that is automatically imported. I lost some functions, but it works.
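A sketch of both ways to force the slow tokenizer, reusing the `LLAMA_PATH` placeholder from the report above:

```python
from transformers import AutoTokenizer, LlamaTokenizer

# Either ask AutoTokenizer for the slow implementation...
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=False)

# ...or instantiate the slow class directly
tokenizer = LlamaTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True)

print(tokenizer("text", add_special_tokens=True).input_ids)  # ends with eos_token_id (2)
```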
@robertheessels your answer solved my problem. You saved my life. Thank you so much!!!
As a temporary fix I was able to get the inference (of a Falcon 7B training) to stop correctly like this:
- setting `tokenizer.pad_token = tokenizer.eos_token`
- adding the token `*****` at the end of every training example
- passing `eos_token_id=39735` (the id of that token) when generating

This makes the inference generate the token `*****` at the end of the answer (because it is in all the training examples), at which point it will stop because it is set as the ending token.
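A sketch of the inference side of that workaround; `prompt`, `model`, and `tokenizer` are assumed to exist, and `39735` is the id of the `*****` marker described above:

```python
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200, eos_token_id=39735)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```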
The same is happening with Falcon…