tokenizers: tokenizer is slow after adding new tokens
Hi,
I’m redirecting this issue to here as suggested in https://github.com/huggingface/transformers/issues/9958. I’ll just copy-paste; here it goes:
The tokenizer is slow when adding new tokens even with the Fast class:
```python
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

# Maybe this url for the files:
# https://huggingface.co/transformers/v3.1.0/_modules/transformers/tokenization_gpt2.html
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"

# They have to be sorted in reverse by length, otherwise the tokens aren't
# matched correctly
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = ["new_" + str(x) for x in newtokens]

# loading tokenizers from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])

for tokenizer in tokenizers.values():
    tokenizer.add_special_tokens({
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>"
    })

# Add new vocab to the "custom" tokenizers only
# https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
# https://github.com/deepset-ai/FARM/issues/157
for k in tokenizers:
    if "custom" in k:
        print(k)
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))

# creating the configuration from which the model can be made,
# sized to the tokenizer that contains the new tokens
# https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html
config = GPT2Config(
    vocab_size=len(tokenizers["fast_custom"]),
    bos_token_id=tokenizers["fast_custom"].bos_token_id,
    eos_token_id=tokenizers["fast_custom"].eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k, v in tokenizers.items():
    print(k, v.tokenize(text))
```
and then profiling the speed in Jupyter:
```python
for k in tokenizers:
    print(k)
    %timeit tokenizers[k].tokenize(text)
```
Any ideas why this may be happening? I understand that I’m increasing the vocab size by ~20% and that may slow things down, but in this code there’s a 1000-fold difference in speed. That doesn’t seem right?
Just a note: it’s crucial to add that many new tokens. I’m not considering reducing the number of new tokens. Many thanks!
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 26 (4 by maintainers)
@manugarri I believe that you should consider the following:
Regardless of the way you choose to get to the new tokens list, the steps below are what I did, but be aware that I’m not sure it is the way to go. On the other hand, it works for me, and when I say “works” I mean that I was able to train the model using MLM and use the result to run a fill-mask pipeline to guess the masked word. Maybe I should run the MLM for more epochs to get a good set of weights for the new vocab, but that is another story.
A. To add new tokens, I just merged the list generated by following the article. Here is the code I used:
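The original snippet isn’t reproduced here; the following is only a rough sketch of what such a step could look like, assuming the candidate terms come from a plain-text file (`domain_terms.txt` is a made-up name) and are filtered against the existing vocabulary:

```python
from transformers import BertTokenizerFast

# Load the base tokenizer (bert-base-uncased, as used later in the thread).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# "domain_terms.txt" is a placeholder: one candidate term per line.
with open("domain_terms.txt", encoding="utf-8") as f:
    candidates = [line.strip().lower() for line in f if line.strip()]

# Keep only the terms the tokenizer does not already know as whole tokens.
existing_vocab = set(tokenizer.get_vocab())
new_tokens = sorted(set(candidates) - existing_vocab)
print(len(new_tokens), "tokens to add")
```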
B. Then, I added the new vocab to the tokenizer and resized the embedding matrix:
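Again, only a sketch, assuming a `bert-base-uncased` MLM; the placeholder token list and output directory are illustrative:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder list; in practice this would be the list built in step A.
new_tokens = ["dyspnea", "syncope", "hemodynamically"]

# add_tokens returns how many of the tokens were actually new.
num_added = tokenizer.add_tokens(new_tokens)
print("Added", num_added, "tokens")

# Resize the embedding matrix so the model has rows for the new ids.
model.resize_token_embeddings(len(tokenizer))

# Save both so tokenizer and model stay in sync.
tokenizer.save_pretrained("./model_with_new_tokens")
model.save_pretrained("./model_with_new_tokens")
```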
C. Here is the catch. After `save_pretrained`, you will find an `added_tokens.json` file in the folder. You will also see that `vocab.txt` remains the same. When you go to use the model with the new tokens, the runtime explodes, as you are seeing. I believe it happens because the tokenizer tries to use the `added_tokens.json`.
What I did, and once again I stress that I don’t know if it is the correct way, was the following:
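Roughly (the exact commands are not shown here; this sketch assumes the save directory from the previous step): delete `added_tokens.json` and append its tokens, in id order, to the end of `vocab.txt`.

```python
import json
import os

save_dir = "./model_with_new_tokens"  # placeholder directory from step B
added_path = os.path.join(save_dir, "added_tokens.json")
vocab_path = os.path.join(save_dir, "vocab.txt")

# added_tokens.json maps token -> id; append the tokens to vocab.txt in
# ascending id order so the ids stay consistent, then delete the file.
with open(added_path, encoding="utf-8") as f:
    added = json.load(f)

with open(vocab_path, "a", encoding="utf-8") as f:
    for token, _id in sorted(added.items(), key=lambda kv: kv[1]):
        f.write(token + "\n")

os.remove(added_path)
```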
After all this I was able to use the tokenizer:
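A minimal usage sketch, with a made-up sentence (the original output is not reproduced here):

```python
from transformers import BertTokenizerFast

# Reload from the edited folder ("./model_with_new_tokens" is a placeholder).
tokenizer = BertTokenizerFast.from_pretrained("./model_with_new_tokens")

text = "no new syncope episodes, maintains dyspnea on minimal exertion"
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])
```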
The token with id `30809` is a new token. The base model I used was `bert-base-uncased`.
Hope it can help.
Nice example @rdemorais, when you mention:
> But, in the end, I’m still not sure that it was really needed to add more tokens… which leads me to that meme:
At my company we are adding additional tokens that don’t really have any semantic meaning (they are group identifiers), and I’ve noticed model performance at sequence classification improving significantly after adding those ‘custom’ tokens.
Hello @raphaelsty, thank you for spending time on this matter. I believe it is a good way to generate discussion on the topic. Everybody benefits.
After struggling with this problem, I moved on to creating the actual model I was looking for. I figured out that breaking every phrase into words, and only adding the ones that really need to be new tokens, helped the downstream task behave as intended.
I believe that it is not the job of `vocab.txt` to hold business logic. For instance, `'star wars episode vi: return of the jedi'` is task dependent, but `['star', 'wars', 'episode', 'vi', ':', 'return', 'of', 'the', 'jedi']` is not. You can use the words from the second approach to train another model and leverage previous work.

In the final results, I managed to create something like this:
Translating: no new syncope episodes. maintains dyspnea on minimal exertion, under spontaneous ventilation with support of o2 per cn at 2l/min with a good respiratory pattern
The words `AUSENTE` and `PRESENTE` mean Absent and Present. It is Assertion Detection built using the same approach we are talking about.

The thing is: note that I was able to get compound terms by training the NER accordingly. The MLM with the new tokens is the underlying tech.
But, in the end, I’m still not sure that it was really needed to add more tokens… which leads me to that meme:
I found a workaround by manually adding the new tokens to the `vocab.json` file and updating the steps that generate these new tokens in the `merges.txt` file, following the BytePairEncoding format. It works for me, and it’s efficient. Is there any better way to do it?

@manugarri can you provide a test script demoing the slowness? In the test provided by @davidnarganes there’s a slowdown, but it’s definitely not on the same level as the “slow” one.
Edit: and all references to `unique_no_split_tokens` as mentioned in previous answers are definitely “slow”.

@rdemorais it worked, after all, thanks!
@rdemorais it worked!
@lacls the `vocab.txt` is just a list of tokens, one per line. Your new token list is supposed to be appended at the end of the file.

Nevertheless, try to confirm whether you really need every symbol from the line added to the `vocab.txt`. The way you wrote the code, everything will be there, including dots, stop words, and so on.
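A small illustration of that kind of filtering before appending (the candidate list and the stop-word subset are made up):

```python
import string

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
existing = set(tokenizer.get_vocab())

# Made-up candidate list and stop-word subset, purely for illustration.
candidates = ["dyspnea", "the", ".", "ventilation", "o2", "of"]
stop_words = {"the", "of", "and", "a", "to"}

filtered = [
    t for t in candidates
    if t not in existing
    and t not in stop_words
    and t not in string.punctuation
    and len(t) > 1
]
print(filtered)  # only the terms worth appending to vocab.txt
```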
@rdemorais I think that is exactly what is happening. Will give your “hack” a try 😃
I’m also looking forward to hearing about it. I’ve added new tokens, and now the time to process 1.5 GB of documents jumped from 5 min to 8 h.
Update: I’ve added new tokens by using `add_tokens` and, after that, `model.resize_token_embeddings(len(tokenizer))`. I also saved both the tokenizer and the model. The difference is that I deleted the `added_tokens.json` file and manually appended the new tokens to the vocab file.

I don’t know if what I did is the way to go, but it is working now.