tokenizers: Exception upon attempting to load a Tokenizer from file

Hi, I’m attempting to simply serialize and then deserialize a trained tokenizer. When I run the following code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

I get the following traceback:

Traceback (most recent call last):
...
    tokenizer = Tokenizer.from_file(save_to_filepath)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1 column 5408
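
A quick way to see what the deserializer is tripping over is to open the saved JSON directly; the "model" section of that file is what the ModelWrapper enum in the error message is parsed from. A minimal inspection sketch, reusing the path from the snippet above:

import json

with open('preprocessing/tokenizer.json') as f:
    data = json.load(f)

# Print the serialized model type and the (possibly null) pre_tokenizer entry
print(data['model']['type'])
print(data.get('pre_tokenizer'))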

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 12
  • Comments: 31 (2 by maintainers)

Most upvoted comments

I’ve had the same issue. Try adding a pre_tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

In case this might be of help to others: I was getting this error when using the SentenceTransformers library, and in my case upgrading tokenizers to version 0.10.3 fixed the issue:

pip install tokenizers==0.10.3

If anyone is getting this error, I also recommend taking a look at the dependency requirements (e.g., which version of the tokenizers library is required).
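
A quick way to confirm which versions are actually installed in the environment that raises the error (a minimal sketch; it assumes both packages are importable there):

import tokenizers
import transformers

print(tokenizers.__version__)
print(transformers.__version__)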

I had the same error when loading Llama 2 models. Upgrading to transformers==4.33.2 and tokenizers==0.13.3 solved it for me.
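
For reference, that upgrade can be done in one step, using the version pins mentioned above:

pip install --upgrade transformers==4.33.2 tokenizers==0.13.3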

Any update on this problem? I’ve had the same issue.

Hi @Narsil: I think I have a very strange issue that produces the same error stack trace as this issue. Here are the steps:

  1. I trained a custom XLMRobertaTokenizerFast tokenizer from scratch on my multi-lingual corpus. Note that I trained it with transformers-4.26.0 in a Python 3.7 conda environment on a different EC2 instance. After training, I loaded it in a separate script using XLMRobertaTokenizerFast.from_pretrained() and it worked fine without any errors.
  2. A few days later, for various reasons I had to change instances, and the new one only has Python 3.6, not 3.7. The latest transformers version supported on Python 3.6 is 4.18.0, which is what is installed there. Loading the same saved tokenizer, which loaded perfectly with 4.26.0 as described above, now fails with the same XLMRobertaTokenizerFast.from_pretrained() call. I also tried transformers==4.2.1 to double-check that it wasn’t a bug specific to 4.26.0. On both transformers versions tried with Python 3.6, the error stack trace is:
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 59 column 3

Is this expected? Are tokenizers supposed to be backwards-incompatible across different transformers library versions? Installing Python 3.7 from scratch isn’t trivial on this instance, so I’d appreciate any possible workaround. I didn’t do anything extravagant while training the tokenizer: I initialised a SentencePieceBPETokenizer() and just trained it from scratch by invoking .train() on my corpus, roughly as sketched below.
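
A minimal sketch of that train-and-load flow; the corpus path, vocab size, and output directory are placeholders, not the actual ones used:

from tokenizers import SentencePieceBPETokenizer
from transformers import XLMRobertaTokenizerFast

# Train a SentencePiece-BPE tokenizer from scratch on the corpus
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=32000)
tokenizer.save("tokenizer.json")

# Wrap it as a fast transformers tokenizer and save it for reuse
fast_tokenizer = XLMRobertaTokenizerFast(tokenizer_file="tokenizer.json")
fast_tokenizer.save_pretrained("my_tokenizer")

# Later, possibly in a different environment / transformers version:
reloaded = XLMRobertaTokenizerFast.from_pretrained("my_tokenizer")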

Strangely, the model trained on the Python 3.7 instance loads perfectly on the Python 3.6 instance, so the issue is only with the tokenizer.

@Narsil, requesting your help on this^. I can’t post the tokenizer itself here for confidentiality reasons, but if you need any other information from me to help debug this, please feel free to ask.