llama.cpp: Tokenization is not equal to Meta's tokenization.

I’m comparing the tokenization between the original Meta repo and llama.cpp using LLaMA (I also had the same issue with LLaMA v2).

For example, tokenizing the prompts "Hello world" and " Hello world" gives the following:

For prompt "Hello world":
    llama.cpp tokenizer: [10994, 3186]
    Meta tokenizer:      [15043, 3186]

For prompt " Hello world":
    llama.cpp tokenizer: [15043, 3186]
    Meta tokenizer:      [29871, 15043, 3186]

Exploring these tokens by detokenizing them, I got:

For tokens [10994, 3186]:
    llama.cpp tokenizer: |b'Hello world'|
    Meta tokenizer:      |Hello world|

For tokens [15043, 3186]:
    llama.cpp tokenizer: |b' Hello world'|
    Meta tokenizer:      |Hello world|

For tokens [29871, 15043, 3186]:
    llama.cpp tokenizer: |b' Hello world'|
    Meta tokenizer:      | Hello world|

*Adding | to ease visualization.
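A minimal sketch of how the Meta-side numbers above can be reproduced with the sentencepiece package directly; the tokenizer.model path is an assumption:

    import sentencepiece as spm

    # Load Meta's SentencePiece model (path is an assumption).
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    print(sp.encode("Hello world"))    # [15043, 3186]
    print(sp.encode(" Hello world"))   # [29871, 15043, 3186]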

Exploring each token above with the id_to_piece functionality:

id_to_piece for llama.cpp:
    id 10994: |b'Hello'|
    id 3186:  |b' world'|
    id 15043: |b' Hello'|
    id 29871: |b' '|

id_to_piece for Meta:
    id 10994: |Hello|
    id 3186:  |▁world|
    id 15043: |▁Hello|
    id 29871: |▁|

*Adding | to ease visualization. Note that the piece for token 29871 is not the underscore character but "\u2581" (LOWER ONE EIGHTH BLOCK; see more about this here).
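For reference, a sketch of the same lookup on the Meta side with sentencepiece's id_to_piece (same assumed tokenizer.model):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # path is an assumption
    for tok in (10994, 3186, 15043, 29871):
        # id_to_piece returns the raw piece, including the U+2581 prefix.
        print(tok, repr(sp.id_to_piece(tok)))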

But using the detokenizer on each id individually:

Using the llama.cpp detokenizer:
    id 10994: |b'Hello'|
    id 3186:  |b' world'|
    id 15043: |b' Hello'|
    id 29871: |b' '|

Using the Meta detokenizer:
    id 10994: |Hello|
    id 3186:  |world|
    id 15043: |Hello|
    id 29871: ||
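The difference on the Meta side comes from the decode step, which maps U+2581 back to a space and drops it when it leads the output; a sketch (same assumed tokenizer.model):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # path is an assumption
    for tok in (10994, 3186, 15043, 29871):
        # decode() turns U+2581 into a space and strips it at the start of the
        # output, which is why a lone 29871 decodes to an empty string.
        print(tok, "|" + sp.decode([tok]) + "|")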

The code used to produce these results can be seen here. Use this file for the Meta tokenizer. The ggml-model-f16.bin model is the 7B LLaMA model after conversion with the convert.py script, as mentioned here.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 24 (5 by maintainers)

Most upvoted comments

Let’s create a PR with a few failing cases, and I’ll try to give it more visibility so we get some help with it. I’m afraid I’m out of my depth here and won’t be able to understand how the normalization works and what it takes to implement it, so hopefully people familiar with it will help out.

If @viniciusarruda and @ggerganov don’t disagree: I think we should close this issue and wait for possible Unicode normalization issues to show up (otherwise I need help to design a test case).

Nice. I’ll try this one next to cover the Unicode character set (that should at least demonstrate my ‘high byte’ (>0xff) problem). Any thoughts on Unicode normalization as mentioned here?

I was mistaken: testing high byte output with

    # ids 0..2 are the special tokens; 3..258 are the byte-fallback tokens.
    for tok in range(0, 3 + 256):
        print("'" + tokenizer.decode([tok]) + "'")

and

    // Same range on the llama.cpp side: the 3 special tokens plus the 256 byte tokens.
    for (int i = 0; i < 3 + 256; ++i) {
        std::string str = llama_detokenize_spm(ctx, std::vector<int>(1, i));
        fprintf(stdout, "%s\n", str.c_str());
    }

shows identical output.

@goerch Thank you for looking into it. Things are indeed moving fast and mistakes can happen.

The top priority is having the SPM tokenizer work 100% correctly. After #2810 I thought we had achieved this, but based on your findings, that might not be the case yet. The test-tokenizer-0 test should be expanded with examples where we currently fail. The test-tokenizer-0.py script can be used to generate the ground truth.

I’m comparing the tokenization between the original Meta repo and llama.cpp using LLaMA (I also had the same issue with LLaMA v2).

I believe I understand the compromises the current llama.cpp tokenizer makes between LLaMA compatibility and the ability to process similar LLMs. I also think we can increase compatibility (slightly?), but that will need some more testing for the other LLMs?

Yes, and the upstream tokenizer workarounds are not perfect, so I will give you a method to test the ideal scenario. Our UI strips a space at the end of a generation because most models cannot handle it well if it is already there (their tokens often have spaces in front of them), so we let the model decide whether there should be a space or not.

So imagine a sentence that does not end with a space, such as “This is an example.” in a UI that lets users add to that text. The correct scenario would be the generation " It works very well.", which combined forms “This is an example. It works very well.”

The original LLaMA tokenizer, at least on HF, doesn’t do this, because they trained sentencepiece in a mode where it swallows the first space. So you’d get “It works very well.”, resulting in “This is an example.It works very well.”

Likewise, you also don’t want to always insert the space if you don’t have to: “My favorite sport is foot” should auto-complete to “ball and I play it every week”, not " ball and I play it every week".

Adhering to that behavior would make it usable for continuous generation use cases because then the text flows naturally and the tokenizer picks the most appropriate token for the end result, rather than never picking a space or always picking a space.

On the Hugging Face side we force the tokenizer’s hand: we insert a random token such as a comma at the front of the input, and then after it is tokenized we remove that first fake token, which results in proper continuous generation. The behavior we typically observed from llama.cpp is very similar to that workaround. We originally implemented workarounds like this because llama.cpp did a better job, and it has helped us in the past to identify when things were wrong.
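A minimal sketch of that fake-first-token workaround with the Hugging Face tokenizer API; the function name, the choice of comma, and the model id are illustrative assumptions, not the actual implementation:

    from transformers import AutoTokenizer

    def encode_continuation(tok, text: str) -> list[int]:
        # Prepend a throwaway comma so the tokenizer's automatic leading space
        # attaches to it instead of to the user text, then drop that fake token.
        ids = tok.encode("," + text, add_special_tokens=False)
        return ids[1:]

    # Example usage (model id is an assumption):
    # tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    # encode_continuation(tok, " It works very well.")  # keeps the leading space
    # encode_continuation(tok, "ball and I play it")    # no spurious leading space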

@viniciusarruda and others; nice find, indeed! Now that I’ve learned a bit about the tokenizer: could this kind of problem occur not only with LLaMA but also with derived models using different tokenizers (w.r.t. vocabulary, for example)? But the common theme here is sentencepiece compatibility? You made a big step with your notebook (which I have already used for testing, thanks!); could you advise us regarding our future testing strategy?

because Tokenizer.tokenize() always adds a space in front of the string passed in.

29871 is the space
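A small sketch tying these two observations together with sentencepiece (the tokenizer.model path is an assumption):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # path is an assumption
    print(sp.id_to_piece(29871))      # '▁' (U+2581), the space piece
    print(sp.encode("Hello world"))   # [15043, 3186]: the implicit space merges into '▁Hello'
    print(sp.encode(" Hello world"))  # [29871, 15043, 3186]: the extra space becomes token 29871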

There are known problems with the tokenizer,