llama.cpp: MPT: tokenization crashes

Testing against 5974d617 and https://github.com/ggerganov/llama.cpp/pull/3538

While running

bin/main --mlock -m /mnt/f2fs/mpt/ggml-model-mpt-7b-storywriter-f16-q5_1.gguf -t 1 -ngl 999 -p 'Once upon a time' --temp 0.8 --top_p 0.98 -c 2048 --keep -1 --repeat_penalty 1 -n 1024

It consistently crashes after a few (~hundred) tokens with this backtrace:

Thread 1 "main" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffcefae535 in __GI_abort () at abort.c:79
#2  0x00007fffcf376983 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffcf37c8c6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffcf37c901 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffcf37cb34 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffcf37886b in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000555555612915 in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ")
    at /usr/include/c++/8/bits/hashtable_policy.h:760
#8  0x000055555560b295 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned char, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> > >::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ") at /usr/include/c++/8/bits/unordered_map.h:991
#9  0x00005555555cc65a in unicode_to_bytes_bpe (utf8=" ") at /mnt/seagate/dalai/llama.cpp.bak/unicode.h:460
#10 0x00005555555f8398 in llama_decode_text (text="  ") at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9703
#11 0x00005555555f8663 in llama_token_to_piece (model=0x5555a4966c40, token=50276, buf=0x555555e3c6b0 "", length=8) at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9746
#12 0x000055555557e0ce in llama_token_to_piece[abi:cxx11](llama_context const*, int) (ctx=0x5555b579e350, token=50276) at /mnt/seagate/dalai/llama.cpp.bak/common/common.cpp:894
#13 0x00005555555b028b in llama_sampling_sample (ctx=0x5555b579e350, ctx_guidance=0x0, ctx_sampling=..., last_tokens=std::vector of length 2048, capacity 2048 = {...}, 
    candidates=std::vector of length 50432, capacity 50432 = {...}, idx=0, seq=0) at /mnt/seagate/dalai/llama.cpp.bak/common/sampling.cpp:151
#14 0x000055555556a2aa in main (argc=22, argv=0x7fffffffdcd8) at /mnt/seagate/dalai/llama.cpp.bak/examples/main/main.cpp:648

(The GGUF/vocab was exported using convert-mpt-hf-to-gguf.py from the aforementioned commit.)

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 31 (7 by maintainers)

Most upvoted comments

I was testing something else when I noticed that the entire vocab for mpt-7b-storywriter is marked as LLAMA_TOKEN_TYPE_NORMAL when it is loaded in llm_load_vocab. Just to be sure, I re-converted (and quantized) again with current master (11dc109), and the result is the same.

So if all (BPE) tokens are marked as normal in the .gguf, how can we tell which tokens should get the bytes_to_unicode/unicode_to_bytes treatment?

In the case of mosaicml/mpt-7b-storywriter, special/added tokens are only listed in tokenizer.json, but during the conversion that JSON file seems to be used (in gguf.py) only to get the IDs for special_token_types: tuple[str, ...] = ('bos', 'eos', 'unk', 'sep', 'pad').

I’m not sure if I’m missing something or I got this wrong.

I suspect you’re right, and the convert scripts (not just for MPT, but for any model with a BPE tokenizer) should be updated to classify tokens (based on https://github.com/ggerganov/llama.cpp/pull/3538#issuecomment-1758219448) as:

  • CONTROL if found in added_tokens with special: true in tokenizer.json,
  • USER_DEFINED if found in added_tokens with special: false in tokenizer.json,
  • UNUSED if artificially added just to make the vocab match model.vocab_size (like the MPT pad tokens),
  • NORMAL otherwise.

But as it stands, CONTROL is used for all BPE added_tokens in convert.py (regardless of the “special” flag), and all the other convert scripts just output all tokens as NORMAL. A rough sketch of the proposed classification is shown below.
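The following Python sketch is illustrative only, not the actual convert script: it assumes tokenizer.json uses the usual Hugging Face added_tokens entries with a "special" flag, and the enum values are meant to follow llama.cpp's llama_token_type ordering.

import json
from enum import IntEnum

class TokenType(IntEnum):
    NORMAL = 1
    UNKNOWN = 2
    CONTROL = 3
    USER_DEFINED = 4
    UNUSED = 5
    BYTE = 6

with open("tokenizer.json") as f:
    tokenizer_json = json.load(f)

# id -> entry for every added token listed in tokenizer.json
added = {t["id"]: t for t in tokenizer_json.get("added_tokens", [])}

def classify(token_id: int, tokenizer_vocab_len: int) -> TokenType:
    if token_id >= tokenizer_vocab_len:
        # padding entries added only so the vocab matches model.vocab_size
        return TokenType.UNUSED
    entry = added.get(token_id)
    if entry is None:
        return TokenType.NORMAL
    return TokenType.CONTROL if entry.get("special", False) else TokenType.USER_DEFINED

Under this scheme, convert.py's current behaviour would correspond to always returning CONTROL for anything in added_tokens, regardless of the special flag.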

From what I understand, yes, that would be the source cause, and detokenizer is where it actually becomes a problem.

I looked at this code and have doubts about whether we should count CONTROL and BYTE tokens as special tokens, for example. No need to change this yet, just a heads-up. I’ll have to understand which tokens you add to the special tokens cache.

That bit of code is a simple sanity check, and is currently only used to print to the log whether the token labels in the vocab match the manually extracted special tokens.

I still consider this approach of manually extracting special tokens from the vocab to be nothing more than a stable workaround and an eventual fallback solution.

The method I used is quite simple:

  • Iterate over the entire vocab and, for each token, take its string representation.
  • Split that string representation in two at every position (after the 1st byte, 2nd byte, …, strlen-1 byte) and check whether, for any of those splits, both halves have a matching token in the vocab. If no split does, mark the token as special.

That’s pretty much it; a rough sketch of the heuristic is given below.
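A minimal Python sketch of that heuristic (names are illustrative, not llama.cpp's; it splits at character rather than byte granularity for simplicity, and single-character tokens are skipped since they have no split):

def find_special_tokens(vocab: dict) -> set:
    # vocab maps token text -> token id
    texts = set(vocab)
    special = set()
    for text, tok_id in vocab.items():
        if len(text) < 2:
            continue  # no possible split point
        # can the text be rebuilt from two other vocab entries at any split point?
        decomposable = any(text[:i] in texts and text[i:] in texts
                           for i in range(1, len(text)))
        if not decomposable:
            special.add(tok_id)
    return special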

The problem (this particular one) is still with the detokenizer rather than the tokenizer.

Or with token classification in the conversion?

From what I understand, yes, that would be the source cause, and detokenizer is where it actually becomes a problem.

That seems to be the case

From the looks of it #3728 could be applicable.

With that particular model, you sometimes have to let it spin for quite a while to trigger the problem.

Thanks, the original report states:

It consistently crashes after a few (~hundred) tokens with this backtrace…

and I let it run for the requested 1024 tokens. I'm not sure how many times I should repeat that?

Edit: BTW the patch should make sure that added tokens are USER_DEFINED in the detokenizer.

I can confirm that the code from #3728 no longer crashes.

If you wish to reproduce the crash, you can do so using llama.cpp code from master (22c69a27), but only if you convert the model without USER_DEFINED, just using NORMAL tokens (i.e. using master’s version of the conversion script). Note that code on master does not crash when running with a model converted by #3728.

And interestingly, #3728 does not crash even if only NORMAL tokens are output.

Tracing it even further back, wouldn’t that mean convert-mpt-hf-to-gguf.py imports the tokens (their “text”) incorrectly?

Not anymore, because goerch recently moved all of the vocab conversion to llama.cpp, so it is no longer the conversion script’s responsibility. The different parts of the JSON encode the vocabulary in different ways: added tokens are stored as-is, whereas normal tokens need the bytes_to_unicode/unicode_to_bytes treatment. My PR changed the conversion script to only decode the normal tokens, but since we no longer decode any tokens in the conversion script, that will not work.
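For reference, here is a Python sketch of the GPT-2-style byte-to-unicode scheme being referred to (the llama.cpp implementation in unicode.h differs in detail, but follows the same idea): every printable Latin-1 byte maps to itself, and every other byte, including a plain space, is shifted to an unused code point, so a vocab entry that contains raw spaces can never be produced by this mapping.

def bytes_to_unicode() -> dict:
    # printable bytes map to themselves ...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ... and the remaining bytes (controls, space, DEL, ...) are moved to >= 256
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# the space byte 0x20 is encoded as "Ġ" (U+0120), never as a literal " "
assert bytes_to_unicode()[0x20] == "Ġ"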

I believe in the map.at(utf8) call, the input comes directly from the vocabulary, so it should only crash if the code is incorrect (which it is in this case, because we should not call this function on added tokens) or if the GGUF file is corrupt/incorrectly converted. I don’t think silently generating blank output tokens is a good idea in either of those cases.
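To make that concrete, here is a Python analogue of the failing path, reusing the bytes_to_unicode sketch above (token types and function names are illustrative, not llama.cpp's): a NORMAL token is decoded through the reverse map, while an added token stored verbatim must bypass it, because a raw space has no entry in that map, which is the analogue of the std::out_of_range that map.at throws in the backtrace (__k=" ").

unicode_to_bytes = {v: k for k, v in bytes_to_unicode().items()}

def token_to_piece(text: str, token_type: str) -> bytes:
    if token_type != "NORMAL":
        # added tokens (CONTROL / USER_DEFINED) are stored as-is in the vocab
        # and must not go through the byte decoding
        return text.encode("utf-8")
    return bytes(unicode_to_bytes[ch] for ch in text)

token_to_piece("ĠĠ", "NORMAL")        # b"  "  (a byte-encoded normal token)
token_to_piece("  ", "USER_DEFINED")  # b"  "  (an added token, returned verbatim)
# token_to_piece("  ", "NORMAL")      # KeyError: ' '  (the analogue of the crash)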