llama.cpp: MPT: tokenization crashes
Testing against 5974d617 and https://github.com/ggerganov/llama.cpp/pull/3538
While running
bin/main --mlock -m /mnt/f2fs/mpt/ggml-model-mpt-7b-storywriter-f16-q5_1.gguf -t 1 -ngl 999 -p 'Once upon a time' --temp 0.8 --top_p 0.98 -c 2048 --keep -1 --repeat_penalty 1 -n 1024
It consistently crashes after roughly a hundred tokens with this backtrace:
Thread 1 "main" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fffcefae535 in __GI_abort () at abort.c:79
#2 0x00007fffcf376983 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffcf37c8c6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007fffcf37c901 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007fffcf37cb34 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fffcf37886b in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555612915 in std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ")
at /usr/include/c++/8/bits/hashtable_policy.h:760
#8 0x000055555560b295 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned char, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned char> > >::at (this=0x555555b2fa20 <unicode_to_bytes_bpe(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::map>, __k=" ") at /usr/include/c++/8/bits/unordered_map.h:991
#9 0x00005555555cc65a in unicode_to_bytes_bpe (utf8=" ") at /mnt/seagate/dalai/llama.cpp.bak/unicode.h:460
#10 0x00005555555f8398 in llama_decode_text (text=" ") at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9703
#11 0x00005555555f8663 in llama_token_to_piece (model=0x5555a4966c40, token=50276, buf=0x555555e3c6b0 "", length=8) at /mnt/seagate/dalai/llama.cpp.bak/llama.cpp:9746
#12 0x000055555557e0ce in llama_token_to_piece[abi:cxx11](llama_context const*, int) (ctx=0x5555b579e350, token=50276) at /mnt/seagate/dalai/llama.cpp.bak/common/common.cpp:894
#13 0x00005555555b028b in llama_sampling_sample (ctx=0x5555b579e350, ctx_guidance=0x0, ctx_sampling=..., last_tokens=std::vector of length 2048, capacity 2048 = {...},
candidates=std::vector of length 50432, capacity 50432 = {...}, idx=0, seq=0) at /mnt/seagate/dalai/llama.cpp.bak/common/sampling.cpp:151
#14 0x000055555556a2aa in main (argc=22, argv=0x7fffffffdcd8) at /mnt/seagate/dalai/llama.cpp.bak/examples/main/main.cpp:648
(The GGUF/vocab was exported using convert-mpt-hf-to-gguf.py from the aforementioned commit.)
I suspect you’re right, and the convert scripts (not just for MPT, but for any model with a BPE tokenizer) should be updated to classify tokens (based on https://github.com/ggerganov/llama.cpp/pull/3538#issuecomment-1758219448) as:
- CONTROL if found in added_tokens and special: true in tokenizer.json
- USER_DEFINED if found in added_tokens and special: false in tokenizer.json
- UNUSED if artificially added just to make the vocab match model.vocab_size (like the MPT pad tokens)
- NORMAL otherwise
But as it stands, CONTROL is used for all BPE added_tokens in convert.py (regardless of the “special” flag)… and all the other convert scripts just output all tokens as NORMAL…
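For illustration, a rough Python sketch of what that classification could look like in a conversion script. Everything here is hypothetical (the function name, the enum, and the tokenizer.json handling are mine, not code from convert.py); it only encodes the four rules listed above:

import json
from enum import Enum, auto

class TokType(Enum):        # illustrative labels, not the actual GGUF token-type values
    NORMAL = auto()
    CONTROL = auto()
    USER_DEFINED = auto()
    UNUSED = auto()

def classify_token(text, added_special, n_real_tokens, tok_id):
    # padding entries appended only to reach model.vocab_size (e.g. the MPT pad tokens)
    if tok_id >= n_real_tokens:
        return TokType.UNUSED
    if text in added_special:
        # added_tokens entries in tokenizer.json carry a "special" flag
        return TokType.CONTROL if added_special[text] else TokType.USER_DEFINED
    return TokType.NORMAL

with open("tokenizer.json") as f:
    tk = json.load(f)
# map token text -> special flag for everything listed under added_tokens
added_special = {t["content"]: t["special"] for t in tk.get("added_tokens", [])}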
That bit of code is a simple sanity check; currently it is only used to print to the log whether the token labels in the vocab match the manually extracted special tokens.
I still consider this approach of manually extracting special tokens from the vocab to be nothing more than a stable workaround and an eventual fallback solution.
The method I used is quite simple:
- Iterate over the entire vocab and take the string representation of each token.
- For each token string, split it into two substrings at every position (after the 1st byte, 2nd byte, …, strlen-1 byte) and check whether, for any of those splits, both halves have a matching token in the vocab; if not, mark it as a special token.
That’s pretty much it.
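As a rough sketch of that heuristic (my own illustrative Python, not the actual llama.cpp code), assuming vocab maps token text to id; note it splits on characters rather than raw bytes for simplicity:

def find_special_tokens(vocab):
    texts = set(vocab)          # token string representations
    special = set()
    for text in texts:
        if len(text) < 2:
            continue            # single-character tokens cannot be split
        # split after the 1st, 2nd, ..., len-1 character; flag the token as special
        # if no split produces two halves that are both themselves in the vocab
        if not any(text[:i] in texts and text[i:] in texts
                   for i in range(1, len(text))):
            special.add(text)
    return special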
From what I understand, yes, that would be the root cause, and the detokenizer is where it actually becomes a problem.
From the looks of it, #3728 could be applicable.
I can confirm that the code from #3728 no longer crashes.
If you wish to reproduce the crash, you can do so using llama.cpp code from master (22c69a27), but only if you convert the model without USER_DEFINED, just using NORMAL tokens (i.e. using master’s version of the conversion script). Note that code on master does not crash when running with a model converted by #3728.
And interestingly, #3728 does not crash even if only NORMAL tokens are output.
Not anymore, because goerch recently moved all of the vocab conversion to llama.cpp. It is no longer the conversion script’s responsibility. The different parts of the JSON have different ways of encoding the vocabulary - added tokens are stored as-is, whereas normal tokens need bytes_to_unicode/unicode_to_bytes. My PR changed the conversion script to only decode the normal tokens, but we no longer decode any tokens in the conversion script, so that will not work.
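For reference, the byte<->unicode mapping in question is the GPT-2 style one; below is a minimal Python sketch of it (not the exact llama.cpp implementation). The raw space byte 0x20 is one of the bytes that gets remapped, so a token stored verbatim as " " has no entry in the reverse map, which matches the unordered_map::at(" ") failure in the backtrace above.

def bytes_to_unicode():
    # printable byte ranges kept as-is (the same ranges GPT-2 uses)
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # every other byte (including 0x20, space) is shifted
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

byte_encoder = bytes_to_unicode()                        # byte -> unicode char, used when encoding normal tokens
byte_decoder = {v: k for k, v in byte_encoder.items()}   # the reverse map used on decode
assert " " not in byte_decoder                           # a raw space has no entry, so a C++ map.at(" ") would throw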
I believe that in the map.at(utf8) call the input comes directly from the vocabulary, so it should only crash if the code is incorrect (which it is in this case, because we should not call this function on added tokens) or if the GGUF file is corrupt/incorrectly converted. I don’t think silently generating blank output tokens is a good idea in either of those cases.