fastLLaMa: Unicode characters break tokenizer

When ingesting input that contains multi-byte Unicode characters, it prints "failed to tokenize string!" and seems to ignore all tokens prior to those characters. That's bad not only for emoji support, but also for languages like Japanese and Chinese.

I haven’t tried it yet, but I think implementing the fix proposed in this PR for llama.cpp could solve this issue.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 18 (6 by maintainers)

Most upvoted comments

Try the fix/unicode branch now. I fixed it. My approach was simple: I took the last invalid/partial UTF-8 codepoint and prepended it to the next token, and so on. It might look like the token stream is running slowly, but it's not. This approach makes two consecutive tokens depend on each other, so a partial codepoint waits a little longer inside the buffer until it becomes valid.
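
A minimal sketch of that carry-over idea, assuming tokens arrive as raw std::string bytes (the helper names here are mine, not from the fastLLaMa codebase):

```cpp
#include <string>
#include <utility>

// Expected length of a UTF-8 sequence given its lead byte,
// or 0 for a continuation/invalid lead byte.
static size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;           // ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Split `token` into (complete UTF-8 prefix, trailing partial sequence).
// The caller prepends the partial tail to the next token.
static std::pair<std::string, std::string>
split_partial_utf8(const std::string &token) {
    size_t i = token.size();
    size_t back = 0;
    // Walk back over at most 3 continuation bytes (0b10xxxxxx) to find the lead byte.
    while (i > 0 && back < 4 &&
           (static_cast<unsigned char>(token[i - 1]) & 0xC0) == 0x80) {
        --i; ++back;
    }
    if (i == 0) return {std::string(), token}; // only continuation bytes so far
    size_t need = utf8_seq_len(static_cast<unsigned char>(token[i - 1]));
    if (need == 0 || i - 1 + need <= token.size()) {
        // Last sequence is complete (or invalid anyway); emit everything.
        return {token, std::string()};
    }
    // Last sequence is truncated: hold it back for the next token.
    return {token.substr(0, i - 1), token.substr(i - 1)};
}
```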

The fix I'm thinking of requires me to fix the buffer inside bridge.cpp. The buffer will wait for the character to become valid. That should make this monstrosity obsolete and make the Python side much simpler. I'll try to fix it by tomorrow.
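
A minimal sketch of that buffered approach on the C++ side, assuming a callback-based streaming interface (the class and callback are hypothetical, not fastLLaMa's actual API; split_partial_utf8 is the helper from the previous sketch):

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical buffer living in bridge.cpp: it accumulates raw token bytes
// and only forwards complete UTF-8 characters to the user callback, so the
// Python side never sees partial codepoints.
class Utf8StreamBuffer {
public:
    explicit Utf8StreamBuffer(std::function<void(const std::string&)> on_text)
        : m_on_text(std::move(on_text)) {}

    void push(const std::string &token_bytes) {
        m_pending += token_bytes;
        // split_partial_utf8 as defined in the sketch above.
        auto [complete, partial] = split_partial_utf8(m_pending);
        if (!complete.empty()) m_on_text(complete);
        m_pending = partial;
    }

    // Flush whatever remains at the end of generation (may be an invalid tail).
    void finish() {
        if (!m_pending.empty()) m_on_text(m_pending);
        m_pending.clear();
    }

private:
    std::function<void(const std::string&)> m_on_text;
    std::string m_pending;
};
```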