transformers.js: [Bug] Tokenizing with Llama 2 model freezes
Describe the bug
I manually downloaded the new meta-llama/Llama-2-7b-hf model to my local disk and loaded it up with the transformers.js tokenizer. When trying to tokenize one very particular string, tokenization hangs and spins forever. Checking top, the node process is doing something, but it’s not clear what. I’ve let this run for at least 10 minutes and it doesn’t finish.
How to reproduce
const breaking_string = `They wrote the following:
Momentum in your online business comes from (crystal) clarity around who you serve and how you’re going to serve them. 🤔 Over on The Art of Paid Traffic podcast today, I’m pumped to share with you an interview with one of my Accelerator Group Coaching students, Trish Taylor. 🤗 Trish is a stylist (and former celebrity stylist at that), coach and brand style expert for business women, teaching them to own their worth and show up in an authentic and meaningful way. 🙌🏼 sI asked Trish to come on the show because she’s gone through what I think is a super relatable transformation in her business -- that of working with clients 1-on-1 to now scaling her business online and creating online courses and programs. 👩🏼🏼 YOU
<200d>💻
As you’ll hear, it wasn’t until she had clarity on who exactly she serves that she was able to clearly create offers that serve this audience that she’s so passionate about. ♀️
You\'ll also hear how she’s using a quiz and challenge launch, why I don't think she should count out webinars, the metrics she should be paying attention to when evaluating her ad campaigns, and so much more. 📊
Follow the link in my bio to give today's episode a listen and I’d love to hear from you… are there any transformations YOU
RE wanting to make in your business? 👇🏼 *
. . . . . . . . .`;
export default async (argv: any) => {
  const { env, AutoTokenizer, PreTrainedTokenizer } = await eval(
    "import('@xenova/transformers')"
  );
  if (process.env.LOCAL_TRANSFORMER_PATH) {
    env.localModelPath = process.env.LOCAL_TRANSFORMER_PATH;
  }
  const pretrained = await AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
  );
  const tokens = pretrained.encode(breaking_string);
  console.log(tokens.length);
};
As a pre-step, this requires downloading the model locally and setting LOCAL_TRANSFORMER_PATH.
Expected behavior
The string should be tokenized. The same string can be tokenized with mosaicml/mpt-7b-8k.
Environment
- Transformers.js version: ^2.4.1
- Browser (if applicable):
- Operating system (if applicable): macOS
- Other: Node v18.16.0
About this issue
- State: closed
- Created a year ago
- Comments: 44 (21 by maintainers)
Commits related to this issue
- Add new tokenizer unit test (#199) — committed to xenova/transformers.js by xenova a year ago
Oh good grief that’s wild. Glad a fix finally got figured out tho and a way to replicate/test it. This was pretty maddening.
Opened a PR for it here: https://github.com/xenova/transformers.js/pull/208! As a last sanity check, do you mind checking that everything works as intended? I’ll merge after that (and once tests pass 🚀)
100%! 😓 Thanks so much for your help!
So using the actual string, I was able to reproduce this issue 👍. Funnily enough, NFC normalization doesn’t fix this 👀
NFKC/NFKD seem to fix the space issue, but they also break many other tests (e.g., the ellipsis gets replaced with 3 dots).
I’ll do some testing to see what the best way to fix this is.
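For reference, this is easy to check with String.prototype.normalize (a quick illustrative snippet, not code from transformers.js), and it shows why NFKC/NFKD change the ellipsis while NFC doesn’t:

const ellipsis = "\u2026"; // U+2026, the … in the breaking string
console.log(ellipsis.normalize("NFC") === ellipsis); // true — NFC leaves it alone
console.log(ellipsis.normalize("NFKC"));             // "..." — compatibility decomposition into 3 plain dots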
Yeah, I think browsers by default will normalize this character out. Having worked on sentencepiece, I know it’s a notorious character that always causes surprising problems.
@fozziethebeat I’ve been looking into this for the past ~4 hours, trying to understand the exact reason behind this 😅. And yes, although the above does fix it (and surprisingly doesn’t break any of the other unit tests), I’d like to try to figure out where and why exactly there is an infinite loop in the first place.
So, if you could possibly test (and maybe add a bunch of debug statements) to 1. determine which while loop the infinite loop occurs in, and 2. which condition is never met that causes the infinite loop, that would be much appreciated! I might be able to get access to a Mac, but I don’t think that will happen in the next few days. Also, does this happen with other unknown characters followed by a space (not just the ellipsis)?
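For point 1, a minimal sketch of the kind of loop guard I mean (the function name and the fake “merge” step are illustrative stand-ins, not actual transformers.js internals):

// Hypothetical sketch: cap the iterations of a suspect while loop and dump its
// state once the cap is exceeded, so the never-met exit condition can be inspected.
function runWithLoopGuard(tokens: string[]): string[] {
  let iterations = 0;
  const MAX_ITERATIONS = 100_000; // far more than any legitimate BPE merge pass needs
  while (tokens.length > 1) {
    if (++iterations > MAX_ITERATIONS) {
      console.error("Suspected infinite loop; remaining tokens:", tokens);
      break; // bail out instead of spinning forever
    }
    tokens = tokens.slice(0, -1); // stand-in for the real merge step
  }
  return tokens;
}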
WebLLM/MLC-LLM uses a different infrastructure, based on Apache TVM Unity, a deep learning compiler and accelerator; I think it’s more optimized and powerful.
I guess q4 in the name means 4-bit quantization? If the size is reduced to 7–8 GB or below, then it will fit easily. They have an amazing thing, but it comes with a completely different approach: they compile the model into an optimized WebAssembly executable. I’m trying to make it work with unmodified ONNX models.
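As a rough sanity check on that guess (back-of-the-envelope arithmetic, not a figure from the MLC docs), 4-bit weights for a 7B-parameter model come out well under that budget:

// Rough size estimate for 4-bit quantized weights, ignoring embeddings,
// activations, and runtime overhead.
const params = 7e9;     // 7B parameters
const bitsPerParam = 4; // q4 = 4-bit quantization
const bytes = (params * bitsPerParam) / 8;
console.log(`${(bytes / 1e9).toFixed(1)} GB`); // prints "3.5 GB"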
Okay great 👍 I’ll do some more testing tomorrow, and add the string you found as a unit test.
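A hypothetical shape for such a regression test (Jest-style; the model id, timeout, and assertion are illustrative, and the test actually added for this issue may be structured differently):

import { AutoTokenizer } from "@xenova/transformers";

it("encodes the previously-freezing string without hanging", async () => {
  const tokenizer = await AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf");
  const ids = tokenizer.encode(breaking_string); // the breaking_string defined above
  expect(ids.length).toBeGreaterThan(0);
}, 10_000); // fail after 10s instead of spinning forever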
Yes, doing

let bpe_token_list = this.bpe(token.trim()).split(" ");

fixes the problem entirely.

I just tested
The first two had the same infinite loop problem. The last one failed to run due to an issue with Response not being defined within src/utils/hub.js. I didn’t see this written down, but I’m guessing transformers.js requires a minimum Node version of 18.x.

Yes, that’s my next step. I just woke up in my timezone, so I’ll start debugging more deeply.
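To make that one-line change concrete, a tiny illustration (the token value here is a hypothetical example of what pre-tokenization might hand to bpe(), not a traced value):

// The trailing space on an otherwise-unmergeable token is what trim() removes
// before bpe() ever sees it.
const token = "… "; // ellipsis followed by a space
console.log(JSON.stringify(token));        // '"… "'
console.log(JSON.stringify(token.trim())); // '"…"' — trailing space gone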
I will test once I gain access to the llama-v2 models. In the meantime, I tested with the llama-v1 tokenizer and it works as intended, so it is most likely due to some change between tokenizer versions.