transformers.js: [Bug] Tokenizing with Llama 2 model freezes
Describe the bug
I manually downloaded the new meta-llama/Llama-2-7b-hf model to my local disk and loaded it up with the transformers.js tokenizer. When trying to tokenize one very particular string, tokenization hangs and spins forever. Checking top, the node process is doing something, but it’s not clear what. I’ve let this run for at least 10 minutes and it doesn’t finish.
How to reproduce
const breaking_string = `They wrote the following:
Momentum in your online business comes from (crystal) clarity around who you serve and how you’re going to serve them. 🤔 Over on The Art of Paid Traffic podcast today, I’m pumped to share with you an interview with one of my Accelerator Group Coaching students, Trish Taylor. 🤗 Trish is a stylist (and former celebrity stylist at that), coach and brand style expert for business women, teaching them to own their worth and show up in an authentic and meaningful way. 🙌🏼 sI asked Trish to come on the show because she’s gone through what I think is a super relatable transformation in her business -- that of working with clients 1-on-1 to now scaling her business online and creating online courses and programs. 👩🏼🏼 YOU
<200d>💻
As you’ll hear, it wasn’t until she had clarity on who exactly she serves that she was able to clearly create offers that serve this audience that she’s so passionate about. ♀️
You\'ll also hear how she’s using a quiz and challenge launch, why I don't think she should count out webinars, the metrics she should be paying attention to when evaluating her ad campaigns, and so much more. 📊
Follow the link in my bio to give today's episode a listen and I’d love to hear from you… are there any transformations YOU
RE wanting to make in your business? 👇🏼 *
. . . . . . . . .`;
export default async (argv: any) => {
  const { env, AutoTokenizer, PreTrainedTokenizer } = await eval(
    "import('@xenova/transformers')"
  );
  if (process.env.LOCAL_TRANSFORMER_PATH) {
    env.localModelPath = process.env.LOCAL_TRANSFORMER_PATH;
  }
  const pretrained = await AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
  );
  const tokens = pretrained.encode(breaking_string);
  console.log(tokens.length);
};
As a pre-step, this requires downloading the model locally and setting LOCAL_TRANSFORMER_PATH.
Expected behavior
The string should be tokenized. The same string can be tokenized with mosaicml/mpt-7b-8k.
Environment
- Transformers.js version: ^2.4.1
- Browser (if applicable):
- Operating system (if applicable): macOS
- Other: Node v18.16.0
About this issue
- State: closed
- Created a year ago
- Comments: 44 (21 by maintainers)
Commits related to this issue
- Add new tokenizer unit test (#199) — committed to xenova/transformers.js by xenova a year ago
Oh good grief that’s wild. Glad a fix finally got figured out tho and a way to replicate/test it. This was pretty maddening.
Opened a PR for it here: https://github.com/xenova/transformers.js/pull/208! As a last sanity check, do you mind checking that everything works as intended? I’ll merge after that (and once tests pass 🚀)
100%! 😓 Thanks so much for your help!
So using the actual string, I was able to reproduce this issue 👍. Funnily enough, NFC normalization doesn’t fix this 👀
NFKC/NFKD seem to fix the space issue, but they also break many other tests (e.g., the ellipsis gets replaced with 3 dots).
I’ll do some testing to see what the best way to fix this is.
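For reference, this is easy to check with String.prototype.normalize (a quick illustrative snippet, not code from transformers.js), and it shows why NFKC/NFKD change the ellipsis while NFC doesn’t:

const ellipsis = "\u2026"; // U+2026, the … in the breaking string
console.log(ellipsis.normalize("NFC") === ellipsis); // true — NFC leaves it alone
console.log(ellipsis.normalize("NFKC"));             // "..." — compatibility decomposition into 3 plain dots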
Yeah, I think browsers by default will normalize this character out. Having worked on sentencepiece, I know it’s a notorious character that always causes surprising problems.
@fozziethebeat I’ve been looking into this for the past ~4 hours, trying to understand the exact reason behind this 😅. And yes, although the above does fix it (and surprisingly doesn’t break any of the other unit tests), I’d like to try to figure out where and why exactly there is an infinite loop in the first place.
So, if you could possibly test (and maybe add a bunch of debug statements) to 1. determine which while loop the infinite loop occurs in, and 2. which condition is never met that causes the infinite loop, that would be much appreciated! I might be able to get access to a Mac, but I don’t think that will happen in the next few days. Also, does this happen with other unknown characters followed by a space (not just the ellipsis)?
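For point 1, a minimal sketch of the kind of loop guard I mean (the function name and the fake “merge” step are illustrative stand-ins, not actual transformers.js internals):

// Hypothetical sketch: cap the iterations of a suspect while loop and dump its
// state once the cap is exceeded, so the never-met exit condition can be inspected.
function runWithLoopGuard(tokens: string[]): string[] {
  let iterations = 0;
  const MAX_ITERATIONS = 100_000; // far more than any legitimate BPE merge pass needs
  while (tokens.length > 1) {
    if (++iterations > MAX_ITERATIONS) {
      console.error("Suspected infinite loop; remaining tokens:", tokens);
      break; // bail out instead of spinning forever
    }
    tokens = tokens.slice(0, -1); // stand-in for the real merge step
  }
  return tokens;
}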
WebLLM/MLC-LLM uses a different infrastructure, based on Apache TVM Unity, a deep learning compiler and accelerator; I think it’s more optimized and powerful.
I guess q4 in the name means 4-bit quantization? If the size is reduced to 7–8 GB or below, then it will fit easily. They have an amazing thing, but it comes with a completely different approach: they compile the model into an optimized WebAssembly executable. I’m trying to make it work with unmodified ONNX models.
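As a rough sanity check on that guess (back-of-the-envelope arithmetic, not a figure from the MLC docs), 4-bit weights for a 7B-parameter model come out well under that budget:

// Rough size estimate for 4-bit quantized weights, ignoring embeddings,
// activations, and runtime overhead.
const params = 7e9;     // 7B parameters
const bitsPerParam = 4; // q4 = 4-bit quantization
const bytes = (params * bitsPerParam) / 8;
console.log(`${(bytes / 1e9).toFixed(1)} GB`); // prints "3.5 GB"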
Okay great 👍 I’ll do some more testing tomorrow, and add the string you found as a unit test.
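A hypothetical shape for such a regression test (Jest-style; the model id, timeout, and assertion are illustrative, and the test actually added for this issue may be structured differently):

import { AutoTokenizer } from "@xenova/transformers";

it("encodes the previously-freezing string without hanging", async () => {
  const tokenizer = await AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf");
  const ids = tokenizer.encode(breaking_string); // the breaking_string defined above
  expect(ids.length).toBeGreaterThan(0);
}, 10_000); // fail after 10s instead of spinning forever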
Yes, doing

let bpe_token_list = this.bpe(token.trim()).split(" ");

fixes the problem entirely.

I just tested
The first two had the same infinite loop problem. The last one failed to run due to an issue with Response not being defined within src/utils/hub.js. I didn’t see this written down, but I’m guessing transformers.js requires a minimum Node version of 18.x.

Yes, that’s my next step. I just woke up in my timezone, so I’ll start debugging more deeply.
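To make that one-line change concrete, a tiny illustration (the token value here is a hypothetical example of what pre-tokenization might hand to bpe(), not a traced value):

// The trailing space on an otherwise-unmergeable token is what trim() removes
// before bpe() ever sees it.
const token = "… "; // ellipsis followed by a space
console.log(JSON.stringify(token));        // '"… "'
console.log(JSON.stringify(token.trim())); // '"…"' — trailing space gone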
I will test once I gain access to the llama-v2 models. In the meantime, I tested with the llama-v1 tokenizer and it works as intended, so it is most likely due to some change between tokenizer versions.