transformers: Tokenizer failing to encode chatml correctly

System Info

transformers version: 4.31.0
Platform: Linux-5.14.0-284.18.1.el9_2.x86_64-x86_64-with-glibc2.34
Python version: 3.10.12
Huggingface_hub version: 0.16.4
Safetensors version: 0.3.1
Accelerate version: 0.21.0
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu118 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Note: also tested and broken on:

641adca
4.30.2
4.30.1
4.30.0
4.29.2
4.29.1
4.29.0
4.28.1
4.28.0
4.27.4

Who can help?

@ArthurZucker @younesbelkada

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

I’m attempting to finetune Llama2 with a ChatML format. No matter how I approach it, it seems to be failing to encode/decode correctly. I see multiple issues and PRs that are related, but this specific format seems to be hitting all of them with none of the workarounds being effective.

A repro is available here:

https://gist.github.com/ozreact/a4b565cd2c7fac65d6cb76c78dbdf9e2

#24565 recommends setting legacy=false, and further says that this only addresses a subset of issues with the slow tokenizer only. It also mentions that decode isn’t fixed, so validating that the encoding step is working is fiddly.

This format, when newlines are used, is also impacted by #21120.

#25073 also breaks this.

#25176 recommends setting legacy=True to fix an invalid unk token that effectively over-writes a final token in a partial ChatML response, but this conflicts with attempting to fix the issues in #24565.

Expected behavior

ChatML instruction format should ‘just work’, tokenize correctly, and decode correctly.

About this issue

Original URL
State: closed
Created a year ago
Reactions: 3
Comments: 16

Most upvoted comments

@winglian please see for axolotl ^

teknium1 on Sep 20, 2023

You can also try #25224, should fix it (deals with extra space, unk and decoding extra space)

ArthurZucker on Aug 4, 2023

OK, thank you. I thought it was a full repository 😅

ydshieh on Aug 4, 2023