transformers: Tokenizer failing to encode chatml correctly
System Info
transformersversion: 4.31.0- Platform: Linux-5.14.0-284.18.1.el9_2.x86_64-x86_64-with-glibc2.34
- Python version: 3.10.12
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Note: also tested and broken on:
- 641adca
- 4.30.2
- 4.30.1
- 4.30.0
- 4.29.2
- 4.29.1
- 4.29.0
- 4.28.1
- 4.28.0
- 4.27.4
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examplesfolder (such as GLUE/SQuAD, …) - My own task or dataset (give details below)
Reproduction
I’m attempting to finetune Llama2 with a ChatML format. No matter how I approach it, it seems to be failing to encode/decode correctly. I see multiple issues and PRs that are related, but this specific format seems to be hitting all of them with none of the workarounds being effective.
A repro is available here:
https://gist.github.com/ozreact/a4b565cd2c7fac65d6cb76c78dbdf9e2
#24565 recommends setting legacy=false, and further says that this only addresses a subset of issues with the slow tokenizer only. It also mentions that decode isn’t fixed, so validating that the encoding step is working is fiddly.
This format, when newlines are used, is also impacted by #21120.
#25073 also breaks this.
#25176 recommends setting legacy=True to fix an invalid unk token that effectively over-writes a final token in a partial ChatML response, but this conflicts with attempting to fix the issues in #24565.
Expected behavior
ChatML instruction format should ‘just work’, tokenize correctly, and decode correctly.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 16
@winglian please see for axolotl ^
You can also try #25224, should fix it (deals with extra space, unk and decoding extra space)
OK, thank you. I thought it was a full repository 😅