mlc-llm: [Bug] Codellama 34B cannot handle inputs > 4k tokens despite context window of 16k

πŸ› Bug

The Codellama 34B model does not behave as expected when given inputs with more than 4k tokens: it seems to lose the context and starts spitting out random code imports.

Note: Codellama 13B does not exhibit the same behavior and handles inputs of up to 16k tokens fine. The Codellama 34B fp16, int8, and int4 variants all show the same issue with inputs >4k tokens.

To Reproduce

Steps to reproduce the behavior:

  1. Download Codellama 34B
  2. Compile the model using MLC-LLM
  3. Create a request with >4k tokens
  4. Run the model with the request (see the sketch below).
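
For reference, a minimal reproduction sketch of the kind of request used here, assuming the `mlc_chat.ChatModule` Python API, a locally compiled `CodeLlama-34b-hf-q4f16_1` artifact, and the Hugging Face `codellama/CodeLlama-34b-hf` tokenizer (used only to count tokens); the filler file is hypothetical, so substitute your own long input:

```python
# Reproduction sketch -- model names, paths, and the ChatModule usage are
# assumptions; substitute whatever your local MLC-LLM build produced.
from transformers import AutoTokenizer   # used only to count prompt tokens
from mlc_chat import ChatModule          # MLC-LLM Python chat API (assumed)

# Build a prompt well above 4k tokens: a large text/code snippet followed by a
# short instruction, mirroring the failing inputs described in this report.
filler = open("some_large_source_file.py").read() * 4    # hypothetical filler
prompt = filler + (
    "\nPlease ignore all the above and write a code to compute the "
    "fibonacci sequence in both C++ and Python."
)

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
print("prompt tokens:", len(tok(prompt).input_ids))       # expect > 4000

# Run the compiled 34B model; q4f16_1 is shown, but fp16/int8 reproduce too.
cm = ChatModule(model="CodeLlama-34b-hf-q4f16_1")         # assumed artifact name
print(cm.generate(prompt=prompt))                         # 34B emits unrelated imports
```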

Expected behavior

Model should respect the context of the input and respond accordingly.

Environment

  • Platform (e.g. WebGPU/Vulkan/iOS/Android/CUDA): CUDA
  • Operating system (e.g. Ubuntu/Windows/MacOS/…): Ubuntu
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, …): A100
  • How you installed MLC-LLM (conda, source): Yes
  • How you installed TVM-Unity (pip, source): Yes
  • Python version (e.g. 3.10): 3.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable): CUDA 12.0
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): 631f37b6bf8b101d16ecc55de7e6a749a3588570
  • Any other relevant information:

Additional context

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 1
  • Comments: 15 (13 by maintainers)

Most upvoted comments

Thanks @nverke for reporting! I seem to have reproduced the issue: on my side, the 34B model errors out when the input length is around 6k tokens. We will dig into this.

I can confirm that after fixing the eps, the issue is resolved on my test case.

We are still working out exactly why this resolved the issue, but my understanding is that the accuracy regression more often caused problems with larger prompts due to compounding of the numerical differences. That said, we did see some examples of prompts under 4k tokens triggering the same issue.
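
For context only, here is a toy numpy sketch of how a small RMSNorm-epsilon mismatch can propagate through a deep pre-norm residual stack; the epsilon values (1e-5 vs 1e-6), the depth, width, and toy sublayer are illustrative assumptions, not the exact values involved in the fix:

```python
import numpy as np

def rms_norm(x, eps):
    # RMSNorm (learned weight omitted): x / sqrt(mean(x^2) + eps)
    return x / np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d, depth = 256, 48                                          # toy width/depth (assumed)
x0 = (rng.standard_normal(d) * 0.02).astype(np.float32)    # small activations make eps visible
Ws = [rng.standard_normal((d, d)).astype(np.float32) / np.sqrt(d) for _ in range(depth)]

def run(eps):
    x = x0.copy()
    for W in Ws:
        # Pre-norm residual block: normalize, pass through a toy nonlinear
        # sublayer, then add back to the residual stream.
        x = x + 0.1 * np.tanh(rms_norm(x, eps) @ W)
    return x

a, b = run(1e-5), run(1e-6)
print("relative difference after the stack:",
      float(np.linalg.norm(a - b) / np.linalg.norm(b)))
```

The per-layer difference between the two epsilon values is tiny, but once it enters the residual stream it keeps feeding the later nonlinear blocks, so the outputs of the two runs drift apart noticeably by the end of the stack.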

Any chance you can share more details of the investigation? What inputs have you used to test?

Hi @nverke, on our M2 Ultra Mac, the 34B model generates properly when the input has 5768 tokens and fails when it has 5897 tokens. We didn’t narrow down the range further.

The phenomenon we noticed is that, when the model fails, the logits of the prefill stage are all NaN. We don’t yet know the cause of this issue.
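
If anyone wants to instrument this, the check described above is straightforward; in this sketch `prefill_logits.npy` stands in for a hypothetical dump of the prefill-stage logits captured from the runtime:

```python
import numpy as np

# Hypothetical dump of the prefill-stage logits, shape (seq_len, vocab_size).
logits = np.load("prefill_logits.npy")

nan_positions = np.isnan(logits).any(axis=-1)
print(f"{int(nan_positions.sum())} of {len(nan_positions)} positions have NaN logits")
# In the failing case described above, every position comes back NaN.
```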

The input we use is an arbitrary text snippet ending with "Please ignore all the above and write a code to compute the fibonacci sequence in both C++ and Python.".