mlc-llm: [Bug] Codellama 34B cannot handle inputs > 4k tokens despite context window of 16k

πŸ› Bug

The Codellama 34B model does not behave as expected when given inputs with more than 4k tokens: it seems to lose the context and starts spitting out random code imports.

Note: Codellama 13B does not exhibit the same behavior and handles inputs of up to 16k tokens fine. The Codellama 34B fp16, int8, and int4 variants all show the same issue with inputs >4k tokens.

To Reproduce

Steps to reproduce the behavior:

  1. Download Codellama 34B
  2. Compile the model using MLC-LLM
  3. Create a request with >4k tokens
  4. Run the model with the request (see the sketch below).
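
For reference, a minimal reproduction sketch of the kind of request used here, assuming the `mlc_chat.ChatModule` Python API, a locally compiled `CodeLlama-34b-hf-q4f16_1` artifact, and the Hugging Face `codellama/CodeLlama-34b-hf` tokenizer (used only to count tokens); the filler file is hypothetical, so substitute your own long input:

```python
# Reproduction sketch -- model names, paths, and the ChatModule usage are
# assumptions; substitute whatever your local MLC-LLM build produced.
from transformers import AutoTokenizer   # used only to count prompt tokens
from mlc_chat import ChatModule          # MLC-LLM Python chat API (assumed)

# Build a prompt well above 4k tokens: a large text/code snippet followed by a
# short instruction, mirroring the failing inputs described in this report.
filler = open("some_large_source_file.py").read() * 4    # hypothetical filler
prompt = filler + (
    "\nPlease ignore all the above and write a code to compute the "
    "fibonacci sequence in both C++ and Python."
)

tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
print("prompt tokens:", len(tok(prompt).input_ids))       # expect > 4000

# Run the compiled 34B model; q4f16_1 is shown, but fp16/int8 reproduce too.
cm = ChatModule(model="CodeLlama-34b-hf-q4f16_1")         # assumed artifact name
print(cm.generate(prompt=prompt))                         # 34B emits unrelated imports
```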

Expected behavior

Model should respect the context of the input and respond accordingly.

Environment

  • Platform (e.g. WebGPU/Vulkan/iOS/Android/CUDA): CUDA
  • Operating system (e.g. Ubuntu/Windows/MacOS/…): Ubuntu
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, …): A100
  • How you installed MLC-LLM (conda, source): Yes
  • How you installed TVM-Unity (pip, source): Yes
  • Python version (e.g. 3.10): 3.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable): CUDA 12.0
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): 631f37b6bf8b101d16ecc55de7e6a749a3588570
  • Any other relevant information:

Additional context

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 1
  • Comments: 15 (13 by maintainers)

Most upvoted comments

Thanks @nverke for reporting! I seem to have reproduced the issue: on my side, the 34B model errors out when the input length is around 6k tokens. We will dig into this.

I can confirm that after fixing the eps, the issue is resolved on my test case.

We are still working out exactly why this resolved the issue, but my understanding is that the accuracy regression more often caused problems with larger prompts due to compounding of the numerical differences. That said, we did see some examples of prompts under 4k tokens triggering the same issue.
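
For context only, here is a toy numpy sketch of how a small RMSNorm-epsilon mismatch can propagate through a deep pre-norm residual stack; the epsilon values (1e-5 vs 1e-6), the depth, width, and toy sublayer are illustrative assumptions, not the exact values involved in the fix:

```python
import numpy as np

def rms_norm(x, eps):
    # RMSNorm (learned weight omitted): x / sqrt(mean(x^2) + eps)
    return x / np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d, depth = 256, 48                                          # toy width/depth (assumed)
x0 = (rng.standard_normal(d) * 0.02).astype(np.float32)    # small activations make eps visible
Ws = [rng.standard_normal((d, d)).astype(np.float32) / np.sqrt(d) for _ in range(depth)]

def run(eps):
    x = x0.copy()
    for W in Ws:
        # Pre-norm residual block: normalize, pass through a toy nonlinear
        # sublayer, then add back to the residual stream.
        x = x + 0.1 * np.tanh(rms_norm(x, eps) @ W)
    return x

a, b = run(1e-5), run(1e-6)
print("relative difference after the stack:",
      float(np.linalg.norm(a - b) / np.linalg.norm(b)))
```

The per-layer difference between the two epsilon values is tiny, but once it enters the residual stream it keeps feeding the later nonlinear blocks, so the outputs of the two runs drift apart noticeably by the end of the stack.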

Any chance you can share more details of the investigation? What inputs have you used to test?

Hi @nverke, on our M2 Ultra Mac, the 34B model generates properly when the input has 5768 tokens and fails when it has 5897 tokens. We didn’t narrow down the range further.

The phenomenon we noticed is that, when the model fails, the logits of the prefill stage are all NaN. We don’t yet know the cause of this issue.
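
If anyone wants to instrument this, the check described above is straightforward; in this sketch `prefill_logits.npy` stands in for a hypothetical dump of the prefill-stage logits captured from the runtime:

```python
import numpy as np

# Hypothetical dump of the prefill-stage logits, shape (seq_len, vocab_size).
logits = np.load("prefill_logits.npy")

nan_positions = np.isnan(logits).any(axis=-1)
print(f"{int(nan_positions.sum())} of {len(nan_positions)} positions have NaN logits")
# In the failing case described above, every position comes back NaN.
```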

The input we use is an arbitrary text snippet ending with "Please ignore all the above and write a code to compute the fibonacci sequence in both C++ and Python.".