tensorrtllm_backend: Unable to load TensorRT LLM

I tried this using the new Triton Inference Server 23.10 (trtllm) image and the official instructions here: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. I keep getting the same error when loading the model:

```
Using 523 tokens in paged KV cache.
E1028 18:34:33.926976 4434 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq)
```

I followed the instructions exactly, and my CUDA version is 12.2. Any help debugging this would be much appreciated. Thanks!

Most upvoted comments

max_tokens_in_paged_kv_cache is the maximum number of tokens that can be stored in the paged KV cache. kv_cache_free_gpu_mem_fraction controls what fraction of free GPU memory is used to size the KV cache; it is only consulted when max_tokens_in_paged_kv_cache is not specified, and its default value is 0.85.
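For context, both settings are exposed as parameters in the tensorrt_llm model's config.pbtxt in the Triton model repository. A minimal excerpt; the path and the values shown are illustrative, not taken from this issue:

```
# model_repo/tensorrt_llm/config.pbtxt (illustrative path and values)
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "24576" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.85" }
}
```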

The maximum number of tokens in the paged KV cache must be greater than max_beam_width * max_blocks_per_seq * tokens_per_block, where max_blocks_per_seq = (max_seq_len + tokens_per_block - 1) / tokens_per_block and max_seq_len = max_input_len + max_output_len. This ensures there are enough KV cache blocks to generate max_seq_len tokens for at least one sequence.
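A minimal sketch of this check in Python; the tokens_per_block default of 64 and beam width of 1 are assumptions here, so verify them against your own build settings:

```python
import math

def min_kv_cache_tokens(max_input_len: int, max_output_len: int,
                        max_beam_width: int = 1, tokens_per_block: int = 64) -> int:
    """Smallest KV cache size (in tokens) that lets one sequence run to completion."""
    max_seq_len = max_input_len + max_output_len
    # Ceiling division: blocks needed to hold one full-length sequence.
    max_blocks_per_seq = math.ceil(max_seq_len / tokens_per_block)
    return max_beam_width * max_blocks_per_seq * tokens_per_block

# max_tokens_in_paged_kv_cache must be strictly greater than this:
print(min_kv_cache_tokens(2048, 4096))  # 6144 with the assumed defaults
```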

@byshiue My build command looks like this:

```
python build.py --model_version v2_13b --max_input_len=2048 --max_output_len=4096 \
    --max_batch_size 4 --model_dir /models/Baichuan2-13B-Chat/ --dtype float16 \
    --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 \
    --remove_input_padding --use_inflight_batching --paged_kv_cache
```

I tried setting max_tokens_in_paged_kv_cache = (2048 + 4096) * 4 = 24576, but the error is still reported.

(screenshot of the error attached)
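Plugging the numbers from that build command into the formula above, assuming the default tokens_per_block of 64 and beam width of 1 (neither appears in the command, so these are assumptions):

```python
import math

# Values from the build command above; tokens_per_block and beam width are
# assumed defaults (64 and 1), since neither appears in the command.
max_input_len, max_output_len = 2048, 4096
tokens_per_block, max_beam_width = 64, 1

max_seq_len = max_input_len + max_output_len                       # 6144
max_blocks_per_seq = math.ceil(max_seq_len / tokens_per_block)     # 96
required = max_beam_width * max_blocks_per_seq * tokens_per_block  # 6144

print(24576 > required)  # True
```

So 24576 clears the bound under those assumptions. Since the startup log still reports "Using 523 tokens in paged KV cache", the configured value may not be reaching the backend at all, which is worth verifying in the tensorrt_llm model's config.pbtxt.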

Is this tested?

Can you share your end-to-end scripts to help reproduce your issue?