tensorrtllm_backend: Unable to load TensorRT LLM
I have tried this using the new Triton Inference Server (trtllm) 23.10 image and the official instructions here: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. I keep getting the same error when loading the model:
```
Using 523 tokens in paged KV cache.
E1028 18:34:33.926976 4434 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq)
```
I followed the instructions exactly and my CUDA version is 12.2. Any help in debugging this would be really appreciated - thanks.
About this issue
- State: closed
- Created 8 months ago
- Comments: 17
`max_tokens_in_paged_kv_cache` is the maximum number of tokens that can be stored in the KV cache. `kv_cache_free_gpu_mem_fraction` controls what fraction of free GPU memory should be used to allocate the KV cache. It is only used if `max_tokens_in_paged_kv_cache` is not specified; the default value is 0.85.
The maximum number of tokens in the KV cache must be greater than `max_beam_width * max_blocks_per_seq * tokens_per_block`, where `max_blocks_per_seq = (max_seq_len + tokens_per_block - 1) / tokens_per_block` and `max_seq_len = max_input_len + max_output_len`. This ensures that there are enough KV cache blocks to generate `max_seq_len` tokens for one sequence.
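To make the arithmetic concrete, here is a minimal sketch of that check in Python, using the input/output lengths from the build command below and assuming `tokens_per_block = 64` (the actual block size depends on how the engine was built and is not shown in this thread):

```python
import math

def min_kv_cache_tokens(max_input_len, max_output_len, tokens_per_block, max_beam_width=1):
    # Smallest max_tokens_in_paged_kv_cache that lets one sequence run to
    # completion, per the formula quoted above.
    max_seq_len = max_input_len + max_output_len
    max_blocks_per_seq = math.ceil(max_seq_len / tokens_per_block)
    return max_beam_width * max_blocks_per_seq * tokens_per_block

# max_input_len=2048 and max_output_len=4096 come from the build command below;
# tokens_per_block=64 is an assumption, not a value shown in this thread.
print(min_kv_cache_tokens(2048, 4096, tokens_per_block=64))  # 6144
```

The log above reports only 523 tokens in the paged KV cache, far below any plausible minimum for a 6144-token sequence, which is why the model instance fails to load.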
@byshiue My conversion script looks like this:

```
build.py --model_version v2_13b --max_input_len=2048 --max_output_len=4096 --max_batch_size 4 --model_dir /models/Baichuan2-13B-Chat/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --remove_input_padding --use_inflight_batching --paged_kv_cache
```

I tried setting max_tokens_in_paged_kv_cache = (2048 + 4096) * 4 = 24576, but the error is still reported.
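For the backend to pick that value up, it has to be set in the `tensorrt_llm` model's config.pbtxt. A minimal sketch of the relevant `parameters` entries, assuming the standard inflight-batcher template shipped with tensorrtllm_backend (field names and defaults can differ between versions):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "24576"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.85"
  }
}
```

If `max_tokens_in_paged_kv_cache` is left unset, the cache is sized from `kv_cache_free_gpu_mem_fraction` of free GPU memory, which may be why the log shows such a small 523-token cache.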
Can you share your end-to-end scripts to help reproduce your issue?