tensorrtllm_backend: Unable to load TensorRT LLM
I have tried this using the new Triton Inference Server (trtllm) 23.10 image and the official instructions here: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. I keep getting the same error when loading the model:
```
Using 523 tokens in paged KV cache.
E1028 18:34:33.926976 4434 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq)
```
I followed the instructions exactly and my CUDA version is 12.2. Any help in debugging this would be really appreciated - thanks.
About this issue
- State: closed
- Created 8 months ago
- Comments: 17
`max_tokens_in_paged_kv_cache` is the maximum number of tokens that can be stored in the KV cache. `kv_cache_free_gpu_mem_fraction` controls what fraction of free GPU memory should be used to allocate the KV cache. It is only used if `max_tokens_in_paged_kv_cache` is not specified; the default value is 0.85.
The maximum number of tokens in the KV cache must be greater than `max_beam_width * max_blocks_per_seq * tokens_per_block`, where `max_blocks_per_seq = (max_seq_len + tokens_per_block - 1) / tokens_per_block` and `max_seq_len = max_input_len + max_output_len`. This ensures that there are enough KV cache blocks to generate `max_seq_len` tokens for one sequence.
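To make the arithmetic concrete, here is a minimal sketch of that check in Python, using the input/output lengths from the build command below and assuming `tokens_per_block = 64` (the actual block size depends on how the engine was built and is not shown in this thread):

```python
import math

def min_kv_cache_tokens(max_input_len, max_output_len, tokens_per_block, max_beam_width=1):
    # Smallest max_tokens_in_paged_kv_cache that lets one sequence run to
    # completion, per the formula quoted above.
    max_seq_len = max_input_len + max_output_len
    max_blocks_per_seq = math.ceil(max_seq_len / tokens_per_block)
    return max_beam_width * max_blocks_per_seq * tokens_per_block

# max_input_len=2048 and max_output_len=4096 come from the build command below;
# tokens_per_block=64 is an assumption, not a value shown in this thread.
print(min_kv_cache_tokens(2048, 4096, tokens_per_block=64))  # 6144
```

The log above reports only 523 tokens in the paged KV cache, far below any plausible minimum for a 6144-token sequence, which is why the model instance fails to load.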
@byshiue My conversion script looks like this:

```
build.py --model_version v2_13b --max_input_len=2048 --max_output_len=4096 --max_batch_size 4 --model_dir /models/Baichuan2-13B-Chat/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --remove_input_padding --use_inflight_batching --paged_kv_cache
```

I tried setting max_tokens_in_paged_kv_cache = (2048 + 4096) * 4 = 24576, but the error is still reported.
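For the backend to pick that value up, it has to be set in the `tensorrt_llm` model's config.pbtxt. A minimal sketch of the relevant `parameters` entries, assuming the standard inflight-batcher template shipped with tensorrtllm_backend (field names and defaults can differ between versions):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "24576"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.85"
  }
}
```

If `max_tokens_in_paged_kv_cache` is left unset, the cache is sized from `kv_cache_free_gpu_mem_fraction` of free GPU memory, which may be why the log shows such a small 523-token cache.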
Can you share your end-to-end scripts to help reproduce your issue?