TensorRT-LLM: Failed to launch Triton for Llama
What does this mean?
ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1024 07:33:56.907690 74602 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1024 07:33:39.624732 74602 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1024 07:33:39.624917 74602 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1024 07:33:40.100045 74602 model_lifecycle.cc:819] successfully loaded 'postprocessing'
I1024 07:33:41.456487 74602 model_lifecycle.cc:819] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13201, GPU 13175 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13202, GPU 13185 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
E1024 07:33:56.907623 74602 backend_model.cc:553] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1024 07:33:56.907690 74602 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1024 07:33:56.907712 74602 model_lifecycle.cc:757] failed to load 'tensorrt_llm'
E1024 07:33:56.907796 74602 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.;
I1024 07:33:56.907861 74602 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1024 07:33:56.908035 74602 server.cc:631]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 18
Hello, I turned off in-flight batching and the server started.
Then I wanted to find a client example file and turned to the
tensorrtllm_backend/tools/inflight_batcher_llm/end_to_end_test.py
file. But it needs a dataset file. Is there an example for it? I think it needs a specific format.
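For reference, here is a minimal sketch of generating such a dataset file. The field names (`instruction`, `input`, `output`) are an assumption based on common alpaca-style JSON datasets, not confirmed by this issue; check the parsing code in `end_to_end_test.py` for the exact schema it expects.

```python
import json

# Hypothetical schema (assumption): a JSON list of alpaca-style records.
# Verify against end_to_end_test.py's own dataset-loading code before use.
dataset = [
    {
        "instruction": "Summarize the following text.",
        "input": "TensorRT-LLM is a library for optimizing LLM inference.",
        "output": "TensorRT-LLM optimizes LLM inference.",
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, world!",
        "output": "Bonjour, le monde !",
    },
]

# Write the dataset to disk so the test script can load it.
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```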
Can you add the commands to reproduce the issue, please?
Based on that message:
I guess you built an engine without the GPT attention plugin, packed input, and paged KV cache, but you are trying to use in-flight batching in Triton.
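As a sketch, the engine would need to be rebuilt with those features enabled. The flag names below follow the `examples/llama/build.py` script from TensorRT-LLM releases of roughly this era and may differ in other versions (newer releases use the `trtllm-build` CLI instead); the paths are placeholders.

```shell
# Rebuild the Llama engine with the features TrtGptModelInflightBatching needs.
# Flag names are an assumption based on examples/llama/build.py of the time;
# check `python build.py --help` in your TensorRT-LLM checkout.
python build.py \
    --model_dir ./llama-hf \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --paged_kv_cache \
    --use_inflight_batching \
    --output_dir ./engines/llama
```

Here `--use_gpt_attention_plugin` provides the GPT attention plugin, `--remove_input_padding` corresponds to "packed input", and `--paged_kv_cache` to the paged KV cache. The alternative, as the reporter did, is to keep the existing engine and disable in-flight batching in the Triton model config (the `gpt_model_type` parameter in the `tensorrt_llm` model's config.pbtxt), but then in-flight batching is unavailable.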