TensorRT-LLM: Failed to launch Triton for Llama
What does this mean?
ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1024 07:33:56.907690 74602 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1024 07:33:39.624732 74602 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1024 07:33:39.624917 74602 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1024 07:33:40.100045 74602 model_lifecycle.cc:819] successfully loaded 'postprocessing'
I1024 07:33:41.456487 74602 model_lifecycle.cc:819] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13201, GPU 13175 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13202, GPU 13185 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
E1024 07:33:56.907623 74602 backend_model.cc:553] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1024 07:33:56.907690 74602 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1024 07:33:56.907712 74602 model_lifecycle.cc:757] failed to load 'tensorrt_llm'
E1024 07:33:56.907796 74602 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.;
I1024 07:33:56.907861 74602 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1024 07:33:56.908035 74602 server.cc:631]
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 18
Hello, I turned off in-flight batching and the server started.
Then I wanted to find a client example file and turned to the
tensorrtllm_backend/tools/inflight_batcher_llm/end_to_end_test.py
file. But it needs a dataset file. Is there an example for it? I think it needs a specific format.
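For reference, here is a minimal sketch of generating such a dataset file. The field names (`instruction`, `input`, `output`) are an assumption based on common alpaca-style JSON datasets, not confirmed by this issue; check the parsing code in `end_to_end_test.py` for the exact schema it expects.

```python
import json

# Hypothetical schema (assumption): a JSON list of alpaca-style records.
# Verify against end_to_end_test.py's own dataset-loading code before use.
dataset = [
    {
        "instruction": "Summarize the following text.",
        "input": "TensorRT-LLM is a library for optimizing LLM inference.",
        "output": "TensorRT-LLM optimizes LLM inference.",
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, world!",
        "output": "Bonjour, le monde !",
    },
]

# Write the dataset to disk so the test script can load it.
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```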
Can you add the commands to reproduce the issue, please?
Based on that message:
I guess you built an engine without the GPT attention plugin, packed input, and paged KV cache, but you are trying to use in-flight batching in Triton.
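As a sketch, the engine would need to be rebuilt with those features enabled. The flag names below follow the `examples/llama/build.py` script from TensorRT-LLM releases of roughly this era and may differ in other versions (newer releases use the `trtllm-build` CLI instead); the paths are placeholders.

```shell
# Rebuild the Llama engine with the features TrtGptModelInflightBatching needs.
# Flag names are an assumption based on examples/llama/build.py of the time;
# check `python build.py --help` in your TensorRT-LLM checkout.
python build.py \
    --model_dir ./llama-hf \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --remove_input_padding \
    --paged_kv_cache \
    --use_inflight_batching \
    --output_dir ./engines/llama
```

Here `--use_gpt_attention_plugin` provides the GPT attention plugin, `--remove_input_padding` corresponds to "packed input", and `--paged_kv_cache` to the paged KV cache. The alternative, as the reporter did, is to keep the existing engine and disable in-flight batching in the Triton model config (the `gpt_model_type` parameter in the `tensorrt_llm` model's config.pbtxt), but then in-flight batching is unavailable.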