TensorRT-LLM: Unable to build Llama-2-13B-Chat on RTX 4070 Ti
System Info
- GPU: NVIDIA GeForce RTX 4070 Ti
- CPU: 13th Gen Intel® Core™ i5-13600KF
- RAM: 32 GB
- Storage: 1 TB SSD
- OS: Windows 11
Package versions:
- TensorRT: 9.2.0.post12.dev5
- CUDA: 12.2
- cuDNN: 8.9.7
- Python: 3.10.11
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
1. Download the Llama-2-13b-chat-hf model from https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
2. Download the AWQ weights (model.pt) used to build the TensorRT engine from https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.2
3. Build the engine on a single GPU:

```
python build.py --model_dir .\tmp --quant_ckpt_path .\tmp\model.pt --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir .\tmp\out
```
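
Not part of the project's instructions, just a sanity check I run on my side before launching build.py: a minimal sketch (assuming a CUDA-enabled PyTorch install in the same environment) that reports how much VRAM is actually free, using torch.cuda.mem_get_info.

```python
# Pre-build VRAM check -- my own addition, not part of the documented steps.
# Assumes a CUDA-enabled PyTorch build is available in this environment.
import torch

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info(0)  # (free, total) in bytes for GPU 0
    print(torch.cuda.get_device_name(0))
    print(f"free:  {free_b / 1024**3:.2f} GiB")
    print(f"total: {total_b / 1024**3:.2f} GiB")
else:
    print("PyTorch cannot see a CUDA device")
```

On a 4070 Ti the total should be around 12 GiB, so anything else holding VRAM before the build starts eats directly into that.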
Expected behavior
Build a TensorRT engine for the RTX 4070 Ti.
Actual behavior
The build terminates after reporting a memory allocation issue:
Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
Output:
```
[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/lm_head/CONSTANT_2_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor logits, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Detected layernorm nodes in FP16.
[02/04/2024-20:42:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[02/04/2024-20:42:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 29820, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +14, GPU +0, now: CPU 29839, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.8.1
[02/04/2024-20:42:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/04/2024-20:42:02] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[02/04/2024-20:42:02] [TRT] [W] Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
```
Additional notes
I have a somewhat limited understanding of what I'm doing here. I'm trying to run the developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM, following these instructions: https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0?tab=readme-ov-file#building-trt-engine

I'm not sure whether I'm simply pushing the limits of my hardware, or whether there are parameters I can tweak to process this in smaller pieces and avoid the memory issue. I tried different --max_input_len and --max_output_len values, reducing them as far as 512, but that doesn't seem to make any difference.
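
To get a feel for whether 12 GB can even hold this, here is a rough back-of-the-envelope sketch (my own arithmetic, not from the TensorRT-LLM docs): the layer count and hidden size come from the published Llama-2-13B config, the KV cache is assumed to be FP16 and sized for max_input_len + max_output_len tokens, and activations plus any build-time scratch space are ignored.

```python
# Rough, unofficial estimate of runtime memory for Llama-2-13B with INT4 AWQ
# weights and an FP16 KV cache. Ignores activations and builder scratch space.
n_params    = 13e9         # approximate Llama-2-13B parameter count
w_bytes     = 0.5          # ~0.5 byte/param for INT4 weight-only quantization
n_layers    = 40           # Llama-2-13B config
hidden_size = 5120         # Llama-2-13B config
kv_bytes    = 2            # FP16 KV cache entries
batch       = 1
seq_len     = 3000 + 1024  # max_input_len + max_output_len from the build command

weights_gib = n_params * w_bytes / 1024**3
# K and V caches: 2 tensors x layers x hidden x tokens x bytes x batch
kv_gib = 2 * n_layers * hidden_size * seq_len * kv_bytes * batch / 1024**3

print(f"weights  ~{weights_gib:.1f} GiB")           # ~6.1 GiB
print(f"KV cache ~{kv_gib:.1f} GiB")                # ~3.1 GiB
print(f"total    ~{weights_gib + kv_gib:.1f} GiB")  # ~9.1 GiB of 12 GiB
```

If that arithmetic is roughly right, weights plus KV cache alone land around 9 GiB on a 12 GiB card before anything else is counted, so it seems plausible that I'm near the limit, though the 1024-byte allocation failure in the log doesn't obviously match simple exhaustion.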
About this issue
- Original URL
- State: open
- Created 5 months ago
- Comments: 15
I have the same issue with an RTX 4090, an i9-13900K, Windows 11, and 32 GB RAM.
I don’t think the RAG application is workable at the moment.
[Collapsed error logs attached to the comment below: Pip Error, CMake Error]
So far, none of the examples in the Get Started blog post or the next steps listed in windows/README.md are usable. Including examples/llama as a showcase seems fairly short-sighted: generating the required quantized weights file turns out to need Triton, which only becomes apparent from the errors when running the recommended GPTQ weight quantization.
I'd like to see a fix for this on general principle, but the amount of time it takes to discover that even the documented workflows don't work really makes me question the value in this context.
Good to know I’m not the only one running into this exact same issue.