TensorRT-LLM: Unable to build Llama-2-13B-Chat on RTX 4070 Ti

System Info

  • GPU (Nvidia GeForce RTX 4070 Ti)
  • CPU 13th Gen Intel® Core™ i5-13600KF
  • 32 GB RAM
  • 1TB SSD
  • OS Windows 11

Package versions:

  • TensorRT version 9.2.0.post12.dev5
  • CUDA 12.2
  • cuDNN 8.9.7
  • Python 3.10.11

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Download the Llama-2-13b-chat-hf model from https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

  2. Download the AWQ weights (model.pt) used to build the TensorRT engine from https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.2

  3. Initiate the build of the model on a single GPU: python build.py --model_dir .\tmp --quant_ckpt_path .\tmp\model.pt --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir .\tmp\out
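
For reference, a quick way to confirm how much VRAM is actually free on the 4070 Ti right before step 3 (a minimal check; the Windows desktop and other applications also take a share of the 12 GB):

# Report total/used/free GPU memory just before launching build.py
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv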

Expected behavior

A TensorRT engine is built for the RTX 4070 Ti.

Actual behavior

The build terminates after reporting a memory allocation issue: Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

Output:

[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/lm_head/CONSTANT_2_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor logits, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Detected layernorm nodes in FP16.
[02/04/2024-20:42:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[02/04/2024-20:42:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 29820, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +14, GPU +0, now: CPU 29839, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.8.1
[02/04/2024-20:42:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/04/2024-20:42:02] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[02/04/2024-20:42:02] [TRT] [W] Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

Additional notes

I have a somewhat limited understanding of what I’m doing here. I’m trying to run the developer reference project for creating Retrieval-Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM, following these instructions: https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0?tab=readme-ov-file#building-trt-engine

I’m not sure whether I’m pushing the limits of my hardware here, or whether there are parameters I can tweak to process this in smaller pieces and avoid the memory issue. I tried different --max_input_len and --max_output_len values, reducing them as far as 512, but that doesn’t seem to make any difference.
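
For concreteness, this is one of the reduced variants I tried (a sketch of the same command as in step 3 above, with only the sequence-length limits lowered; it fails with the same allocation error):

# Same build as in the reproduction steps, but with both sequence-length limits reduced to 512
python build.py --model_dir .\tmp --quant_ckpt_path .\tmp\model.pt --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 512 --max_output_len 512 --output_dir .\tmp\out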

About this issue

  • Original URL
  • State: open
  • Created 5 months ago
  • Comments: 15

Most upvoted comments

I have the same issue, with a 4090, 13900k, Win11, 32GB RAM.

I don’t think the RAG application is workable at the moment.

  • Installing by following the directions in the RAG repo and the TensorRT-LLM repo installs 0.7.1, which requires a custom TensorRT engine, the build of which fails due to memory issues.
  • The RAG application specifically calls for TensorRT 9.1.0.4 and TensorRT-LLM release 0.5.0 in order to use the pre-built engine; however, trying to force pip to install tensorrt-llm 0.5.0 fails due to missing requirements (see Pip Error below).
  • Trying to build the TensorRT engine results in a memory failure.
  • Trying to build release/0.5.0 from source produces 20 or so “CMake Error at CMakeLists.txt:” entries and then fails on a set_target_properties error (see CMake Error below).

Pip Error

PS C:\Users\rw\inference\TensorRT> pip install "tensorrt_llm==0.5.0" --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com, https://download.pytorch.org/whl/cu121
Collecting tensorrt_llm==0.5.0
  Using cached https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.5.0-0-cp310-cp310-win_amd64.whl (431.5 MB)
Collecting build (from tensorrt_llm==0.5.0)
  Using cached build-1.0.3-py3-none-any.whl.metadata (4.2 kB)
INFO: pip is looking at multiple versions of tensorrt-llm to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch==2.1.0.dev20230828+cu121 (from tensorrt-llm) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.0+cu121, 2.1.1, 2.1.1+cu121, 2.1.2, 2.1.2+cu121, 2.2.0, 2.2.0+cu121)
ERROR: No matching distribution found for torch==2.1.0.dev20230828+cu121
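
The pin that fails here is a dated PyTorch nightly build, which is not on the stable cu121 index that the install command points at. A possible workaround, untested and assuming the dated wheel is still hosted on the nightly index, would be to pre-install that exact build from the nightly cu121 index before installing tensorrt_llm 0.5.0:

# Try to satisfy the pinned nightly torch requirement first
# (may fail if the 2023-08-28 nightly wheel has been removed from the index)
pip install torch==2.1.0.dev20230828+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
pip install "tensorrt_llm==0.5.0" --extra-index-url https://pypi.nvidia.com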

CMake Error

CMake Error at tensorrt_llm/plugins/CMakeLists.txt:106 (set_target_properties): set_target_properties called with incorrect number of arguments.

So far, none of the examples in the Get Started blog post or the next steps listed in windows/README.md are usable. The inclusion of examples/llama as a showcase seems fairly short-sighted: generating the required quantized weights file needs Triton, which becomes apparent from the errors when running the recommended GPTQ weight quantization.

I’d like a fix for this on general principle, but the amount of time it takes to discover that even the documented workflows don’t work really makes me question the value in this context.

Good to know I’m not the only one running into this exact same issue.