TensorRT-LLM: High CPU memory usage (Llama build Killed)

I am trying to run CodeLlama with the following setup:

Model size: 34B
GPUs: 2x A6000 (sm_86)

I’d like to run the model tensor-parallel across the two GPUs. Correct me if I’m wrong, but a “rank” refers to a particular GPU, and TensorRT-LLM builds a separate engine for each rank. It seems the engine builds successfully for rank 0 but not for rank 1. Here is my build command:

python build.py --meta_ckpt_dir ../../models/CodeLlama-34b-Instruct/ \
    --dtype float16 --remove_input_padding \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 \
    --enable_context_fmha --output_dir codellama_34b \
    --rotary_base 1000000 --vocab_size 32000 --world_size 2 --tp_size 2
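
For reference, here is a rough way to watch the build’s host memory while it runs (just a sketch; it assumes a single build.py process and standard Linux ps/pgrep behavior):

# Sample the build process's resident set size every 5 seconds so the climb in
# host memory is visible before the kernel kills the process.
BUILD_PID=$(pgrep -f build.py)          # assumes exactly one build.py is running
while kill -0 "$BUILD_PID" 2>/dev/null; do
    ps -o rss= -p "$BUILD_PID" | awk '{printf "build.py RSS: %.1f GiB\n", $1 / 1048576}'
    sleep 5
done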

The build then dies with only a terse “Killed” message after the rank 0 engine completes:

[10/24/2023-20:31:06] [TRT-LLM] [I] Serially build TensorRT engines.                                                                                                                                                                            
[10/24/2023-20:31:06] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 125, GPU 5521 (MiB)                                                                                                                                       
[10/24/2023-20:31:13] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2060, GPU 5833 (MiB)                                                                                                                
[10/24/2023-20:31:13] [TRT-LLM] [W] Invalid timing cache, using freshly created one                                                                                                                                                             
[10/24/2023-20:31:24] [TRT-LLM] [I] Loading weights from Meta LLaMA checkpoints ...   
[10/24/2023-20:32:26] [TRT-LLM] [I] Weights loaded. Total time: 00:01:01
[10/24/2023-20:32:27] [TRT-LLM] [I] Context FMHA Enabled                                                                                                                                                                                        
[10/24/2023-20:32:27] [TRT-LLM] [I] Remove Padding Enabled                                                                                                                                                                                      
[10/24/2023-20:32:27] [TRT-LLM] [I] Build TensorRT engine llama_float16_tp2_rank0.engine                                                                                                                                                        
[10/24/2023-20:32:27] [TRT] [W] Unused Input: position_ids                                                                                                                                                                                      
[10/24/2023-20:32:27] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.                                                                                                  
[10/24/2023-20:32:27] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 34548, GPU 7347 (MiB)                                                                                                                           
[10/24/2023-20:32:27] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 34550, GPU 7357 (MiB)                                                                                                                                    
[10/24/2023-20:32:27] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.     
[10/24/2023-20:32:50] [TRT] [W] Tactic Device request: 66048MB Available: 48676MB. Device memory is insufficient to use tactic.
[10/24/2023-20:32:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 66048 detected for tactic 0x000000000000001a.
[10/24/2023-20:32:54] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[10/24/2023-20:32:54] [TRT] [I] Detected 57 inputs and 51 output network tensors.
[10/24/2023-20:33:02] [TRT] [I] Total Host Persistent Memory: 147984
[10/24/2023-20:33:02] [TRT] [I] Total Device Persistent Memory: 0
[10/24/2023-20:33:02] [TRT] [I] Total Scratch Memory: 33620096
[10/24/2023-20:33:02] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 924 steps to complete.
[10/24/2023-20:33:02] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 30.3485ms to assign 11 blocks to 924 nodes requiring 1384123392 bytes.
[10/24/2023-20:33:02] [TRT] [I] Total Activation Memory: 1384123392
[10/24/2023-20:33:02] [TRT] [I] Total Weights Memory: 34006908952
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 35115, GPU 39801 (MiB)
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 35115, GPU 39811 (MiB)
[10/24/2023-20:33:02] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1000 MiB, GPU 32432 MiB
[10/24/2023-20:33:02] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +32432, now: CPU 0, GPU 32432 (MiB)
[10/24/2023-20:33:12] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 103590 MiB
[10/24/2023-20:33:12] [TRT-LLM] [I] Total time of building llama_float16_tp2_rank0.engine: 00:00:44
[10/24/2023-20:33:12] [TRT-LLM] [I] Config saved to codellama_34b/config.json.
[10/24/2023-20:33:12] [TRT-LLM] [I] Serializing engine to codellama_34b/llama_float16_tp2_rank0.engine...
[10/24/2023-20:33:55] [TRT-LLM] [I] Engine serialized. Total time: 00:00:42
[10/24/2023-20:34:05] [TRT-LLM] [I] Loading weights from Meta LLaMA checkpoints ...
Killed

I’m using release-0.5 with the Docker setup. Please let me know if there’s any additional information that would help with debugging this.
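
For what it’s worth, a bare “Killed” with no Python traceback usually means the Linux OOM killer terminated the process when host memory ran out; something like the following (run on the host if dmesg is restricted inside the container) should show the corresponding kernel log entry:

# Look for an OOM-killer record around the time the build died.
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"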

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 26

Most upvoted comments

Hi all, sorry for the inconvenience with the engine build.

Last week we updated the main branch to reduce the peak CPU memory footprint. Please use the --load_by_shard option with the LLaMA / BLOOM / Falcon models to reduce the memory footprint during the build. For LLaMA, --load_by_shard works with HF checkpoints only, so please convert to HF first if you have a Meta checkpoint (refer to the guide for the checkpoint conversion).
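
A rough sketch of that workflow for this case (the conversion script name, its flags, and the HF output directory below are assumptions and may differ between transformers versions):

# 1) Convert the Meta checkpoint to an HF checkpoint using the conversion script
#    that ships with the transformers repository.
python convert_llama_weights_to_hf.py \
    --input_dir ../../models/CodeLlama-34b-Instruct/ \
    --model_size 34B \
    --output_dir ../../models/CodeLlama-34b-Instruct-hf/

# 2) Rebuild from the HF checkpoint with --load_by_shard so weights are read one
#    shard at a time instead of materializing the full model in host memory.
python build.py --model_dir ../../models/CodeLlama-34b-Instruct-hf/ \
    --dtype float16 --remove_input_padding \
    --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 \
    --enable_context_fmha --rotary_base 1000000 --vocab_size 32000 \
    --world_size 2 --tp_size 2 --load_by_shard --output_dir codellama_34b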

There were several root causes behind this issue: memory leaks at engine build time and loading the full HF model into host memory. We have fixed the memory leaks during engine build, and the weights can now be loaded shard by shard rather than as a full model, so the memory footprint during the build is significantly reduced. Thanks for your valuable feedback.
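
To illustrate why shard-by-shard loading helps: an HF checkpoint is already stored as multiple weight files on disk (the file names below are only an example; the exact shard count varies), and reading them one at a time should keep peak host memory well below the ~101 GiB peak reported in the rank 0 build log above.

# Example layout of a sharded HF checkpoint directory (illustrative names only).
ls ../../models/CodeLlama-34b-Instruct-hf/
# config.json
# pytorch_model-00001-of-00007.bin
# ...
# pytorch_model.bin.index.json
# tokenizer.model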