TensorRT-LLM: Tactic running out of memory during Code Llama 34B build
On machines with either 8x A100-80GB or 8x H100, I’m hitting many “tactic out of memory” errors during the build.
The tactic says it is requesting 530000 MB while each GPU has only 80 GB, yet I observe only ~10 GB of GPU memory utilization during the build.
Here is my script:
```bash
python build.py --model_dir ./Phind-CodeLlama-34B-v2 \
    --dtype bfloat16 \
    --remove_input_padding \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --paged_kv_cache \
    --use_parallel_embedding \
    --use_inflight_batching \
    --max_input_len 14848 \
    --max_output_len 1536 \
    --vocab_size 32000 \
    --rotary_base 1000000 \
    --output_dir ./Phind/Phind-CodeLlama-34B-v2/trt-engines/bf16/8-gpu \
    --world_size 8 \
    --tp_size 8 \
    --parallel_build
```
The same errors occur with much smaller input and output lengths as well, which suggests the sequence lengths are not the cause.
Phind-CodeLlama-34B is a standard 34B Code Llama that has been fine-tuned but is architecturally identical and is available here: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
- Are these tactic errors resulting in a less optimized model? The model is still usable but it’s slower than I expected.
- I also tried running with `--builder_opt=5` for maximum optimizations, but that model then fails to load into the Triton backend at all.
The documentation here could be improved – I’d love to know what I can do to get the most optimized model possible @byshiue.
About this issue
- State: closed
- Created 8 months ago
- Comments: 18
Thanks a lot for all your efforts and the great feedback @michaelroyzen! I truly appreciate it. Long story short, @shangz-ai and I started work on that feature a few months ago (for FasterTransformer) and we never had time to properly evaluate its impact on a sufficient number of workloads. Now that we have a first release of TensorRT-LLM (phew 😉), we will do the work needed to better characterise how performance changes with the feature and improve our heuristic for it. If we do not find cases that regress, we will probably enable it by default.
Hi @michaelroyzen, a quick way to enable `multi_block_mode` in Llama is to add it here: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/llama/model.py#L81. Please give it a try.
The argument is not exposed in the current builder args, so it cannot be enabled directly from the build command; we will add it soon. Sorry for the inconvenience.
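For concreteness, here is a hedged sketch of what that one-line change might look like, assuming the linked line is where `LLaMADecoderLayer` builds its `Attention` layer; the import and the surrounding arguments shown below are illustrative placeholders, not copied from the repository:

```python
# tensorrt_llm/models/llama/model.py (release/0.5.0), inside LLaMADecoderLayer.__init__
# Illustrative sketch only: verify the exact Attention constructor signature in your
# own checkout before editing; the positional arguments here are placeholders.
from tensorrt_llm.layers import Attention

self.attention = Attention(
    hidden_size,               # model hidden size, already in scope in __init__
    num_attention_heads,       # number of attention heads, already in scope in __init__
    multi_block_mode=True,     # the flag suggested above; it is off by default
)
```

If the keyword is accepted there, rebuilding with the same `build.py` command shown earlier should pick the change up, since the builder instantiates the Llama model through this file.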
Thanks Michael. Let me ask the engineer who implemented `multi_block_mode` about an example. And let me talk to the other engineer who worked on `custom_all_reduce` regarding the crash 😉

Hi @michaelroyzen, we’ve been able to reproduce the issue with `custom_all_reduce` and the parallel embedding. We will work on a fix and update the main branch with that fix (and a couple of other ones) when it’s ready. Sorry about that.