TensorRT-LLM: Tactic running out of memory during Code Llama 34B build

On machines with either 8x A100-80GB or 8x H100, I’m getting many tactic out-of-memory errors during the build.

The tactic messages report requests of 530000 MB while each GPU has 80 GB, yet I only observe ~10 GB of GPU memory utilization during the build.

Here is my script:

python build.py --model_dir ./Phind-CodeLlama-34B-v2 \
                --dtype bfloat16 \
                --remove_input_padding \
                --use_gpt_attention_plugin bfloat16 \
                --enable_context_fmha \
                --use_gemm_plugin bfloat16 \
                --paged_kv_cache \
                --use_parallel_embedding \
                --use_inflight_batching \
                --max_input_len 14848 \
                --max_output_len 1536 \
                --vocab_size 32000 \
                --rotary_base 1000000 \
                --output_dir ./Phind/Phind-CodeLlama-34B-v2/trt-engines/bf16/8-gpu \
                --world_size 8 \
                --tp_size 8 \
                --parallel_build

The same errors appear with much smaller input and output lengths as well, which suggests the sequence lengths aren’t the cause.
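
From the wording of the log messages, my reading (an assumption on my part, not something I’ve confirmed in the docs) is that the builder is skipping auto-tuning tactics whose scratch-workspace request exceeds what it allows, rather than actually allocating that memory, which would explain why observed GPU usage stays around 10 GB. A minimal sketch with the plain TensorRT Python API of where that workspace limit lives (the tiny placeholder network is just there so the snippet builds; it is not the model’s network):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Tiny placeholder network so the sketch actually builds; the real network
# would be the one build.py constructs for the model.
x = network.add_input("x", trt.float32, (1, 16))
identity = network.add_identity(x)
network.mark_output(identity.get_output(0))

config = builder.create_builder_config()
# Cap the scratch workspace a tactic may request while the builder times
# candidate kernels. Tactics asking for more than this are skipped with an
# "insufficient memory" style message; they do not fail the build.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 8 << 30)  # 8 GiB

engine_bytes = builder.build_serialized_network(network, config)

If the flagged tactics are merely skipped, the build still succeeds, which matches what I see; my question 1 below is really about how much performance those skipped tactics leave on the table.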

Phind-CodeLlama-34B is a standard 34B Code Llama that has been fine-tuned but is architecturally identical to the base model; it is available here: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.

  1. Are these tactic errors resulting in a less optimized engine? The model is still usable, but it’s slower than I expected.
  2. I also tried building with --builder_opt=5 for maximum optimization, but that engine fails to load in the Triton backend entirely (see the sketch after this list for what I assume the flag controls).
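
For item 2, my assumption (not confirmed by the docs) is that --builder_opt maps onto TensorRT’s builder optimization level; a minimal sketch of that knob in the plain TensorRT Python API (available in TensorRT 8.6 and later):

import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
config = builder.create_builder_config()

# Levels run 0-5: higher levels let the builder spend more time trying
# tactics in search of faster kernels, at the cost of longer build times.
# Assumption: build.py forwards --builder_opt to this property.
config.builder_optimization_level = 5

Higher levels mostly trade longer build times for potentially faster engines; whether that interacts with the Triton loading failure is exactly what I’d like to understand.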

The documentation here could be improved – I’d love to know what I can do to get the most optimized model possible @byshiue.

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 18

Most upvoted comments

Thanks a lot for all your efforts and the great feedback @michaelroyzen! I truly appreciate it. Long story short, @shangz-ai and I started the work on that feature a few months ago (for FasterTransformer) and we never had time to properly evaluate the impact on a sufficient number of workloads. Now that we have a first release of TensorRT-LLM (phew 😉), we will do the work needed to better characterise how performance changes with the feature and improve our heuristic for it. If we do not find cases that regress, we will probably enable it by default.

Hi @michaelroyzen, a quick way to enable multi_block_mode in Llama is to add

multi_block_mode=True

here: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/llama/model.py#L81. Please give it a try.

We don’t expose the argument in the current builder args, so it cannot be enabled directly from build.py. We will add it soon. Sorry for the inconvenience.
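
For illustration, the edit would look roughly like the snippet below (inside the decoder layer in that file); the surrounding arguments are illustrative placeholders, not an exact copy of the source, and only the multi_block_mode line is the actual change:

# tensorrt_llm/models/llama/model.py (release/0.5.0), in the decoder layer __init__
self.attention = Attention(
    hidden_size,                 # existing arguments, shown here only as placeholders
    num_attention_heads,
    max_position_embeddings,
    dtype=dtype,
    # ... other existing arguments unchanged ...
    multi_block_mode=True,       # the one-line change suggested above
)

Once the flag is exposed through the builder arguments, this manual edit shouldn’t be needed anymore.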

Thanks, Michael. Let me ask the engineer who implemented multi_block_mode about an example. And let me talk to the other engineer who worked on custom_all_reduce regarding the crash 😉

Hi @michaelroyzen, we’ve been able to reproduce the issue with custom_all_reduce and the parallel embedding. We will work on a fix and update the main branch with it (and a couple of other fixes) when it’s ready. Sorry about that.