TensorRT-LLM: NCCL error running run.py with llama2-7b on 2x4090

System Info

  • GPU: 2x RTX 4090
  • OS: Docker (image built via the TensorRT-LLM make target)
  • TensorRT-LLM version: 0.9.0.dev2024022000
  • Driver: 525.116.03
  • CUDA version: 12.3

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. python convert_checkpoint.py --model_dir /tmp/llama2-7b-hf/ --output_dir /tmp/tllm_checkpoint_tp2 --dtype float16 --tp_size 2
  2. trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_tp2/ --output_dir /tmp/2tp_fp16_batch32 --gemm_plugin float16 --max_batch_size 32 --use_custom_all_reduce disable
  3. mpirun -n 2 --allow-run-as-root python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/
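
To get more detail when step 3 fails, one option is to rerun it with NCCL debug logging enabled. NCCL_DEBUG is a standard NCCL environment variable and -x is Open MPI's flag for exporting environment variables to the ranks; the command below is just step 3 with logging added, not part of the original report:

  # Same as step 3, but with NCCL's own diagnostics printed to stderr
  mpirun -n 2 --allow-run-as-root -x NCCL_DEBUG=INFO python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/

The INFO log reports the NCCL version actually loaded and the transport (P2P, SHM, or network) chosen for each ring, which is useful context for the discussion below.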

Expected behavior

The run completes and prints generated output tokens.

Actual behavior

The run fails, and both ranks print the same NCCL error:

Failed, NCCL error /home/qiaoxj/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'

Additional notes

Also tried on 2x L40S; the same error occurs.

About this issue

  • Original URL
  • State: open
  • Created 4 months ago
  • Comments: 24

Most upvoted comments

@kimbaol thanks for the experiments. I was just trying to ask whether the NCCL version is different. However, I see that the original log you shared is also based on NCCL 2.19.3 + CUDA 12.3. Am I missing something here? #1131 (comment)
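
As a side note, one quick way to confirm which NCCL build is in use inside the container (assuming PyTorch is installed there, as it is in the usual TensorRT-LLM images) is the standard torch.cuda.nccl.version() API:

  # Prints the NCCL version PyTorch was built against, e.g. (2, 19, 3)
  python -c "import torch; print(torch.cuda.nccl.version())"

Running with NCCL_DEBUG=INFO (as sketched under Reproduction) also shows the version the runtime actually loads, which can differ from the build-time one.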

@PerkzZheng You can try building the Docker image using the build script from tensorrtllm_backend; I think you will be able to reproduce the error.

Today I was able to test the exact same build and model on a server with 2x A100 connected via NVLink, and that worked correctly. So could the problem be related to the 4090 not supporting NVLink?
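
If missing NVLink/P2P support is the suspect, a minimal check (my sketch, not from the thread; it assumes PyTorch is available in the container and that the two cards are devices 0 and 1) is:

  import torch

  # Reports whether CUDA peer-to-peer access is possible between the two GPUs.
  # Consumer cards such as the RTX 4090 generally lack P2P, which forces NCCL
  # onto a different transport than on NVLink-connected A100s.
  print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))
  print("P2P 1->0:", torch.cuda.can_device_access_peer(1, 0))

nvidia-smi topo -m shows the same information at the system level. If P2P is indeed the issue, exporting NCCL_P2P_DISABLE=1 (a documented NCCL environment variable) before the mpirun command may be worth trying, though I cannot confirm it resolves this particular error.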

I’m using A800 GPUs and got the same error.