TensorRT-LLM: NCCL error running run.py with llama2-7b on 2x4090

System Info

  • GPU: 2x RTX 4090
  • OS: Docker (image built via the TensorRT-LLM make target)
  • TensorRT-LLM version: 0.9.0.dev2024022000
  • Driver: 525.116.03
  • CUDA version: 12.3

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. python convert_checkpoint.py --model_dir /tmp/llama2-7b-hf/ --output_dir /tmp/tllm_checkpoint_tp2 --dtype float16 --tp_size 2
  2. trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_tp2/ --output_dir /tmp/2tp_fp16_batch32 --gemm_plugin float16 --max_batch_size 32 --use_custom_all_reduce disable
  3. mpirun -n 2 --allow-run-as-root python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/
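
To get more detail when step 3 fails, one option is to rerun it with NCCL debug logging enabled. NCCL_DEBUG is a standard NCCL environment variable and -x is Open MPI's flag for exporting environment variables to the ranks; the command below is just step 3 with logging added, not part of the original report:

  # Same as step 3, but with NCCL's own diagnostics printed to stderr
  mpirun -n 2 --allow-run-as-root -x NCCL_DEBUG=INFO python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/

The INFO log reports the NCCL version actually loaded and the transport (P2P, SHM, or network) chosen for each ring, which is useful context for the discussion below.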

Expected behavior

The run completes and prints generated output tokens.

Actual behavior

The run fails, and both ranks print the same NCCL error:

Failed, NCCL error /home/qiaoxj/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'

Additional notes

Also tried on 2x L40S; the same error occurs.

About this issue

  • Original URL
  • State: open
  • Created 4 months ago
  • Comments: 24

Most upvoted comments

@kimbaol thanks for the experiments. I was just trying to ask whether the NCCL version is different. However, I see that the original log you shared is also based on NCCL 2.19.3 + CUDA 12.3. Am I missing something here? #1131 (comment)
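
As a side note, one quick way to confirm which NCCL build is in use inside the container (assuming PyTorch is installed there, as it is in the usual TensorRT-LLM images) is the standard torch.cuda.nccl.version() API:

  # Prints the NCCL version PyTorch was built against, e.g. (2, 19, 3)
  python -c "import torch; print(torch.cuda.nccl.version())"

Running with NCCL_DEBUG=INFO (as sketched under Reproduction) also shows the version the runtime actually loads, which can differ from the build-time one.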

@PerkzZheng You can try building the Docker image using the build script from tensorrtllm_backend; I think you will be able to reproduce the error.

Today I was able to test the exact same build and model on a server with 2x A100 connected via NVLink, and that worked correctly. So could the problem be related to the 4090 not supporting NVLink?
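
If missing NVLink/P2P support is the suspect, a minimal check (my sketch, not from the thread; it assumes PyTorch is available in the container and that the two cards are devices 0 and 1) is:

  import torch

  # Reports whether CUDA peer-to-peer access is possible between the two GPUs.
  # Consumer cards such as the RTX 4090 generally lack P2P, which forces NCCL
  # onto a different transport than on NVLink-connected A100s.
  print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))
  print("P2P 1->0:", torch.cuda.can_device_access_peer(1, 0))

nvidia-smi topo -m shows the same information at the system level. If P2P is indeed the issue, exporting NCCL_P2P_DISABLE=1 (a documented NCCL environment variable) before the mpirun command may be worth trying, though I cannot confirm it resolves this particular error.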

I’m using A800 GPUs and got the same error.