TensorRT-LLM: NCCL error running run.py with llama2-7b on 2x RTX 4090
System Info
- GPU: 2x RTX 4090
- OS: Docker (image built with the TensorRT-LLM make target)
- TensorRT-LLM version: 0.9.0.dev2024022000
- Driver: 525.116.03
- CUDA version: 12.3
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- python convert_checkpoint.py --model_dir /tmp/llama2-7b-hf/ --output_dir /tmp/tllm_checkpoint_tp2 --dtype float16 --tp_size 2
- trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_tp2/ --output_dir /tmp/2tp_fp16_batch32 --gemm_plugin float16 --max_batch_size 32 --use_custom_all_reduce disable
- mpirun -n 2 --allow-run-as-root python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/
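To isolate whether the failure is in NCCL itself or in the TensorRT-LLM all-reduce plugin, a plain two-GPU all-reduce can be run under the same mpirun launcher. This is a minimal sketch, assuming PyTorch with CUDA support is available in the container; the file name nccl_check.py is illustrative:

```python
# nccl_check.py -- minimal NCCL sanity check, independent of TensorRT-LLM.
# Run with: mpirun -n 2 --allow-run-as-root python nccl_check.py
import os

import torch
import torch.distributed as dist

# OpenMPI exposes rank and world size through these environment variables.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs for diagnosis

torch.cuda.set_device(rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# All-reduce a tensor; on 2 ranks the result should be 1 + 2 = 3 everywhere.
x = torch.full((1024,), float(rank + 1), device="cuda")
dist.all_reduce(x)
print(f"rank {rank}: all_reduce ok, value = {x[0].item()}")

dist.destroy_process_group()
```

If this minimal check also fails, the problem likely lies in the NCCL/driver/container setup (e.g., docker --shm-size or IOMMU/ACS settings) rather than in TensorRT-LLM.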
Expected behavior
The model generates and prints output tokens.
Actual behavior
Failed with an NCCL error (printed by each rank):
Failed, NCCL error /home/qiaoxj/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
Failed, NCCL error /home/qiaoxj/TensorRT-LLM/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'
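To surface the underlying NCCL failure reason behind 'unknown result code', NCCL's debug output can be forwarded through the same launcher (NCCL_DEBUG is a standard NCCL environment variable; -x is the standard OpenMPI flag for exporting environment variables to ranks):
- mpirun -n 2 --allow-run-as-root -x NCCL_DEBUG=INFO python run.py --engine_dir /tmp/2tp_fp16_batch32/ --max_output_len=50 --tokenizer_dir /tmp/llama2-7b-hf/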
Additional notes
Also tried on 2x L40S; the same error occurs.
Comments
@PerkzZheng You can try building the Docker image with the build script from tensorrtllm_backend; I think you can reproduce the error that way.
I'm using an A800 and got the same error.