vllm: Cuda failure 'peer access is not supported between these two devices'

Usage stats collection is enabled. To disable this, run the following command: ray disable-usage-stats before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-08 23:11:34,236 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 23:11:35 llm_engine.py:60] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 07-08 23:11:35 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(Worker pid=4225) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=4225, ip=172.31.68.176, actor_id=5dc662848f950df8d330eb8a01000000, repr=<vllm.worker.worker.Worker object at 0x7f4e9ea814e0>)
(Worker pid=4225)   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
(Worker pid=4225)     _init_distributed_environment(parallel_config, rank,
(Worker pid=4225)   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
(Worker pid=4225)     torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=4225)   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=4225)     return func(*args, **kwargs)
(Worker pid=4225)   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=4225)     work = default_pg.allreduce([tensor], opts)
(Worker pid=4225) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
(Worker pid=4225) ncclInternalError: Internal check failed.
(Worker pid=4225) Last error:
(Worker pid=4225) Cuda failure 'peer access is not supported between these two devices'

Code: llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)

Env: a single EC2 g5.12xlarge instance with 4 A10G GPUs
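The error comes from NCCL's CUDA peer-to-peer (P2P) check: the four A10G GPUs on g5 instances are connected over PCIe and typically report that P2P access is not available. A diagnostic sketch (not part of the original report) to confirm this from PyTorch:

```python
# Diagnostic sketch: check CUDA peer-to-peer support between every GPU pair.
# On g5.12xlarge the four A10Gs typically report False, which is exactly
# the condition that makes NCCL fail with the error above.
import torch

num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'supported' if ok else 'NOT supported'}")
```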

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

@nivibilla I tried the above workaround on a g5.12xlarge notebook instance in SageMaker and it worked for me. I also tried reinstalling vllm from source after adding os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1' in the codebase just before this line, and it worked again. I guess you tried on an EC2 VM. Can you try the second way?
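For reference, the first (notebook) workaround boils down to exporting the variable in the driver process before the engine is constructed. A minimal sketch, reusing the model and tensor_parallel_size from the issue:

```python
# Minimal sketch of the env-var workaround: NCCL_IGNORE_DISABLED_P2P must be
# set before vLLM initializes its distributed workers, i.e. before LLM(...).
import os
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)
```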

Steps

  • clone the project
  • Add the following before this line, and import os at the top of the file (see the sketch after these steps)
os.environ["NCCL_IGNORE_DISABLED_P2P"] = '1'
  • pip install .
  • run your distributed inference again
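For clarity, a sketch of what the edit looks like in vllm/worker/worker.py; the exact anchor is the line linked above (not reproduced here), but the assignment just has to run before the NCCL process group is created:

```python
# Sketch of the patch to vllm/worker/worker.py (surrounding code abbreviated;
# only the environment assignment and, if missing, the os import are new).
import os  # add at the top of the file if it is not imported already

# Added line: let NCCL fall back to shared-memory/PCIe transports instead of
# failing when peer access between the GPUs is disabled.
os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

# The existing code below this point is unchanged, ending in the warm-up
# collective that raised the error in the traceback above:
# torch.distributed.all_reduce(torch.zeros(1).cuda())
```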