vllm: Nvidia driver 545.29.02 breaks --tensor-parallel-size

I just upgraded my drivers to 545.29.02, and it has broken my ability to run models larger than a single GPU's RAM with vLLM.

If I pass --tensor-parallel-size 2, things just hang when trying to create the engine. Without it, the model loads just fine (as long as it fits in a single GPU's RAM).

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2
INFO 11-27 12:46:10 api_server.py:648] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-11-27 12:46:36,779 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-27 12:46:37 llm_engine.py:72] Initializing an LLM engine with config: model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer='teknium/OpenHermes-2.5-Mistral-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

PyTorch version: '2.1.1+cu121'

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

And the model never finishes loading. nvidia-smi shows some load on the GPUs, and I have two CPU cores pegged as well.

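To narrow down whether the hang is inside vLLM or in NCCL/the driver itself, a standalone two-GPU all-reduce test is useful. This is only a sketch (not from the original report), using plain PyTorch with the same NCCL backend vLLM relies on:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; rendezvous over localhost.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # hangs here if GPU-to-GPU (P2P) communication is broken
    print(f"rank {rank}: all_reduce -> {x.item()} (expected {float(world_size)})")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

If this test also hangs, the problem sits below vLLM (NCCL or the driver's P2P path); if it completes, the issue is more likely on the vLLM/Ray side.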

Most upvoted comments

We updated our server with the two A100 40GB GPUs to the latest Ubuntu, Nvidia driver, and CUDA, and now it works as expected. So it seems that it really is a driver problem.

Ubuntu 22.04.3 LTS
NVIDIA-SMI 545.29.06
Driver Version: 545.29.06
CUDA Version: 12.3

But it also worked before the driver update if we disabled NCCL peer-to-peer transfers by setting NCCL_P2P_DISABLE=1 (see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable).
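
For anyone still on the broken driver, that workaround is just the environment variable prefixed to the launch command (shown here with the same model as in the report above; disabling P2P routes transfers through host memory, so expect some throughput loss):

NCCL_P2P_DISABLE=1 python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2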