vllm: vllm hangs when reinitializing ray

I’d like to be able to unload a vLLM model and re-load it later in the same script. However, the following script (on vLLM 0.1.7) hangs:

from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Results in:

2023-09-15 11:43:25,943 INFO worker.py:1621 -- Started a local Ray instance.
INFO 09-15 11:43:51 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-70b-chat-hf', tokenizer='meta-llama/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, download_dir='/scr/biggest/nfliu/cache/huggingface/', load_format=pt, tensor_parallel_size=2, seed=0)
INFO 09-15 11:43:51 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 09-15 11:45:58 llm_engine.py:199] # GPU blocks: 2561, # CPU blocks: 1638
Processed prompts: 100%|█████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.17s/it]
2023-09-15 11:46:28,348 INFO worker.py:1453 -- Calling ray.init() again after it has already been called.

Then it just hangs forever (I’ve been waiting 10 minutes with no sign of life). Checking nvidia-smi shows that the model has indeed been unloaded from the GPUs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   30C    P0              61W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0              57W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I’m fairly sure this is related to Ray, since it doesn’t happen when tensor parallelism is set to 1 (e.g., when running a smaller model). When I ctrl+c out of the script after it hangs, the traceback shows it’s stuck on ray.get(current_placement_group.ready(), timeout=1800) (https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63).

Is there any way to “reset” the ray state, such that it initializes from scratch the second time?
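
For context, the kind of reset I have in mind looks roughly like the sketch below (plain Ray/PyTorch cleanup rather than an official vLLM API; whether ray.shutdown() alone is enough on 0.1.7 is exactly what I don’t know):

import gc

import ray
import torch
from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    outputs = llm.generate(prompts, sampling_params)

    # Drop the engine and free the GPU memory held by this process.
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    # Tear down the Ray instance that vLLM started, so the next LLM(...) call
    # initializes Ray (and its placement group) from scratch instead of hitting
    # the "Calling ray.init() again" path shown in the log above.
    ray.shutdown()

    return outputs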

Most upvoted comments

I encountered the same issue. It runs fine with tensor_parallel_size=1, but it hangs with tensor_parallel_size>1. I have tried reinstalling many times, but it didn’t help.

The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is: ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True).

Note: I ran into the hang with tensor_parallel_size>1 on a 128-core machine; running tensor_parallel_size>1 on a 96-core machine works normally.
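
For reference, the change boils down to passing explicit resource limits to the ray.init() call in vllm/engine/ray_utils.py. The exact surrounding code differs between vLLM versions, so treat this as a sketch of the patch rather than a drop-in diff, and adjust num_cpus/num_gpus to your own machine:

# In vllm/engine/ray_utils.py, inside the function that initializes Ray
# (initialize_cluster in the 0.1.x codebase; name and context may differ).

# Original call (roughly):
#   ray.init(address=ray_address, ignore_reinit_error=True)

# Patched call, capping the resources Ray will manage. The 32 CPUs / 4 GPUs
# are the values from the comment above and are machine-specific:
ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True)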

In my case I have 4 GPUs and 3 RayServe deployments: two require 1 logical GPU each with tensor_parallelism=1, and one requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck handling the tensor_parallelism=2 deployment because there aren’t enough free GPUs for its placement group (see the ray status output and the sketch below).

Resources
---------------------------------------------------------------
Usage:
 17.0/48.0 CPU
 4.0/4.0 GPU
 0B/104.83GiB memory
 44B/48.92GiB object_store_memory

Demands:
 {'GPU': 1.0} * 2 (PACK): 1+ pending placement groups
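
This matches where the original report gets stuck: with tensor_parallel_size=2, vLLM asks Ray for a placement group of two GPU bundles and blocks on its ready() call until the cluster can satisfy it. A minimal standalone sketch of that blocking behavior, using plain Ray rather than vLLM’s actual code:

import ray
from ray.util.placement_group import placement_group

ray.init()

# Request two 1-GPU bundles packed together -- the same demand that shows up
# above as "{'GPU': 1.0} * 2 (PACK)".
pg = placement_group([{"GPU": 1.0}] * 2, strategy="PACK")

# If two GPUs never become free, this blocks until the timeout expires --
# which is the hang reported in this issue (vLLM uses timeout=1800).
ray.get(pg.ready(), timeout=1800)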

Yes, I modified the ray_utils.py installed in my conda environment for vllm.