vllm: vllm hangs when reinitializing ray
I’d like to be able to unload a vLLM model and reload it later in the same script. However, the following (on 0.1.7) causes the script to hang:
from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)
prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]
batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)
Results in:
2023-09-15 11:43:25,943 INFO worker.py:1621 -- Started a local Ray instance.
INFO 09-15 11:43:51 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-70b-chat-hf', tokenizer='meta-llama/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, download_dir='/scr/biggest/nfliu/cache/huggingface/', load_format=pt, tensor_parallel_size=2, seed=0)
INFO 09-15 11:43:51 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 09-15 11:45:58 llm_engine.py:199] # GPU blocks: 2561, # CPU blocks: 1638
Processed prompts: 100%|█████████████████████████████████████████████████| 2/2 [00:14<00:00, 7.17s/it]
2023-09-15 11:46:28,348 INFO worker.py:1453 -- Calling ray.init() again after it has already been called.
Then it just hangs forever (I’ve been waiting 10 minutes, with no sign of life). Checking nvidia-smi shows that the model has indeed been unloaded from the GPUs:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 30C P0 61W / 350W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:CA:00.0 Off | 0 |
| N/A 31C P0 57W / 350W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
I’m fairly sure this is related to Ray, since it doesn’t happen if tensor parallelism is set to 1 (e.g., if you’re running a smaller model). When I Ctrl+C out of the script after it hangs, the traceback shows it’s stuck on ray.get(current_placement_group.ready(), timeout=1800):
https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63
Is there any way to “reset” the ray state, such that it initializes from scratch the second time?
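One thing worth trying (not confirmed in this thread as a fix) is to explicitly call ray.shutdown() after each batch, so the second LLM(...) re-initializes Ray from scratch instead of reusing the stale state. Below is a minimal sketch of that idea; the fresh_ray context manager is my own wrapper (not part of vLLM or Ray), used so the teardown runs even if generation raises:

```python
from contextlib import contextmanager

@contextmanager
def fresh_ray():
    """Run the body, then shut Ray down so the next LLM() starts clean."""
    try:
        yield
    finally:
        # Imported lazily so the sketch stays importable without Ray installed.
        try:
            import ray
            ray.shutdown()  # no-op if Ray was never initialized
        except ImportError:
            pass

def process_prompts(prompts):
    # Heavy imports kept inside the function; this part needs vLLM and 2 GPUs.
    from vllm import LLM, SamplingParams
    with fresh_ray():
        llm = LLM(
            model="meta-llama/Llama-2-70b-chat-hf",
            tensor_parallel_size=2,
            trust_remote_code=True,
            load_format="pt")
        params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
        return llm.generate(prompts, params)
```

Whether ray.shutdown() alone is enough here is an open question, given that the hang happens inside placement-group setup on the second init.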
About this issue
- Original URL
- State: open
- Created 10 months ago
- Reactions: 2
- Comments: 15
I encountered the same issue. It runs fine when I use tensor_parallel_size=1, but it hangs when I use tensor_parallel_size>1. I tried reinstalling many times, but it didn’t help. The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is: ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True).
Note: I hit the hang with tensor_parallel_size>1 on a 128-core machine; on a 96-core machine, tensor_parallel_size>1 works normally.

In my case I have 4 GPUs and 3 RayServe deployments: two require 1 logical GPU with tensor_parallelism=1, and the third requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck handling the tensor_parallelism=2 deployment because there aren’t enough resources.
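The CPU-capping workaround above can be expressed as a small helper. The name ray_init_kwargs and the parameter max_cpus are mine, and the defaults (32 CPUs, 4 GPUs) simply mirror the commenter’s values; actually applying it still means editing vllm/engine/ray_utils.py as they describe:

```python
import os

def ray_init_kwargs(max_cpus=32, num_gpus=4):
    """Build arguments for ray.init(...), capping the CPUs Ray may claim.

    The commenter saw the hang only on a 128-core machine, suggesting that
    letting Ray autodetect every core is part of the problem, so we cap
    num_cpus explicitly instead.
    """
    cpus = min(os.cpu_count() or max_cpus, max_cpus)
    return {
        "num_cpus": cpus,
        "num_gpus": num_gpus,
        "ignore_reinit_error": True,  # tolerate the second ray.init() call
    }

# Inside ray_utils.py the call would then look like:
# ray.init(address=ray_address, **ray_init_kwargs())
```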
Yes, I modified ray_utils.py in the vllm package installed in my conda environment.