vllm: ray OOM in tensor parallel

In my case, I can deploy the vLLM service on a single GPU, but when I use multiple GPUs I hit the Ray OOM error. Could you please help me solve this problem? My model is yahma/llama-7b-hf, my transformers version is 4.28.0, and my CUDA version is 11.4.
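
For reference, the multi-GPU initialization is essentially the following (a minimal sketch; the local model path and tensor_parallel_size=4 are taken from the log below):

from vllm import LLM

# Works on a single GPU:
# llm = LLM(model="/opt/app/yahma-llama-lora")

# Fails with a Ray OutOfMemoryError once tensor parallelism is enabled:
llm = LLM(
    model="/opt/app/yahma-llama-lora",  # local copy of yahma/llama-7b-hf
    tensor_parallel_size=4,             # one Ray worker per GPU
)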


2023-06-30 09:24:53,455 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-06-30 09:24:53,459 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.12gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-06-30 09:24:53,584 INFO worker.py:1636 -- Started a local Ray instance.
INFO 06-30 09:24:54 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
WARNING 06-30 09:24:54 config.py:131] Possibly too large swap space. 16.00 GiB out of the 32.00 GiB total CPU memory is allocated for the swap space.
/opt/app/yahma-llama-lora
Exception in thread ray_print_logs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 900, in print_logs
    global_worker_stdstream_dispatcher.emit(data)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/ray_logging.py", line 264, in emit
    handle(data)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1788, in print_to_stdstream
    print_worker_logs(batch, sink)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1950, in print_worker_logs
    restore_tqdm()
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 1973, in restore_tqdm
    tqdm_ray.instance().unhide_bars()
  File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 344, in instance
    _manager = _BarManager()
  File "/usr/local/lib/python3.8/dist-packages/ray/experimental/tqdm_ray.py", line 256, in __init__
    self.should_colorize = not ray.widgets.util.in_notebook()
  File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 205, in in_notebook
    shell = _get_ipython_shell_name()
  File "/usr/local/lib/python3.8/dist-packages/ray/widgets/util.py", line 194, in _get_ipython_shell_name
    import IPython
  File "/usr/local/lib/python3.8/dist-packages/IPython/__init__.py", line 30, in <module>
    raise ImportError(
ImportError:
IPython 8.13+ supports Python 3.9 and above, following NEP 29.
IPython 8.0-8.12 supports Python 3.8 and above, following NEP 29.
When using Python 2.7, please install IPython 5.x LTS Long Term Support version.
Python 3.3 and 3.4 were supported up to IPython 6.x.
Python 3.5 was supported with IPython 7.0 to 7.9.
Python 3.6 was supported with IPython up to 7.16.
Python 3.7 was still supported with the 7.x branch.

See IPython README.rst file for more information:

https://github.com/ipython/ipython/blob/main/README.rst

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 10.30.192.36, ID: 17400c6c9eee3bc1384c172eecd4e1ecf2992cbc7f50cb27d2dc60d7) where the task (task ID: ffffffffffffffff283e91f20257d747969124a201000000, name=Worker.__init__, pid=26332, memory used=4.54GB) was running was 31.27GB / 32.00GB (0.977298), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.30.192.36`. To see the logs of the worker, use `ray logs worker-cb6154315a0e1a33d85683935ae20cf76eecd48230c3c4b3a5563fe4*out -ip 10.30.192.36`.
Top 10 memory users:
PID     MEM(GB) COMMAND
26333   4.60    ray::Worker.__init__
26332   4.54    ray::Worker.__init__
26331   4.51    ray::Worker.__init__
26330   4.47    ray::Worker.__init__
25044   0.23    python
25099   0.19    /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
25340   0.06    ray::IDLE
25174   0.06    /usr/bin/python /usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 -...
25310   0.06    /usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
25349   0.05    ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Most upvoted comments

It seems that if you turn down --max-model-len, it will start. For example, start with a command like:
python -m vllm.entrypoints.api_server --model /workspace/model/ --tensor-parallel-size 4 --max-model-len 6000
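
If you are calling vLLM from Python instead of through the API server, newer vLLM releases expose the same knob as a max_model_len argument on the LLM class (it does not exist in 0.1.1, so this is only a hedged sketch for later versions):

from vllm import LLM

# max_model_len caps the context length the engine provisions for,
# mirroring the --max-model-len flag used with the API server above.
llm = LLM(
    model="/workspace/model/",
    tensor_parallel_size=4,
    max_model_len=6000,
)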

Me too. Maybe the Ray memory monitor detects memory usage incorrectly? I found that a lot of memory was occupied by the system buffer/cache, and judging from its error log, Ray treats that memory as unavailable.

Disabling the Ray memory monitor with export RAY_memory_monitor_refresh_ms=0 worked for me: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html#how-do-i-disable-the-memory-monitor
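
For the offline LLM API the same workaround looks roughly like this (a sketch; the variable must be in the environment before Ray starts, so set it before constructing the engine, or export it in the shell that launches the script):

import os

# Must be set before Ray is initialized (LLM() starts Ray internally);
# a value of 0 disables Ray's memory monitor / OOM killer entirely.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"

from vllm import LLM

llm = LLM(model="/opt/app/yahma-llama-lora", tensor_parallel_size=4)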

related issue: https://github.com/ray-project/ray/issues/10895

@WoosukKwon Thank you for answering my question! When I try swap_space, the problem is still not solved. My code is:

from vllm import LLM
model_path = 'yahma/llama-13b-hf'
llama_model = LLM(model=model_path, tensor_parallel_size=4, swap_space=1)

My CPU has 32 GB of memory, and I use 4x A100 40GB. The error message is still the same:

2023-07-03 03:27:55,908 WARNING utils.py:593 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2023-07-03 03:27:55,911 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=6.08gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-03 03:27:56,045 INFO worker.py:1636 -- Started a local Ray instance.
INFO 07-03 03:27:56 llm_engine.py:59] Initializing an LLM engine with config: model='/opt/app/yahma-llama-lora', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
/opt/app/yahma-llama-lora
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/app/vllm-0.1.1/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 151, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 114, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/app/vllm-0.1.1/vllm/engine/llm_engine.py", line 317, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Memory on the node (IP: 10.30.192.36, ID: 91847a2262e263f96264497d39d4641c385303a97ff78e3fc6f0e721) where the task (task ID: ffffffffffffffff27a08d091fe239dc78e7cd0c01000000, name=Worker.__init__, pid=51664, memory used=4.45GB) was running was 31.21GB / 32.00GB (0.97518), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.30.192.36`. To see the logs of the worker, use `ray logs worker-ddd4c0e44d6355f85eb5027fac7616a529d599bb4e3193b1df451167*out -ip 10.30.192.36`.
Top 10 memory users:
PID     MEM(GB) COMMAND
51660   4.45    ray::Worker.__init__
51664   4.45    ray::Worker.__init__
51658   4.42    ray::Worker.__init__
51662   4.41    ray::Worker.__init__
45071   0.27    python
50443   0.18    /usr/local/lib/python3.8/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20...
50650   0.06    /usr/bin/python -u /usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py --node-ip-address=1...
50694   0.05    ray::IDLE
50681   0.05    ray::IDLE
50688   0.05    ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

Hi @liulfy, it's because we allocate 4 GB of CPU memory per GPU. Adding swap_space=1 when initializing LLM will solve the problem.
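
For context, the default swap_space is 4 GiB of CPU memory per GPU worker, so tensor_parallel_size=4 reserves 4 x 4 = 16 GiB of the 32 GiB host RAM (the "16.00 GiB out of the 32.00 GiB" warning in the first log). A sketch of the suggested fix, which lowers the reservation to 4 x 1 = 4 GiB:

from vllm import LLM

# swap_space is GiB of CPU swap per GPU worker; the default of 4 means
# 4 GPUs x 4 GiB = 16 GiB on a 32 GiB host, leaving little headroom
# once model shards and Ray overhead are loaded into CPU memory.
llm = LLM(
    model="yahma/llama-7b-hf",
    tensor_parallel_size=4,
    swap_space=1,  # 4 x 1 GiB = 4 GiB total CPU swap
)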