vllm: Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

Since vLLM 0.2.5, we can’t even run llama-2 70B 4-bit AWQ on 4*A10G anymore and have to use an old vLLM. We see similar problems even when trying to run two 7B models on an 80GB A100.

For small models, like a 7B with 4k tokens, vLLM fails with the “cache blocks” error even though a lot more memory is still free.

E.g., building a docker image with CUDA 11.8 and vLLM 0.2.5 or 0.2.6 and running like:

port=5001
tokens=8192
docker run -d \
    --runtime=nvidia \
    --gpus '"device=1"' \
    --shm-size=10.24gb \
    -p $port:$port \
    --entrypoint /h2ogpt_conda/vllm_env/bin/python3.10 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    --network host \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 -m vllm.entrypoints.openai.api_server \
        --port=$port \
        --host=0.0.0.0 \
        --model=defog/sqlcoder2 \
        --seed 1234 \
        --trust-remote-code \
        --max-num-batched-tokens $tokens \
        --max-model-len=$tokens \
        --gpu-memory-utilization 0.4 \
        --download-dir=/workspace/.cache/huggingface/hub &>> logs.vllm_server.sqlcoder2.txt

port=5002
tokens=4096
docker run -d \
    --runtime=nvidia \
    --gpus '"device=1"' \
    --shm-size=10.24gb \
    -p $port:$port \
    --entrypoint /h2ogpt_conda/vllm_env/bin/python3.10 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:/workspace/.cache \
    --network host \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 -m vllm.entrypoints.openai.api_server \
        --port=$port \
        --host=0.0.0.0 \
        --model=NumbersStation/nsql-llama-2-7B \
        --seed 1234 \
        --trust-remote-code \
        --max-num-batched-tokens $tokens \
        --gpu-memory-utilization 0.6 \
        --max-model-len=$tokens \
        --download-dir=/workspace/.cache/huggingface/hub &>> logs.vllm_server.nsql7b.txt

This works. However, if the 2nd model is instead given --gpu-memory-utilization 0.4, one gets:

Traceback (most recent call last):
  File "/h2ogpt_conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/h2ogpt_conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 113, in __init__
    self._init_cache()
  File "/h2ogpt_conda/vllm_env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 227, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

However, with the 0.6 util from before, here is what the GPU looks like:


Sun Dec 24 02:45:53 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:06.0 Off |                    0 |
| N/A   43C    P0              72W / 300W |  70917MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0              66W / 300W |  49136MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      6232      C   /h2ogpt_conda/vllm_env/bin/python3.10     70892MiB |
|    1   N/A  N/A      6966      C   /h2ogpt_conda/vllm_env/bin/python3.10     32430MiB |
|    1   N/A  N/A      7685      C   /h2ogpt_conda/vllm_env/bin/python3.10     16670MiB |

Ignore GPU=0.

So the model at 0.6 util only occupies ~17GB; why would 0.4 util out of 80GB be a problem?
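
As context for that question, here is a rough sketch (reconstructed, not the exact vLLM source) of how 0.2.5+ appears to size the KV cache after its profiling forward pass. Because free/total memory is read for the whole device, memory held by any other process on the same GPU counts against the gpu_memory_utilization budget:

import torch

# Rough, illustrative sketch of the block-count estimate in vllm/worker/worker.py (0.2.5+);
# names and structure are approximations, not the upstream code.
def estimate_num_gpu_blocks(gpu_memory_utilization: float, cache_block_size_bytes: int) -> int:
    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
    # Everything not currently free on the device counts as used, including other processes.
    peak_memory = total_gpu_memory - free_gpu_memory
    # Whatever remains inside the utilization budget becomes KV-cache blocks; if the budget is
    # already exceeded this goes to zero and vLLM raises
    # "No available memory for the cache blocks."
    return int((total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size_bytes)

If that reading is right, the ~32GB already held by the first model plus the second model’s weights exceed 0.4 × 80GB ≈ 32GB, so the 0.4 budget goes negative, while 0.6 × 80GB ≈ 49GB still leaves a small margin, which would match the behavior above.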

About this issue

  • State: open
  • Created 6 months ago
  • Reactions: 5
  • Comments: 40 (7 by maintainers)

Most upvoted comments

Yet another version of this problem is that 01-ai/Yi-34B-Chat used to work perfectly fine on 4*H100 80GB when run like:

python -m vllm.entrypoints.openai.api_server --port=5000 --host=0.0.0.0 --model 01-ai/Yi-34B-Chat --seed 1234 --tensor-parallel-size=4 --trust-remote-code

But it no longer works since 0.2.5+, including 0.2.7. Instead I get:

INFO 01-16 14:40:02 api_server.py:750] args: Namespace(host='0.0.0.0', port=5000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, ch>
2024-01-16 14:40:04,623 INFO worker.py:1673 -- Started a local Ray instance.
INFO 01-16 14:40:06 llm_engine.py:70] Initializing an LLM engine with config: model='01-ai/Yi-34B-Chat', tokenizer='01-ai/Yi-34B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust>
INFO 01-16 14:41:00 llm_engine.py:294] # GPU blocks: 0, # CPU blocks: 4369
Traceback (most recent call last):
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 760, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 544, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 274, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/fsuser/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 298, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

When can we expect a fix? It seems a pretty serious bug.

BTW, curiously, I ran the exact same command a second time (both times with nothing else on the GPUs) and the second time I didn’t hit the error. So maybe there is a race in the memory size detection in vLLM.

We are having the exact same issue on our end: cache usage grows and consumes more than the allocated gpu_memory_utilization, even when using --enforce-eager.

We had the same problem before with 0.2.1

I dug into this a bit and here are some findings:

  1. When serving large models (e.g. 70B), the model forward pass itself introduces memory fragmentation. I logged the free memory after each decoder layer (a sketch of how this can be logged is after this list) and found that the free memory drops after every layer. In the case of the 70B model, after 80 layers only ~2 GB out of the 40 GB per GPU is free.
  2. The profiling run samples with top_k = vocab_size - 1. This results in fairly high memory usage when the vocabulary is large.
  3. The GPU cache block estimation does not account for fragmentation. Combining the above two, the free memory is less than 1 GB, which results in a very small batch size or even no GPU blocks available for the KV cache.
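
A hypothetical sketch (assumed names, not my exact logging code) of how the per-layer free memory in finding 1 can be captured with forward hooks:

import torch

def log_free_memory(layer_idx):
    # Returns a forward hook that prints device-wide free memory after the given layer runs.
    def hook(module, inputs, output):
        free, total = torch.cuda.mem_get_info()
        print(f"after layer {layer_idx}: free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
    return hook

# e.g. for a HF-style llama model (the attribute path is an assumption):
# for i, layer in enumerate(model.model.layers):
#     layer.register_forward_hook(log_free_memory(i))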

My temporary solution is as follows:

  1. Manually add torch.cuda.empty_cache() in worker.py before the line free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info() (see the sketch after this list). This removes the impact of fragmentation.
  2. The above change makes OOM possible when actually serving the model, because empty_cache() also removes the intermediate tensors of the forward pass from the measurement. As a result, tuning --gpu-memory-utilization becomes more important, since it now has to cover the forward-pass intermediates as well. Here are my test results with different utilization values:
    • 0.8: 2828 GPU blocks = 45248 tokens
    • 0.9: 3644 GPU blocks = 58304 tokens
    • 1.0: OOM
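
For step 1 above, a minimal sketch of where the call goes (the surrounding code in vllm/worker/worker.py may differ between versions; only the empty_cache() line is the actual change):

import torch

# ... end of the profiling forward pass in Worker.profile_num_available_blocks() ...
torch.cuda.synchronize()
torch.cuda.empty_cache()  # added: release cached-but-unused allocator segments so that
                          # fragmentation from the profiling pass is not counted as used memory
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()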

Having the same issue on CUDA 11.8 with vLLM 0.2.5 and 0.2.6.

FYI @pseudotensor

I’ve tested the memory footprint of 0.2.4 and 0.2.7 and these are my findings:

  • I’m sure that https://github.com/vllm-project/vllm/pull/2031 is correct and should be there.
    |<-------------------------------------total GPU memory---------------------------------------->|
    |<---Allocated by torch allocator--->|<--Allocated by NCCL, cuBLAS, etc-->|<--free GPU memory-->|
    
    before #2031 non-torch-related allocations were completely ignored.
  • #2031 just computes it correctly. We still need to fix peak memory consumption in case of multiple memory-consuming processes.
  • Running 0.2.4 and 0.2.7, a model consumes exactly the same amount of memory (measured both the old and the new way).
  • Changing nccl version doesn’t change memory consumption significantly (~10MB).
  • When using --enforce-eager the memory consumption is a little bit lower.
  • Using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True helps and also makes execution with and without --enforce-eager identical (see the example after this list). I’m not sure how stable it is, as it’s marked as experimental.
  • I believe that by carefully tuning gpu_memory_utilization we can get the original behavior back, as I don’t see an increase in memory consumption.
  • It’s better to fully dedicate a subset of GPUs to a single vLLM model and not share a GPU across multiple models, as the NCCL, cuBLAS, and torch overheads will multiply.
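
As an illustration of the allocator setting mentioned above (an untested sketch, reusing one of the single-GPU commands from this thread; the variable only needs to be set before the process initializes CUDA):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 \
        --model=NumbersStation/nsql-llama-2-7B --seed 1234 --max-model-len=4096 \
        --gpu-memory-utilization 0.6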

Reverting avoided the error in the title, but it then went GPU OOM with the same long-context query, unlike 0.2.4. FYI @sh1ng

@Snowdar @hanzhi713 et al. I want to be clear again. The primary issue is that even a single model sharded across GPUs no longer works. Forget about multiple models per GPU for now.

That is, on AWS 4*A10G, vLLM 0.2.4 and lower work perfectly fine and leave plenty of room without any failure.

However, on 0.2.5+, no matter what GPU utilization settings etc. we use, the llama 70B AWQ model never fits on the 4 A10Gs, while before it was perfectly fine (even under heavy use for long periods).

same here