vllm: IMPORTANT Bug: Model returns empty response (output len = 0) when receiving multiple concurrent requests.
When I ran a batch of load tests against the vLLM endpoint using the OpenAI API, I found that the server returns 20% to 50% empty responses when it receives multiple concurrent requests. Configuration:
vLLM==0.3.0
Model: Zephyr-7b-beta, although it is even worse with bigger models like Llama-2-70b and Mixtral 8x7b
CUDA 12.2
1 x A100 80GB GPU
Number of concurrent requests: 100, with the request rate left at the default "inf", meaning all 100 requests are sent concurrently.
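For clarity, this is the pattern the "request rate = inf" setting produces: all requests are launched at once with asyncio.gather, with no pacing between them. The sketch below is mine, not the benchmark code; fake_request is a stand-in for the real HTTP call to the vLLM OpenAI endpoint.

```python
import asyncio

async def fake_request(i: int) -> str:
    # Yield control once, the way a real HTTP call would while awaiting I/O.
    await asyncio.sleep(0)
    # A real client would return the generated text from the server here.
    return f"response-{i}"

async def run_load_test(num_concurrent: int) -> list[str]:
    # No delay between submissions: every coroutine is scheduled immediately,
    # so the server sees num_concurrent requests arriving at the same time.
    tasks = [fake_request(i) for i in range(num_concurrent)]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_load_test(100))
empty = sum(1 for r in results if r == "")
print(f"{len(results) - empty} / {len(results)} non-empty")
```

With the real endpoint, the empty count above is where the 24/100 failures show up.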
How to replicate: I'm using the latest benchmark_serving.py (as of 5.3.2024) with the following modifications: I commented out L69-L72 (https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L69) and changed the following:
- Line 88 to:
if prompt_len < MIN_PROMPT_LEN or output_len < MIN_OUTPUT_LEN:
- Line 93 to:
if prompt_len > MAX_PROMPT_LEN or output_len > MAX_OUTPUT_LEN:
and set the parameters like this:
MIN_PROMPT_LEN = 400
MAX_PROMPT_LEN = 700
MAX_OUTPUT_LEN = 300
MIN_OUTPUT_LEN = 100
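Put together, the modified filter looks roughly like this. The keep_request helper is my own wrapper for illustration; the original script applies the same two conditions inline while sampling requests from the dataset.

```python
# Length windows used for the load test (same values as above).
MIN_PROMPT_LEN = 400
MAX_PROMPT_LEN = 700
MIN_OUTPUT_LEN = 100
MAX_OUTPUT_LEN = 300

def keep_request(prompt_len: int, output_len: int) -> bool:
    # Modified line 88: drop requests that are too short.
    if prompt_len < MIN_PROMPT_LEN or output_len < MIN_OUTPUT_LEN:
        return False
    # Modified line 93: drop requests that are too long.
    if prompt_len > MAX_PROMPT_LEN or output_len > MAX_OUTPUT_LEN:
        return False
    return True

print(keep_request(500, 200))  # True: within both windows
print(keep_request(300, 200))  # False: prompt too short
```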
Moreover, since the current code at https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L265 doesn't check the output length, I needed to add a check like this:
if generated_text == "":
    output.success = False
else:
    output.success = True
With the above configuration, the success rate is 76/100; 24 requests returned an empty response.
NOTE: I tried replacing aiohttp with httpx and with the openai package as the HTTP client. The problem persists. Moreover, empty responses occur even with a small number of concurrent requests: sending 5 or 10 requests concurrently yields 2-3 empty responses.
I tested with Mixtral 8x7b and unquantized Llama-2-70b; the number of empty responses is even worse, up to 50%.
My suggestion is to put a rate limiter (maybe just a decorator or an await rate_limit() function) into api_server.py to limit the request rate the server will handle to some predefined number, e.g. 10 requests / 5 seconds. I could provide an MR later if needed.
Please take a look and let me know if something is wrong here. Thanks in advance.
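A minimal sketch of the kind of limiter I mean, enforcing "N requests per T seconds". RateLimiter and handle_request are hypothetical names for illustration, not vLLM's actual code:

```python
import asyncio
import time

class RateLimiter:
    """Crude fixed-window limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = 0.0
        self.count = 0

    async def acquire(self) -> None:
        # Loop until the current window has a free slot.
        while True:
            now = time.monotonic()
            if now - self.window_start >= self.window_seconds:
                # Window expired: start a fresh one.
                self.window_start, self.count = now, 0
            if self.count < self.max_requests:
                self.count += 1
                return
            # Window is full: wait a bit for it to roll over.
            await asyncio.sleep(0.05)

async def handle_request(i: int, limiter: RateLimiter) -> int:
    await limiter.acquire()  # throttle before doing any generation work
    return i

async def main() -> list[int]:
    limiter = RateLimiter(max_requests=10, window_seconds=5.0)
    return await asyncio.gather(*(handle_request(i, limiter) for i in range(10)))

results = asyncio.run(main())
print(f"handled {len(results)} requests")
```

In a real server the acquire() call would sit at the top of the completion handler; requests beyond the window's budget simply wait instead of being processed in parallel.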
About this issue
- Original URL
- State: closed
- Created 4 months ago
- Comments: 21
Hi, thanks for your quick response. As said above, I modified https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L265 to add the empty-response check shown above.
And it is INSIDE the if statement:
(https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L248)
It means that even when response.status == 200, the model outputs empty text.
I'm setting the request-rate to "inf" in benchmark_serving.py to simulate concurrent users.
But what I suggest is putting the rate limiter directly into the API endpoint code in openai/api_server.py, NOT in benchmark_serving.py, in the hope that it will fix the problem by reducing the number of requests the endpoint must handle in parallel.
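An alternative to a time-window limiter, with the same effect at the endpoint, is to cap in-flight requests with an asyncio.Semaphore, roughly what a server-side middleware could do. All names below are illustrative; this is not vLLM's actual server code:

```python
import asyncio

MAX_IN_FLIGHT = 10

async def handle(i: int, sem: asyncio.Semaphore, state: dict) -> int:
    # At most MAX_IN_FLIGHT coroutines may be inside this block at once;
    # the rest queue on the semaphore instead of hitting the engine together.
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for model generation work
        state["in_flight"] -= 1
    return i

async def main() -> tuple[list[int], int]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    state = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(handle(i, sem, state) for i in range(100)))
    return results, state["peak"]

results, peak = asyncio.run(main())
print(f"handled {len(results)} requests, peak concurrency {peak}")
```

Even with 100 requests arriving at once, the engine never sees more than MAX_IN_FLIGHT of them simultaneously, which is the behavior I'd like the endpoint to enforce.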