vllm: IMPORTANT Bug: Model returns empty responses (output len = 0) when receiving multiple concurrent requests.

When I ran a set of load tests against the vLLM endpoint using the OpenAI API, I found that the server returns 20% to 50% empty responses when it receives multiple concurrent requests. Configuration:

vLLM==0.3.0
Model: Zephyr-7b-beta, although the problem is even worse with bigger models like Llama-2-70b and Mixtral 8x7b
CUDA 12.2
1 x A100 80GB GPU

Number of concurrent requests: 100, with the request rate left at the default of "inf", meaning all 100 requests are sent concurrently.

How to replicate: I'm using the latest benchmark_serving.py (as of 5.3.2024) with the following modifications. I commented out L69-L72 (https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L69) and changed the following; a consolidated sketch of the modified filter follows the parameter list below:

  • Line 88 to:
if prompt_len < MIN_PROMPT_LEN or output_len < MIN_OUTPUT_LEN:
  • Line 93 to:
if prompt_len > MAX_PROMPT_LEN or output_len > MAX_OUTPUT_LEN:

and set the params like that:

MIN_PROMPT_LEN = 400
MAX_PROMPT_LEN = 700
MAX_OUTPUT_LEN = 300
MIN_OUTPUT_LEN = 100
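
To make the change concrete, here is a hypothetical, self-contained version of the modified filter. The constant names come from this issue; the function name filter_requests and the tokenization loop are a paraphrase of the sampling logic in benchmark_serving.py, not a copy of the upstream code.

MIN_PROMPT_LEN = 400
MAX_PROMPT_LEN = 700
MAX_OUTPUT_LEN = 300
MIN_OUTPUT_LEN = 100

def filter_requests(dataset, tokenizer):
    """dataset: list of (prompt, completion) string pairs; tokenizer: a HF tokenizer."""
    filtered = []
    for prompt, completion in dataset:
        prompt_len = len(tokenizer(prompt).input_ids)
        output_len = len(tokenizer(completion).input_ids)
        # modified line 88: drop samples that are too short
        if prompt_len < MIN_PROMPT_LEN or output_len < MIN_OUTPUT_LEN:
            continue
        # modified line 93: drop samples that are too long
        if prompt_len > MAX_PROMPT_LEN or output_len > MAX_OUTPUT_LEN:
            continue
        filtered.append((prompt, prompt_len, output_len))
    return filtered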

Moreover, since the current code at https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L265 doesn't check the output length, I needed to add two more lines of code like this:

if generated_text == "":
    output.success = False
else:
    output.success = True

With the above configuration, the success rate is 76 / 100: 24 requests came back with an empty response.

NOTE: I tried replacing aiohttp with httpx and with the openai package as the HTTP client; the problem persists. Moreover, the empty responses happen even with a small number of concurrent requests: if we send 5 or 10 requests concurrently, we get 2 or 3 empty responses.
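
For reference, here is a minimal repro sketch (hypothetical, not the benchmark script) that fires N requests at the OpenAI-compatible /v1/completions endpoint concurrently and counts the empty completions; the server address, model name, and prompt below are assumptions:

import asyncio
import httpx

BASE_URL = "http://localhost:8000/v1/completions"  # assumed server address
MODEL = "HuggingFaceH4/zephyr-7b-beta"              # assumed model name
N_CONCURRENT = 10

async def one_request(client: httpx.AsyncClient) -> str:
    payload = {
        "model": MODEL,
        "prompt": "Explain the difference between TCP and UDP.",
        "max_tokens": 200,
        "temperature": 0.0,
    }
    resp = await client.post(BASE_URL, json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        texts = await asyncio.gather(*(one_request(client) for _ in range(N_CONCURRENT)))
    empty = sum(1 for t in texts if t == "")
    print(f"{empty}/{N_CONCURRENT} responses were empty")

asyncio.run(main())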

I also tested with Mixtral 8x7b and unquantized Llama-2-70b; the number of empty responses is even worse, up to 50%.

My suggestion is to put a rate limiter (maybe just a decorator or an await rate_limit() call) into api_server.py to cap the request rate the server will handle at some predefined number, e.g. 10 requests / 5 seconds. I could provide an MR later if needed.
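
A minimal sketch of the kind of limiter I have in mind, assuming a plain asyncio sliding-window bucket awaited at the top of the endpoint handler (rate_limit, MAX_REQUESTS, and WINDOW_SECONDS are names I made up, not existing vLLM code):

import asyncio
import time

MAX_REQUESTS = 10      # e.g. 10 requests ...
WINDOW_SECONDS = 5.0   # ... per 5 seconds

_lock = asyncio.Lock()
_request_times: list[float] = []

async def rate_limit() -> None:
    """Block until the current request fits inside the sliding window."""
    while True:
        async with _lock:
            now = time.monotonic()
            # drop timestamps that have fallen out of the window
            _request_times[:] = [t for t in _request_times if now - t < WINDOW_SECONDS]
            if len(_request_times) < MAX_REQUESTS:
                _request_times.append(now)
                return
            # wait until the oldest timestamp leaves the window
            wait = WINDOW_SECONDS - (now - _request_times[0])
        await asyncio.sleep(wait)

The handler would then simply call "await rate_limit()" before dispatching the request to the engine.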

Please take a look and let me know if something wrong here. Thanks in advance.

About this issue

  • Original URL
  • State: closed
  • Created 4 months ago
  • Comments: 21

Most upvoted comments

Hi, thanks for your quick response. As said above, I modified https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L265 to add

if generated_text == "":
    output.success = False
else:
    output.success = True

And it is INSIDE the if statement:

if response.status == 200: 

(https://github.com/vllm-project/vllm/blob/main/benchmarks/backend_request_func.py#L248)

It means that even when response.status == 200, the model outputs empty text.

I'm setting the request-rate to "inf" in benchmark_serving.py to simulate concurrent users.

But what I suggest is putting the rate limiter directly into the API endpoint code in openai/api_server.py, NOT in benchmark_serving.py, in the hope that it fixes the problem by reducing the rate the endpoint has to handle in parallel.
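
Since the OpenAI-compatible server is built on FastAPI, one possible integration (again only a sketch; the exact hook point in api_server.py is an assumption on my part) is an asyncio semaphore in an HTTP middleware that caps the number of in-flight requests:

import asyncio
from fastapi import FastAPI, Request

app = FastAPI()

MAX_IN_FLIGHT = 10
_semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.middleware("http")
async def limit_concurrency(request: Request, call_next):
    # Queue excess requests instead of letting them all hit the engine at once.
    async with _semaphore:
        # Note: for streaming responses the semaphore is released when the
        # response object is returned, before the stream finishes, so this
        # mainly limits request dispatch rather than full generation time.
        return await call_next(request)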