BentoML: --production flag slows down my requests

Describe the bug

When I add the --production flag to the bentoml serve command, model serving becomes extremely slow compared to serving without the flag. The --production flag seems to run many predictions in parallel, and this appears to slow down processing. The exact same outputs are produced for the same input data in both cases.

To Reproduce

Expected behavior

I would expect the time to process an input to be identical (or faster) with --production.

Screenshots/Logs

Without --production:

  • Serve the model: bentoml serve my_model:latest
  • Ping the server: curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
  • Output: (screenshot)

With --production:

  • Serve the model: bentoml serve my_model:latest --production
  • Ping the server: curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
  • Output: (screenshot)
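For context, a minimal sketch of what the service behind these commands might look like. The actual framework, saved model signature, and request schema are not shown in the issue; sklearn, the "features" field, and the handler body are placeholders only.

```python
# bento_service.py -- hypothetical reconstruction, not the reporter's actual code.
import bentoml
from bentoml.io import JSON

# The saved model is wrapped in a runner; with --production the runner is
# scheduled in separate worker processes.
runner = bentoml.sklearn.get("my_model:latest").to_runner()

svc = bentoml.Service("my_model", runners=[runner])

@svc.api(input=JSON(), output=JSON())
async def predict(payload: dict) -> dict:
    # Assume the JSON body carries a flat list of numeric features (placeholder).
    features = [payload["features"]]
    # With --production this call crosses a process boundary to the runner worker.
    result = await runner.predict.async_run(features)
    return {"prediction": result.tolist()}
```

With a definition like this, the only difference between the two runs above is the --production flag.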

Environment:

  • OS: Ubuntu 18.04
  • Python Version: 3.9.7
  • BentoML Version: bentoml @ git+https://github.com/bentoml/BentoML.git@3b4bc6a7e061285908a8c4b248c57ad52224c2d2

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

Thank you for looking into this so quickly! I just did some benchmarks: I installed the merged fix using pip install git+https://github.com/bentoml/BentoML.git and double-checked the updated source in site-packages. With the same experimental setup:

  • non-production throughput ~210/sec
  • production throughput ~210/sec

It looks like the throughput is now similar in production, nice!
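For reference, throughput numbers like these can be collected with a small concurrent client. The commenter's actual benchmark tool isn't shown, so the following is only a sketch assuming the /predict endpoint from the report and a placeholder JSON body.

```python
# bench.py -- minimal async load generator (sketch, not the benchmark used above).
import asyncio
import time

import aiohttp

URL = "http://127.0.0.1:5000/predict"  # endpoint from the report
CONCURRENCY = 64                       # number of in-flight requests
REQUESTS_PER_WORKER = 50

async def worker(session: aiohttp.ClientSession, payload: dict) -> None:
    for _ in range(REQUESTS_PER_WORKER):
        async with session.post(URL, json=payload) as resp:
            await resp.read()

async def main() -> None:
    payload = {"features": [0.0, 1.0, 2.0]}  # placeholder; the real test posted image data
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(session, payload) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{CONCURRENCY * REQUESTS_PER_WORKER / elapsed:.1f} requests/sec")

asyncio.run(main())
```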

I did some additional tests which gave me some more observations:

  1. If I skip the call to the model inference runner, throughput is ~540/sec (average image size 120 kB).
  2. If I skip the call to the model inference runner, throughput is ~740/sec (average image size 3.6 kB).
  3. Console logging also seems to slow things down on my machine: redirecting output to /dev/null increases the above from ~740/sec to ~870/sec.

So this gives an idea of the latency caused by everything except the model runner.

When trying to push the maximum throughput even higher I found the following:

  1. The highest throughput I got is ~270/sec with --production, using the reduced 3.6 kB image size and default settings.
  2. GPU load gets close to 100% in non-production and only to around 40% in production.
  3. CPU load stays below 40% on all cores in all cases (production and non-production).

Overall conclusions:

  1. The fix helped solve the slower production performance!
  2. I am still going to look for ways to increase performance further, since the GPU load still stays around 40% and the CPU load is also low, so I think there is some latency headroom somewhere. I am only happy when either the CPU or the GPU is maxed out 😉 If I find something I will post it.

Thank you for the fix

Some performance graphs…

Best run non-production: (screenshot)

Best run production: (screenshot)

Thanks for the detailed report @udevnl!

Just wanted to pop in here and let you all know that this is now being worked on, and performance is going to be much more of a priority moving forward.

We’ve managed to reproduce it, we think we’ve found a major cause of the slowdown, and we’re working on a fix right now.

Hi, people! Experiencing similar behavior here.

  • OnnxRuntime using GPU on a resnet18-like model
  • Using 64 concurrent requests

Running without --production gives around 200/sec. Running with --production gives around 149/sec. Running the same code without BentoML, in a single thread with batches of 32, gives around 350/sec.
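The standalone baseline mentioned above (single thread, batches of 32, no BentoML) would look roughly like the following. The model path, input handling, and image shape are assumptions, not the commenter's exact code.

```python
# Plain onnxruntime baseline (sketch): batched inference in a single thread.
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "resnet18.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

batch = np.random.rand(32, 3, 224, 224).astype(np.float32)  # batch of 32 images
n_batches = 100

start = time.perf_counter()
for _ in range(n_batches):
    session.run(None, {input_name: batch})
elapsed = time.perf_counter() - start
print(f"{n_batches * 32 / elapsed:.1f} images/sec")
```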

Observations:

  • Non-production uses a single process.
  • --production uses multiprocessing. I tried different settings, from the default to specifying --api-workers of 0, 1, 2, up to 8, without improvement.
  • Non-production: CPU load ~70% on all cores, GPU load ~50%.
  • Production: CPU load ~70% on one core and ~40% on the other cores, GPU load ~30%.
  • I tried different settings for max_batch_size in to_runner() with little improvement (see the sketch after this list).
  • Increasing the number of concurrent requests to 128, 256, or 1024 does not improve throughput.
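As mentioned in the list above, batching parameters can be set when creating the runner. A minimal sketch with illustrative values follows; the model tag is a placeholder, and adaptive batching only applies if the model was saved with a batchable signature.

```python
import bentoml

# Adaptive-batching overrides on the runner (values here are illustrative only).
runner = bentoml.onnx.get("resnet18:latest").to_runner(
    max_batch_size=64,    # cap on how many queued requests are merged per batch
    max_latency_ms=500,   # latency budget the batching dispatcher works against
)
```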

Theory:

  • Somehow the multiprocessing approach used in production lowers the total throughput?
  • I also used an older version of BentoML (0.12.x) a while ago, and I remember that in that version I was able to get the GPU load close to 100% after tuning the number of batch workers. Because there is some pre- and post-processing, I want to use BentoML's multi-process approach: ideally the pre-/post-processing runs in parallel on the CPU while the model is invoked as often as needed to keep the GPU close to 100% load (see the sketch at the end of this comment). Currently neither the CPU nor the GPU reaches 100%, even if I increase the number of concurrent requests to 128, 256 or even 1024.

I will keep searching and digging. But also following this issue.
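To make the theory above concrete, here is a rough sketch, not the commenter's actual code, of the intended shape: CPU-side pre-processing in the API workers, with inference awaited on a GPU-backed ONNX runner so many requests can be in flight at once. The model tag, the preprocessing, and the assumption that the ONNX signature exposes a run method are all placeholders.

```python
# Hypothetical sketch of "pre/post on CPU, model on GPU" with BentoML runners.
import numpy as np

import bentoml
from bentoml.io import Image, JSON

runner = bentoml.onnx.get("resnet18:latest").to_runner()  # placeholder model tag
svc = bentoml.Service("resnet18_service", runners=[runner])

@svc.api(input=Image(), output=JSON())
async def predict(img) -> dict:
    # Pre-processing runs in the API worker process on the CPU...
    arr = np.asarray(img.convert("RGB").resize((224, 224)), dtype=np.float32) / 255.0
    batch = np.transpose(arr, (2, 0, 1))[None, ...]  # NCHW, batch of one
    # ...while inference is awaited on the runner process, so pre-/post-processing
    # of other requests can proceed concurrently.
    out = await runner.run.async_run(batch)
    scores = out[0] if isinstance(out, (list, tuple)) else out
    return {"scores": np.asarray(scores).squeeze().tolist()}
```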

Just did a fresh install with the latest main branch; still observing the same behaviour.