BentoML: --production flag slows down my requests
Describe the bug
When I add the `--production` flag to the `bentoml serve` command, model serving becomes extremely slow compared to running without the flag. The `--production` flag seems to run many predictions in parallel, which slows down processing. The exact same outputs are produced for the same inputs in both cases.
To Reproduce
Expected behavior
I would expect the time to process an input to be identical (or faster) with `--production`.
Screenshots/Logs
Serve the model with:
bentoml serve my_model:latest
Ping server:
curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output: [screenshot]
Serve the model with:
bentoml serve my_model:latest --production
Ping server:
curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output: [screenshot]
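For context, a service of the shape implied by these commands could be defined along the following lines. This is a minimal sketch, not the reporter's actual code: the scikit-learn framework, the JSON IO descriptor, and the model name are assumptions, and exact API names vary across the BentoML 1.0 pre-releases discussed in this thread.

```python
# service.py — hypothetical sketch of a service like the one being benchmarked
import bentoml
from bentoml.io import JSON

# load the saved model as a runner (commenters below use to_runner() as well)
runner = bentoml.sklearn.get("my_model:latest").to_runner()

svc = bentoml.Service("my_model", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(input_data):
    # delegate inference to the runner; in --production mode this call crosses
    # a process boundary, which is where the reported slowdown shows up
    return runner.predict.run([input_data])
```

Such a service would then be served with `bentoml serve service.py:svc`, with or without `--production` appended, as above.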
Environment:
- OS: Ubuntu 18.04
- Python Version: 3.9.7
- BentoML Version: bentoml @ git+https://github.com/bentoml/BentoML.git@3b4bc6a7e061285908a8c4b248c57ad52224c2d2
About this issue
- State: closed
- Created 2 years ago
- Comments: 23 (12 by maintainers)
Thank you for looking into this so quickly! Just did some benchmarks. I installed the merge using
pip install git+https://github.com/bentoml/BentoML.git
and double-checked the updated source in site-packages. Given the same experimental setup:
- non-`production`: throughput ~210/sec
- `production`: throughput ~210/sec

It looks like the throughput is now similar in production, nice!
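For reference, throughput numbers of this kind can be collected with a small concurrent load generator. A minimal sketch, assuming the endpoint from the repro steps above; the request count and concurrency level are arbitrary choices, not from the thread:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:5000/predict"  # endpoint from the repro steps above
PAYLOAD = "path_to_file"               # placeholder body, as in the curl example
N_REQUESTS = 2000
CONCURRENCY = 32

def hit(_):
    resp = requests.post(
        URL, data=PAYLOAD, headers={"content-type": "application/json"}
    )
    resp.raise_for_status()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    # issue requests from many threads so the server sees sustained load
    list(pool.map(hit, range(N_REQUESTS)))
elapsed = time.perf_counter() - start
print(f"{N_REQUESTS / elapsed:.1f} requests/sec")
```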
I did some additional tests which gave me some more observations; these give an idea of the latency caused by everything except the model runner.

When trying to push the maximum throughput even higher I found the following:
- `production` … using a reduced image size of 3.6 kB vs. using the default settings
- … and to around 40% in `production` (… `production` and non-`production`)

Overall conclusions: …

Thank you for the fix!
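One way to approximate that non-runner overhead is to time single requests with a tiny body, in the spirit of the reduced 3.6 kB image mentioned above. A sketch only; the endpoint and payload are assumptions, and it presumes the service tolerates an arbitrary small body:

```python
import time

import requests

URL = "http://127.0.0.1:5000/predict"  # endpoint from the repro steps above
tiny_payload = "x" * 3600              # ~3.6 kB body, standing in for the small image

requests.post(URL, data=tiny_payload)  # warm-up so startup cost doesn't skew results

latencies_ms = []
for _ in range(100):
    t0 = time.perf_counter()
    requests.post(URL, data=tiny_payload)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

latencies_ms.sort()
# with the model doing near-trivial work, these numbers are dominated by
# serialization, HTTP handling, and inter-process hops
print(f"p50={latencies_ms[49]:.1f} ms  max={latencies_ms[-1]:.1f} ms")
```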
Some performance graphs…

Best run non-`production`: [performance graph]

Best run `production`: [performance graph]
Thanks for the detailed report @udevnl!
Just wanted to pop in here and let you all know that this is now being worked on, and performance is going to be much more of a priority moving forward.
We’ve managed to reproduce it, we think we’ve found a major cause of this slowdown, and we’re working on fixing it right now.
Hi, people! Experiencing similar behavior here.

Running without `--production` gives around 200/sec. Running with `--production` gives around 149/sec. Running the same code without BentoML in a single thread using batches of 32 gives around 350/sec.

Observations:
- non-`production` uses a single process; `production` uses multiprocessing. I tried different settings, from the default to specifying `--api-workers` of 0, 1, 2 up to 8, without improvement.
- non-`production`: CPU load ~70% on all cores, GPU load ~50%. `production`: CPU load ~70% on 1 core and ~40% on the other cores, GPU load ~30%.
- I tried different `max_batch_size` values in the `to_runner` with little improvement.

Theory: could it be that `production`'s multi-process setup makes the total throughput lower?

I will keep searching and digging, but I'm also following this issue.
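For anyone trying to reproduce these observations, the two knobs mentioned above can be exercised roughly as follows. A sketch only: the framework module and the parameter values are assumptions, and `to_runner`'s exact signature differs between the 1.0 pre-releases discussed here.

```python
import bentoml

# batching parameters on the runner, as referenced above; values are examples
runner = bentoml.pytorch.get("my_model:latest").to_runner(
    max_batch_size=32,
    max_latency_ms=100,
)
```

The number of API server processes is then varied on the command line, e.g. `bentoml serve service.py:svc --production --api-workers 4`.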
Just did a fresh install with the latest `main` branch; still observing the same behaviour.