BentoML: --production flag slows down my requests
Describe the bug
When I add the `--production` flag to the `bentoml serve` command, model serving becomes extremely slow compared to running without the flag. The `--production` flag appears to run many predictions in parallel, which slows down processing. Both modes return exactly the same outputs for the same input data.
To Reproduce
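The report doesn't include the service definition itself (the serve and curl commands used are listed under Screenshots/Logs below). As context, here is a minimal sketch of a comparable service in the BentoML 1.0 style; the framework (scikit-learn) and endpoint signature are assumptions, not the reporter's actual code:

```python
# service.py - hypothetical reconstruction of the kind of service under test;
# the issue does not show the real one, so framework and I/O are assumptions.
import bentoml
from bentoml.io import JSON

# Load the saved model as a runner (assumes a scikit-learn model named "my_model").
runner = bentoml.sklearn.get("my_model:latest").to_runner()

svc = bentoml.Service("my_model", runners=[runner])

@svc.api(input=JSON(), output=JSON())
def predict(payload):
    # Forward the request to the runner. Without --production this runs
    # in-process; with --production it crosses to a separate runner worker.
    result = runner.predict.run(payload["data"])
    return {"result": result.tolist()}
```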
Expected behavior
I would expect the time to process an input to be identical (or faster) with `--production`.
Screenshots/Logs
Serve the model with:
bentoml serve my_model:latest
Ping server:
curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output:

Serve the model with:
bentoml serve my_model:latest --production
Ping server:
curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict
Output:

Environment:
- OS: Ubuntu 18.04
- Python Version: 3.9.7
- BentoML Version: bentoml @ git+https://github.com/bentoml/BentoML.git@3b4bc6a7e061285908a8c4b248c57ad52224c2d2
About this issue
- State: closed
- Created 2 years ago
- Comments: 23 (12 by maintainers)
Thank you for looking into this so quickly! Just did some benchmarks - I installed the merge using `pip install git+https://github.com/bentoml/BentoML.git` and double-checked the updated source in site-packages. Given the same experimental setup:

- non-`production` throughput: ~210/sec
- `production` throughput: ~210/sec

It looks like the throughput is now similar in production, nice!
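The benchmark script itself isn't included in the thread. A rough sketch of one way numbers like ~210/sec could be measured (the port comes from the curl commands above; the payload, concurrency, and duration here are made up):

```python
# bench.py - hypothetical load generator; not the reporter's actual benchmark.
# Requires aiohttp (pip install aiohttp).
import asyncio
import time

import aiohttp

URL = "http://127.0.0.1:5000/predict"  # endpoint from the curl commands above
CONCURRENCY = 32                       # assumed; the real client count is unknown
DURATION = 30                          # seconds to run the measurement

async def worker(session, payload, counter, deadline):
    # Fire requests back-to-back until the deadline, counting completions.
    while time.monotonic() < deadline:
        async with session.post(URL, json=payload) as resp:
            await resp.read()
        counter[0] += 1

async def main():
    payload = {"data": [[0.0, 0.0, 0.0, 0.0]]}  # placeholder; real input unknown
    counter = [0]
    deadline = time.monotonic() + DURATION
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(worker(session, payload, counter, deadline)
                               for _ in range(CONCURRENCY)))
    print(f"throughput: {counter[0] / DURATION:.1f} req/sec")

if __name__ == "__main__":
    asyncio.run(main())
```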
I did some additional tests which gave me some more observations:
So this gives an idea of the latency caused by everything except the model runner.
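The tests behind this observation weren't preserved, but one way to isolate the latency of everything except the model runner is an endpoint that never calls a runner at all, so timings cover only HTTP handling, serialization, and (under `--production`) the extra process plumbing. A hypothetical sketch:

```python
# noop_service.py - hypothetical overhead probe, not the reporter's actual test.
# The endpoint returns immediately, so measured latency excludes the model runner.
import bentoml
from bentoml.io import JSON

svc = bentoml.Service("noop_service")  # no runners attached

@svc.api(input=JSON(), output=JSON())
def predict(payload):
    return {"ok": True}  # skip inference entirely; measure serving overhead only
```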
When trying to push the maximum throughput even higher I found the following:
- … `production` using a reduced image size of 3.6 kB vs. using default settings
- … and to around 40% in `production`
- … (`production` and non-`production`)

Overall conclusions:
Thank you for the fix
Some performance graphs…

Best run non-`production`: (graph)

Best run `production`: (graph)

Thanks for the detailed report @udevnl!
Just wanted to pop in here and let you all know that this is now being worked on, and performance is going to be much more of a priority moving forward.
We’ve managed to reproduce it and think we’ve found a major cause of the slowdown; we’re working on fixing it right now.
Hi, people! Experiencing similar behavior here.
Running without `--production` gives around 200/sec. Running with `--production` gives around 149/sec. Running the same code without BentoML in a single thread using batches of 32 gives around 350/sec.

Observations:

- non-`production` uses a single process
- `production` uses multiprocessing. I tried different settings, from the default to specifying `--api-workers` of 0, 1, 2 up to 8, without improvement
- non-`production`: CPU load ~70% on all cores, GPU load ~50%
- `production`: CPU load ~70% on 1 core and ~40% on other cores, GPU load ~30%
- trying different `max_batch_size` values in the `to_runner` gave little improvement (see the sketch after this comment)

Theory:

- … `production` makes the total throughput lower?

I will keep searching and digging. But also following this issue.
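The batching experiment mentioned above isn't shown in code anywhere in the thread. A hypothetical sketch of what it could look like at this pre-release commit (the exact `to_runner` signature then is an assumption; in later BentoML 1.0 releases batching is instead enabled via the model's `signatures` at save time and tuned through runner configuration):

```python
# Hypothetical sketch of the batching experiment described above; the
# to_runner arguments mirror what the comment says was tried.
import bentoml

runner = bentoml.sklearn.get("my_model:latest").to_runner(
    max_batch_size=32,   # 32 matches the batch size that gave ~350/sec outside BentoML
    max_latency_ms=100,  # made-up latency budget for the batching window
)
```

The worker-count side of the experiment is the CLI flag already quoted in the comment, e.g. `bentoml serve my_model:latest --production --api-workers 4`.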
Just did a fresh install with the latest `main` branch, still observing the same behaviour.