serve: Extremely slow inference speed
🐛 Describe the bug
I am trying to deploy a PyTorch model retrieved from Huggingface (SentenceBert-like model) inside TorchServe in Docker.
I have done the following:
- Create a `.mar` file of the model and move it into a directory `model_store`.
- Create a custom `config.properties` because I want to serve it in KServe later on. Note that it exposes the model on port 8085.
- Deploy the model inside Docker using `docker run --rm -it -p 8085:8085 -p 8082:8082 -v $(pwd)/model_store:/home/model-server/model-store -v $(pwd)/config.properties:/home/model-server/config.properties pytorch/torchserve:latest-cpu torchserve --start --model-store model-store --models model_name=model_name.mar --ncs`
- Call the model using a cURL request (a sketch of such a request is below). This takes approximately 8 seconds to respond. I know from running a small Python script that this model should respond within milliseconds, so this is extremely slow.
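For reference, the request looks roughly like this. The endpoint follows TorchServe's prediction API; the JSON payload is only an assumption, since the exact format depends on the custom handler of this SentenceBert-like model:

```bash
# Hypothetical request; the payload format depends on the custom handler.
curl -X POST http://localhost:8085/predictions/model_name \
  -H "Content-Type: application/json" \
  -d '{"text": "example sentence to embed"}'
```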
As an alternative, I have tried building a custom Docker image with TorchServe. This worked, and responded with similar speeds as the normal python script. I have no idea what could be causing this difference in inference performance.
Some additional details:
- `transformers==4.25.1` is a dependency.
- I have tried different versions of PyTorch and Transformers, and they all have the same problem, unless I create my own build.
Error logs
Performance issue
Installation instructions
The custom build that actually worked properly:
FROM python:3.9-bullseye
RUN apt-get update && \
apt-get install -y --no-install-recommends \
gcc \
libmariadb-dev \
openjdk-11-jdk \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN pip install \
--disable-pip-version-check \
--no-cache-dir \
--no-compile \
--upgrade \
transformers==4.25.1 torchtext==0.14.1 torchserve==0.7.1 torch==1.13.1 captum==0.6.0
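For completeness, a minimal sketch of how this custom image can be built and started. The image tag, the file layout and the use of `--foreground` are assumptions for illustration, not the exact commands I ran:

```bash
# Build the custom image from the Dockerfile above (tag is illustrative).
docker build -t torchserve-custom .

# Run TorchServe in the foreground so the container keeps running,
# mounting the same model store and config as with the official image.
docker run --rm -it -p 8085:8085 -p 8082:8082 \
  -v $(pwd)/model_store:/home/model-server/model-store \
  -v $(pwd)/config.properties:/home/model-server/config.properties \
  torchserve-custom \
  torchserve --start --foreground \
    --model-store /home/model-server/model-store \
    --models model_name=model_name.mar \
    --ts-config /home/model-server/config.properties --ncs
```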
Model Packaging
The model was packaged into a `.mar` archive with `torch-model-archiver`.
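Roughly the following archiving command was used; the handler and file names below are illustrative, not the exact ones:

```bash
# Illustrative packaging command; handler and file names are assumptions.
torch-model-archiver \
  --model-name model_name \
  --version 1.0 \
  --serialized-file model/pytorch_model.bin \
  --handler handler.py \
  --extra-files "model/config.json,model/tokenizer.json" \
  --export-path model_store
```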
config.properties
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
enable_envvars_config=true
install_py_dep_per_model=true
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"model_name":{"1.0":{"defaultVersion":true,"marName":"model_name.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":10,"responseTimeout":120}}}}
Versions
- pytorch/torchserve: 0.7.1
- torch: 1.13.1+cpu
- torchtext: 0.14.1
- transformers: 4.25.1
Repro instructions
See description of the bug.
Possible Solution
No response
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 15 (2 by maintainers)
I think I found the cause of the issue. The problem seems to be related to my architecture: I am currently running the container on a Mac M2 chip. As shown above, everything runs fine, just slowly.
I ran the experiment on my old laptop, which has an Intel chip, and there the performance is similar between both images.
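If anyone wants to verify this, a quick way to check whether the official image is being run under emulation on Apple Silicon (image name taken from the commands above):

```bash
# Host architecture (arm64 on an M2 Mac).
uname -m
# Architecture the pulled image was built for; if this says amd64 on an
# arm64 host, Docker runs it under emulation, which can explain the slowdown.
docker image inspect --format '{{.Architecture}}' pytorch/torchserve:latest-cpu
# Architecture reported from inside the container.
docker run --rm --entrypoint uname pytorch/torchserve:latest-cpu -m
```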