serve: Extremely slow inference speed

🐛 Describe the bug

I am trying to deploy a PyTorch model retrieved from Hugging Face (a SentenceBERT-like model) with TorchServe in Docker.

I have done the following:

  1. Created a .mar file of the model and moved it into a model_store directory.
  2. Created a custom config.properties because I want to serve the model in KServe later on. Note that it exposes the model on port 8085.
  3. Deployed the model inside Docker using:

docker run --rm -it -p 8085:8085 -p 8082:8082 \
  -v $(pwd)/model_store:/home/model-server/model-store \
  -v $(pwd)/config.properties:/home/model-server/config.properties \
  pytorch/torchserve:latest-cpu \
  torchserve --start --model-store model-store --models model_name=model_name.mar --ncs

  4. Called the model using a cURL request (an example is shown below). This takes approximately 8 seconds to respond. I know that running a small Python script directly against the same model responds within milliseconds, so this is extremely slow.
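
For reference, the request looks roughly like this. The path follows TorchServe's standard inference API (POST /predictions/{model_name}); the JSON body is just a placeholder, since the actual payload depends on my custom handler:

curl -X POST http://localhost:8085/predictions/model_name \
  -H "Content-Type: application/json" \
  -d '{"text": "an example sentence"}'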

As an alternative, I tried building a custom Docker image with TorchServe. This worked and responded at speeds similar to the plain Python script. I have no idea what could be causing this difference in inference performance.

Some additional details:

  • transformers==4.25.1 is a dependency
  • I have tried different versions of PyTorch and Transformers, and they all show the same problem unless I create my own build.

Error logs

Performance issue

Installation instructions

The custom build that actually worked properly:

FROM python:3.9-bullseye

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    gcc \
    libmariadb-dev \
    openjdk-11-jdk \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install \
      --disable-pip-version-check \
      --no-cache-dir \
      --no-compile \
      --upgrade \
      transformers==4.25.1 torchtext==0.14.1 torchserve==0.7.1 torch==1.13.1 captum==0.6.0
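
For comparison, this is roughly how the custom image is built and run; the image tag custom-torchserve is just a placeholder, and the model store and config are mounted the same way as with the official image:

docker build -t custom-torchserve .
docker run --rm -it -p 8085:8085 -p 8082:8082 \
  -v $(pwd)/model_store:/home/model-server/model-store \
  -v $(pwd)/config.properties:/home/model-server/config.properties \
  custom-torchserve \
  torchserve --start --model-store /home/model-server/model-store \
  --models model_name=model_name.mar --ncs --ts-config /home/model-server/config.properties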

Model Packaging

torch-model-archiver
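
The archive is created roughly as follows; the file names (model.pt, handler.py, and the extra files) are placeholders for the actual artifacts of the SentenceBERT-like model:

torch-model-archiver --model-name model_name \
  --version 1.0 \
  --serialized-file model.pt \
  --handler handler.py \
  --extra-files "config.json,vocab.txt" \
  --export-path model_store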

config.properties

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
grpc_inference_port=7070
grpc_management_port=7071
enable_metrics_api=true
metrics_format=prometheus
enable_envvars_config=true
install_py_dep_per_model=true
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"model_name":{"1.0":{"defaultVersion":true,"marName":"model_name.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":10,"responseTimeout":120}}}}
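
With this config the management API shares port 8085 with the inference API, so once the container is up, the worker status for the model can be checked with TorchServe's describe-model endpoint (assuming the container is reachable on localhost):

curl http://localhost:8085/models/model_name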

Versions

  • pytorch/torchserve: 0.7.1
  • torch: 1.13.1+cpu
  • torchtext: 0.14.1
  • transformers: 4.25.1

Repro instructions

See description of the bug.

Possible Solution

No response

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

I think I found the cause of the issue. The problem seems to be related to my architecture: I am currently running the container on a Mac M2 chip. As shown above, everything runs fine there, just very slowly.

I ran the experiment on my old laptop, which has an Intel chip, and there the performance is similar between both images.
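
If anyone hits the same thing: the official pytorch/torchserve:latest-cpu image may only be published for linux/amd64, in which case Docker on Apple Silicon runs it under emulation, which is typically much slower than a natively built image. The platform of the pulled image can be checked with:

docker image inspect pytorch/torchserve:latest-cpu --format '{{.Os}}/{{.Architecture}}'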