server: Python backend on CPU is slower when serving a PyTorch model
Description
I have a Python model that uses a pre-trained RoBERTa model for inference. I have added this model to Triton and serve it with the Python backend. We also have the exact same Python code/model served by a FastAPI application, and both are running on hardware with the same specs. When I compared the two on CPU, the latency with Triton is much higher. I used the PyTorch profiler to debug what is causing the higher latencies with Triton. The screenshots below show the profiler outputs.
Triton-CPU: (PyTorch profiler output screenshot)

FastAPI-CPU: (PyTorch profiler output screenshot)
Based on the screenshots, I can see that native_layer_norm in particular takes significantly longer with Triton than with the same model running in our FastAPI application. native_layer_norm is part of the pre-trained RoBERTa model.
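(For reference, the kind of per-operator breakdown shown in the screenshots can be produced with the autograd profiler that ships with torch 1.6. The snippet below is only an illustrative sketch; roberta-base stands in for the actual fine-tuned model, which is not shown here.)

```python
# Illustrative sketch only: profile per-operator CPU time for a RoBERTa forward pass.
# roberta-base is a placeholder; the issue uses a fine-tuned model not shown here.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

inputs = tokenizer("a sample input sentence", return_tensors="pt")

# torch.autograd.profiler is the profiler API available in torch==1.6.0
with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    model(**inputs)

# Operators such as native_layer_norm appear as individual rows in this table,
# which is what the Triton vs. FastAPI screenshots above compare.
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```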
Triton Information
What version of Triton are you using? Version: 21.07
Are you using the Triton container or did you build it yourself? I built the image myself based on r21.07, but I have also tested serving the model with the official Triton containers r21.07 and r21.08, and the issue remains the same.
To Reproduce
Steps to reproduce the behavior.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
Dependencies: torch==1.6.0 transformers==3.5.1
config.pbtxt
name: "sample-model"
backend: "python"
max_batch_size: 8
input [
{
name: "INPUT0"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_STRING
dims: [1]
}
]
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "<path to execution env>"}
}
instance_group [
{
count: 1
kind: KIND_CPU
}
]
Expected behavior
Ideally, the performance should be similar when the same model is run with Triton.
@tanmayv25 In my initial testing the results look good. The performance is greatly improved. Below is a summary from the initial testing:
Before Fix / After Fix: (results table)
I have some more testing pending. I will update here once I am done with the complete testing.
@tanmayv25 ok, thank you very much.
@tanmayv25 Thank you for running the tests and sharing the results. For my testing, I used JMeter in non-GUI mode. Also, the tests are run from an instance separate from the one where Triton and the FastAPI app are actually running, so it shouldn't affect the performance of Triton. Let me re-run the tests from my end; I will share the results and also share the FastAPI script.
Ok… Let me run with a concurrency of 10 and share that with you.
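(A per-request latency check at a fixed concurrency can also be scripted with tritonclient instead of JMeter. The sketch below is purely illustrative and is not the JMeter test plan used in the issue; the URL and input text are placeholders, and the model name matches the config.pbtxt above.)

```python
# Hypothetical latency check with tritonclient at a fixed concurrency;
# not the JMeter test plan used in the issue. URL and text are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "sample-model"   # matches the config.pbtxt above
URL = "localhost:8000"        # placeholder Triton HTTP endpoint


def single_request(text):
    client = httpclient.InferenceServerClient(url=URL)
    # shape [1, 1]: batch dimension plus the dims: [1] from the config
    data = np.array([[text.encode("utf-8")]], dtype=np.object_)
    infer_input = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")
    infer_input.set_data_from_numpy(data)
    start = time.time()
    client.infer(MODEL_NAME, inputs=[infer_input])
    return time.time() - start


# Concurrency of 10, as discussed above, over 100 requests.
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = sorted(pool.map(single_request, ["a sample input sentence"] * 100))

print("avg %.3fs  p95 %.3fs" % (sum(latencies) / len(latencies), latencies[94]))
```

Triton's bundled perf_analyzer tool is another way to generate concurrent load directly against the model endpoint.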
@tanmayv25 Unfortunately, I couldn't share the actual model, but I tried to reproduce the issue using a different model. It is not as slow as our model, but as the request load increases it performs slower and slower. Please find the required files below. You can download the model files from here: https://drive.google.com/drive/folders/1nzC2_GFh27mt8KP4dfGxewFP8BkEQEHH?usp=sharing
I built this model using the notebook below, saved the model state_dict, and used it for inference: https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb
Triton Model.py
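(The actual model.py is in the shared files above and is not reproduced here. Purely as an illustration of the structure, a Python-backend model for this string-in/string-out setup looks roughly like the following sketch; the tokenizer, checkpoint, and label handling are assumptions, not the author's code.)

```python
# model.py -- illustrative sketch of a Triton Python backend model, not the attached file.
# The tokenizer, checkpoint, and label handling are assumptions for this example;
# the real model loads a fine-tuned state_dict as described in the notebook above.
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import RobertaForSequenceClassification, RobertaTokenizer


class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.model = RobertaForSequenceClassification.from_pretrained("roberta-base")
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # TYPE_STRING inputs arrive as a numpy object array of bytes
            texts = [t.decode("utf-8") for t in in_tensor.as_numpy().reshape(-1)]

            encoded = self.tokenizer(texts, return_tensors="pt",
                                     padding=True, truncation=True)
            with torch.no_grad():
                logits = self.model(**encoded)[0]
            preds = logits.argmax(dim=-1).tolist()

            out = np.array([str(p).encode("utf-8") for p in preds],
                           dtype=np.object_).reshape(-1, 1)
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT0", out)]))
        return responses

    def finalize(self):
        pass
```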
Payload for Triton:
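(The payload used for testing is likewise in the attached files. For the config above, a request against Triton's HTTP/REST (KServe v2) endpoint would look roughly like this sketch; the host, port, and input text are placeholders.)

```python
# Illustrative KServe v2 HTTP payload for the "sample-model" config above.
# Host, port, and the input text are placeholders.
import requests

payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 1],       # [batch, 1] because max_batch_size > 0
            "datatype": "BYTES",   # TYPE_STRING is sent as BYTES over the wire
            "data": ["a sample input sentence"],
        }
    ]
}

resp = requests.post("http://localhost:8000/v2/models/sample-model/infer", json=payload)
print(resp.json())
```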
python app.py
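(The app.py run by this command was shared separately and is not reproduced here. As a hedged sketch of what such a FastAPI baseline typically looks like, with the endpoint path, model, and preprocessing being assumptions:)

```python
# app.py -- illustrative FastAPI baseline, not the script shared in the issue.
# The endpoint path, model, and preprocessing are assumptions for this sketch.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import RobertaForSequenceClassification, RobertaTokenizer

app = FastAPI()
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base")
model.eval()


class PredictRequest(BaseModel):
    text: str


@app.post("/predict")
def predict(req: PredictRequest):
    encoded = tokenizer(req.text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded)[0]
    return {"prediction": int(logits.argmax(dim=-1).item())}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
```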
@SaratM34 Was the same version of PyTorch used in both cases? The slowdown appears to be framework-specific and not from inside Triton. cc @Tabrizian