server: Newer versions of Triton server have a considerable slowdown in start time
Description
At my work we currently develop and deploy our models with Triton server 22.05. We wanted to move to 23.05 when we realized that startup of the new version is more than 3x slower for ONNX models with the TRT optimization flag and 16x slower with offline TRT optimization. After some testing we determined that 2 different architectures, out of the 5 models deployed in the same container, are affected (their start time is considerably longer).
The average startup time over 10 runs for ONNX models with the TRT optimization flag (a sketch of how startup was timed follows the tables):
| version | seconds |
|---|---|
| 22.05 | 205 |
| 23.01 | 422 |
| 23.05 | 750 |
The average startup time for offline TRT models:
| version | seconds |
|---|---|
| 22.05 | 4 |
| 23.05 | 65 |
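A minimal sketch of how such a startup time can be measured, assuming the HTTP endpoint is on the default port 8000 and the model repository is mounted at /models (the flags shown are the 23.x ones; the full per-version commands are under To Reproduce):

```bash
# Timing sketch: start tritonserver in the background, then poll the
# readiness endpoint until it answers and report the elapsed wall-clock time.
start=$(date +%s)
tritonserver --model-repository=/models \
             --strict-readiness true --disable-auto-complete-config &

# /v2/health/ready returns 200 only once all models are loaded.
until curl -sf localhost:8000/v2/health/ready > /dev/null; do
    sleep 1
done
echo "startup took $(( $(date +%s) - start )) seconds"
```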
Triton Information
What version of Triton are you using?
For the purpose of this test: 22.05, 23.01 and 23.05. We want to focus on 22.05 and 23.05.
Are you using the Triton container or did you build it yourself?
We build the containers with the compose.py script in the repository:
python3 compose.py --backend tensorrt --backend pytorch --backend dali --backend=onnxruntime --repoagent checksum --enable-gpu
To Reproduce
We ran these tests on 3 benchmark machines, all set up with Ubuntu 20.04 server and NVIDIA driver 530.41: one with a GeForce 10XX series card, one with an RTX 20XX series card, and one with an RTX 30XX series card. All of them show the same problem.
For version 22.05 we start the triton server with --strict-readiness true --strict-model-config true
For versions 23.01 and 23.05 we start the triton server with --strict-readiness true --disable-auto-complete-config
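Putting it together, the launch looks roughly like this (the image tags produced by compose.py and the model-repository path are placeholders, not our exact names):

```bash
# 22.05: old strict-model-config flag
docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/models:/models tritonserver_custom:22.05 \
    tritonserver --model-repository=/models \
                 --strict-readiness true --strict-model-config true

# 23.01 / 23.05: auto-complete disabled instead
docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/models:/models tritonserver_custom:23.05 \
    tritonserver --model-repository=/models \
                 --strict-readiness true --disable-auto-complete-config
```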
For the offline TRT optimization we build the engines inside the NVIDIA PyTorch container of the matching release, i.e. Triton 22.05 with PyTorch 22.05.
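A hedged sketch of the engine build inside that container; the file names and trtexec shape flags are our assumptions about the setup, matching the 3x256x256 input and max batch size 4 in the configs below:

```bash
# Inside the matching nvcr.io/nvidia/pytorch container: build a TensorRT
# engine from the ONNX model, covering batch sizes 1-4 of the 3x256x256 input.
# The shape flags are only needed when the ONNX batch dimension is dynamic.
trtexec --onnx=model.onnx \
        --saveEngine=model.trt \
        --minShapes=input:1x3x256x256 \
        --optShapes=input:4x3x256x256 \
        --maxShapes=input:4x3x256x256
```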
The config file for the ONNX model with TRT optimization looks like:
name: "example_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"
max_batch_size : 4
dynamic_batching {}
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 256, 256 ]
}
]
output [
{
name: "output_1"
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
}
]
optimization {
execution_accelerators {
gpu_execution_accelerator : [
{
name : "tensorrt"
}
]
}
}
model_warmup {
name: "warmup_1"
batch_size: 1
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_2"
batch_size: 2
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_3"
batch_size: 3
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_4"
batch_size: 4
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
Meanwhile, the config for the offline TRT model looks like:
name: "example_model"
platform: "tensorrt_plan"
default_model_filename: "model.trt"
max_batch_size : 4
dynamic_batching {}
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 256, 256 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
}
]
Unfortunately I cannot share any of the models, but if needed I can look for open-source models that show the same issue. Please let me know if any other information is needed.
Expected behavior
The startup time should be similar across different Triton server versions.
@jnlarrain Thanks for the feedback. I didn’t have a chance to test compose.py yet. I’ll try to prioritize it next week.
@jnlarrain Apologies for the long wait. I’ll prioritize this issue this week and let you know in case I need something else.
Thanks! We’ll look into this.
@oandreeva-nv thanks for your response. This open-source model architecture can be used to replicate the issue; you will need to run the ONNX export step. I hope this is useful.
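For anyone reproducing this with a public model instead of ours, the server is pointed at a standard Triton model repository; a small sketch with placeholder names (the model name, paths and file names are examples, not our real models):

```bash
# Hypothetical layout for reproducing the test with an exported public model.
mkdir -p models/example_model/1
cp config.pbtxt models/example_model/                      # one of the configs shown above
cp exported_model.onnx models/example_model/1/model.onnx   # or model.trt for the offline TRT case
# then start the server against it:
# tritonserver --model-repository=$(pwd)/models ...
```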