server: Newer versions of Triton server have a considerable slowdown in start time
Description
At my work we currently develop and deploy our models with Triton server 22.05. We wanted to move to 23.05 when we realized that startup of the new version is more than 3x slower for ONNX models with the TRT optimization flag and 16x slower with offline TRT optimization. After some testing we determined that 2 different architectures, out of the 5 models deployed in the same container, are affected (their start time is considerably longer).
The average startup time over 10 runs for ONNX models with the TRT optimization flag (a sketch of how startup was timed follows the tables):
| version | seconds |
|---|---|
| 22.05 | 205 |
| 23.01 | 422 |
| 23.05 | 750 |
The average startup time for offline TRT models:
| version | seconds |
|---|---|
| 22.05 | 4 |
| 23.05 | 65 |
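A minimal sketch of how such a startup time can be measured, assuming the HTTP endpoint is on the default port 8000 and the model repository is mounted at /models (the flags shown are the 23.x ones; the full per-version commands are under To Reproduce):

```bash
# Timing sketch: start tritonserver in the background, then poll the
# readiness endpoint until it answers and report the elapsed wall-clock time.
start=$(date +%s)
tritonserver --model-repository=/models \
             --strict-readiness true --disable-auto-complete-config &

# /v2/health/ready returns 200 only once all models are loaded.
until curl -sf localhost:8000/v2/health/ready > /dev/null; do
    sleep 1
done
echo "startup took $(( $(date +%s) - start )) seconds"
```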
Triton Information
What version of Triton are you using?
For the purpose of this test: 22.05, 23.01 and 23.05. We want to focus on 22.05 and 23.05.
Are you using the Triton container or did you build it yourself?
We build the containers with the compose.py script in the repository:
python3 compose.py --backend tensorrt --backend pytorch --backend dali --backend=onnxruntime --repoagent checksum --enable-gpu
To Reproduce
We ran these tests on 3 benchmark machines, all set up with Ubuntu 20.04 server and NVIDIA driver 530.41: one with a GeForce 10XX series card, one with an RTX 20XX series card, and one with an RTX 30XX series card. All of them show the same problem.
For version 22.05 we start the triton server with --strict-readiness true --strict-model-config true
For versions 23.01 and 23.05 we start the triton server with --strict-readiness true --disable-auto-complete-config
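Putting it together, the launch looks roughly like this (the image tags produced by compose.py and the model-repository path are placeholders, not our exact names):

```bash
# 22.05: old strict-model-config flag
docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/models:/models tritonserver_custom:22.05 \
    tritonserver --model-repository=/models \
                 --strict-readiness true --strict-model-config true

# 23.01 / 23.05: auto-complete disabled instead
docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/models:/models tritonserver_custom:23.05 \
    tritonserver --model-repository=/models \
                 --strict-readiness true --disable-auto-complete-config
```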
For the offline TRT optimization we build the engines inside the NVIDIA PyTorch container of the matching release, i.e. Triton 22.05 with PyTorch 22.05.
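A hedged sketch of the engine build inside that container; the file names and trtexec shape flags are our assumptions about the setup, matching the 3x256x256 input and max batch size 4 in the configs below:

```bash
# Inside the matching nvcr.io/nvidia/pytorch container: build a TensorRT
# engine from the ONNX model, covering batch sizes 1-4 of the 3x256x256 input.
# The shape flags are only needed when the ONNX batch dimension is dynamic.
trtexec --onnx=model.onnx \
        --saveEngine=model.trt \
        --minShapes=input:1x3x256x256 \
        --optShapes=input:4x3x256x256 \
        --maxShapes=input:4x3x256x256
```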
The config file for the ONNX model with TRT optimization looks like:
name: "example_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"
max_batch_size : 4
dynamic_batching {}
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 256, 256 ]
}
]
output [
{
name: "output_1"
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
}
]
optimization {
execution_accelerators {
gpu_execution_accelerator : [
{
name : "tensorrt"
}
]
}
}
model_warmup {
name: "warmup_1"
batch_size: 1
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_2"
batch_size: 2
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_3"
batch_size: 3
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
model_warmup {
name: "warmup_4"
batch_size: 4
inputs {
key: "input"
value {
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
random_data: true
}
}
}
Meanwhile, the config for the offline TRT model looks like:
name: "example_model"
platform: "tensorrt_plan"
default_model_filename: "model.trt"
max_batch_size : 4
dynamic_batching {}
input [
{
name: "input"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 256, 256 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP32
dims: [ 3, 256, 256 ]
}
]
Unfortunately I cannot share any of the models, but if needed I can look for open-source models that show the same issue. Please let me know if any other information is needed.
Expected behavior
The startup time should be similar across different Triton server versions.
@jnlarrain Thanks for the feedback. I didn’t have a chance to test compose.py yet. I’ll try to prioritize it next week.
@jnlarrain Apologies for the long wait. I’ll prioritize this issue this week and let you know in case I need something else.
Thanks! We’ll look into this.
@oandreeva-nv thanks for your response. This open-source model architecture can be used to replicate the issue; you will need to run the ONNX export step. I hope this is useful.
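For anyone reproducing this with a public model instead of ours, the server is pointed at a standard Triton model repository; a small sketch with placeholder names (the model name, paths and file names are examples, not our real models):

```bash
# Hypothetical layout for reproducing the test with an exported public model.
mkdir -p models/example_model/1
cp config.pbtxt models/example_model/                      # one of the configs shown above
cp exported_model.onnx models/example_model/1/model.onnx   # or model.trt for the offline TRT case
# then start the server against it:
# tritonserver --model-repository=$(pwd)/models ...
```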