server: Allow tritonserver to stay up on model load failure?

Hi, I’m running Triton with tritonserver --model-repository=/models --model-control-mode=none.

Is there a way, in this mode, to allow models to fail to load during startup?

My use case: I have TensorRT models targeting different NVIDIA GPU families, and only one of the models is expected to load successfully on any given machine.
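For context, the repository looks roughly like this (yolox-m-p1000 is taken from the logs below; the second model name is a placeholder for the engine built for another GPU family):

/models/
  yolox-m-p1000/              # TensorRT engine built for one GPU family
    config.pbtxt
    1/
      model.plan
  yolox-m-<other-family>/     # engine for a different GPU family; fails to load on this host
    config.pbtxt
    1/
      model.plan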

Workaround

My workaround right now is to enable explicit model control and then manually request loading of all models (allowing some of them to fail). That way Triton does not terminate itself on startup.
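Roughly, the workaround looks like this (model names are placeholders, and the load requests go through Triton's model repository HTTP API on the default port 8000):

# start with explicit control so nothing is loaded at startup
tritonserver --model-repository=/models --model-control-mode=explicit

# then ask Triton to load each candidate model; the ones that cannot build a
# TensorRT engine on this GPU simply fail to load, and the server stays up
curl -X POST localhost:8000/v2/repository/models/yolox-m-p1000/load
curl -X POST localhost:8000/v2/repository/models/yolox-m-other/load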

Logs

Here are the truncated startup logs:

I0110 21:30:54.920681 60 server.cc:592]
+---------------+---------+---------------------------------------------------------+
| Model         | Version | Status                                                  |
+---------------+---------+---------------------------------------------------------+
| <redacted>    | 1       | READY                                                   |
| <redacted>    | 1       | UNAVAILABLE: Internal: unable to create TensorRT engine |
+---------------+---------+---------------------------------------------------------+

I0110 21:30:54.920786 60 tritonserver.cc:1920]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                  |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                 |
| server_version                   | 2.16.0                                                                                                                                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memo |
|                                  | ry binary_tensor_data statistics                                                                                                                       |
| model_repository_path[0]         | /models                                                                                                                                                |
| model_control_mode               | MODE_NONE                                                                                                                                              |
| strict_model_config              | 1                                                                                                                                                      |
| rate_limit                       | OFF                                                                                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                              |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                               |
| response_cache_byte_size         | 0                                                                                                                                                      |
| min_supported_compute_capability | 6.0                                                                                                                                                    |
| strict_readiness                 | 1                                                                                                                                                      |
| exit_timeout                     | 30                                                                                                                                                     |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+

I0110 21:30:54.920814 60 server.cc:252] Waiting for in-flight requests to complete.
I0110 21:30:54.920823 60 model_repository_manager.cc:1055] unloading: yolox-m-p1000:1
I0110 21:30:54.920868 60 server.cc:267] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0110 21:30:54.920947 60 tensorrt.cc:5272] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0110 21:30:54.937010 60 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 607, GPU 2303 (MiB)
I0110 21:30:54.960855 60 tensorrt.cc:5211] TRITONBACKEND_ModelFinalize: delete model state
I0110 21:30:54.961569 60 model_repository_manager.cc:1166] successfully unloaded 'yolox-m-p1000' version 1
W0110 21:30:55.771926 60 metrics.cc:406] Unable to get power limit for GPU 0. Status:Success, value:0.000000
W0110 21:30:55.771991 60 metrics.cc:424] Unable to get power usage for GPU 0. Status:Success, value:0.000000
W0110 21:30:55.772005 60 metrics.cc:448] Unable to get energy consumption for GPU 0. Status:Success, value:0
I0110 21:30:55.920967 60 server.cc:267] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Instead of the above behavior, I would like Triton to continue serving whatever models have successfully loaded.
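For what it is worth, tritonserver --help also lists --exit-on-error and --strict-readiness; if they behave as their names suggest (I have not confirmed this on 2.16.0, so treat this as a sketch rather than a verified answer), something like this might be closer to what I want:

# keep the server alive even if some models fail during initialization,
# and report readiness based on the server itself rather than on every model
tritonserver --model-repository=/models \
             --exit-on-error=false \
             --strict-readiness=false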

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (1 by maintainers)

Most upvoted comments

Hi!

How can the “0 in-flight non-inference requests” issue be fixed? On my server all models load successfully, but then I get the following:

I0511 04:23:45.550649 1 server.cc:264] Waiting for in-flight requests to complete.
I0511 04:23:45.550655 1 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0511 04:23:45.551024 1 server.cc:295] All models are stopped, unloading models

This then leads to: error: creating server: Internal - failed to load all models

I am a bit confused because the server worked fine before and then switched to this behavior without any obvious changes on my side…

I solved it by removing all unneeded files from the model directory and the model-level directory structure. Sometimes a few restarts were also needed. Triton does not report anywhere which files are causing it problems. @muralinow

Could you please help me by explaining what the unneeded files are?

It means leaving only the files that the model needs: no additional files, no additional directories, and no additional directory levels.

If I have a model path like this in the SeldonDeployment:

modelUri: pvc://ts-seldon-volume/triton/multi

Then in the ts-seldon-volume root directory I should have a triton directory containing only the multi directory. Inside multi there should be only the model directories, each with nested version directories, the config.pbtxt configuration, and the model binary itself (a minimal config.pbtxt sketch follows the listing below).

(base) jovyan@ts-seldon-4-0:~$ ls -la triton/**/**/**
-rw-r--r-- 1 jovyan users  239 Jun 22  2023 triton/multi/cifar10/config.pbtxt
-rw-r--r-- 1 jovyan users  370 Jun 22  2023 triton/multi/simple/config.pbtxt

triton/multi/cifar10/1:
total 12
drwxr-xr-x 3 jovyan users 6144 Jun 22  2023 .
drwxr-xr-x 3 jovyan users 6144 Jun 22  2023 ..
drwxr-xr-x 3 jovyan users 6144 Jun 22  2023 model.savedmodel

triton/multi/simple/1:
total 12
drwxr-xr-x 2 jovyan users 6144 Jun 22  2023 .
drwxr-xr-x 3 jovyan users 6144 Jun 22  2023 ..
-rw-r--r-- 1 jovyan users  310 Jun 22  2023 model.graphdef
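This is not the exact content of the attached configs, but a minimal config.pbtxt sketch of the shape Triton expects; the name, platform, and tensor definitions below follow the simple example from the Triton docs and are placeholders for your own model:

name: "simple"
platform: "tensorflow_graphdef"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_INT32
    dims: [ 16 ]
  },
  {
    name: "INPUT1"
    data_type: TYPE_INT32
    dims: [ 16 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_INT32
    dims: [ 16 ]
  },
  {
    name: "OUTPUT1"
    data_type: TYPE_INT32
    dims: [ 16 ]
  }
]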

multi.zip

I am attaching the zipped models from the Seldon example docs, along with a working SeldonDeployment YAML below that uses a PVC and modelUri.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: multi
  namespace: tomasz
spec:
  predictors:
  - graph:
      implementation: TRITON_SERVER
      logger:
        mode: all
      modelUri: pvc://ts-seldon-volume/triton/multi
      name: multi
      type: MODEL
    name: default
    replicas: 1
    labels:
      sidecar.istio.io/inject: "false"
  protocol: v2
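
Assuming the manifest above is saved as multi-seldon.yaml (the filename is my choice), it can be applied with:

kubectl apply -f multi-seldon.yaml

after which the Triton pod should come up in the tomasz namespace with both models loaded.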