server: error: creating server: INTERNAL - failed to load all models

Description

I just ran a simple demo. The model is downloaded with tensorflow.keras.applications.resnet50 and saved with model.save('./resnet50', save_format='tf').
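
For reference, the export is roughly the following (a minimal sketch, assuming TF 2.x where Keras ships with TensorFlow):

import tensorflow as tf

# Download ResNet50 with pretrained ImageNet weights and save it as a TF SavedModel
model = tf.keras.applications.resnet50.ResNet50(weights='imagenet')
model.save('./resnet50', save_format='tf')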

structure:

models
├── resnet50
│  ├── 1
│  │  └── model.savedmodel
│  │     ├── assets
│  │     ├── saved_model.pb
│  │     └── variables
│  │        ├── variables.data-00000-of-00001
│  │        └── variables.index
│  ├── config.pbtxt
│  └── resnet50_label.txt

config.pbtxt:

name: "resnet50"
platform: "tensorflow_savedmodel"
max_batch_size: 128
input [
{
    name: "input_1"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [224, 224, 3]
}
]
output [
{
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_label.txt"
}
]
instance_group [
{
    count: 1
    kind: KIND_GPU
    gpus: [0]
}
]
dynamic_batching {
    preferred_batch_size: [32, 64]
    max_queue_delay_microseconds: 10
}

Triton Information

What version of Triton are you using? nvcr.io/nvidia/tensorrtserver:19.10-py3

To Reproduce

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v /home/me/models:/models nvcr.io/nvidia/tensorrtserver:19.10-py3 trtserver --model-repository=/models

I’m sure the GPU driver is compatible with this Docker image; I can run another TensorFlow model successfully.

output:

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 19.10 (build 8266503)

Copyright (c) 2018-2019, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

I0415 02:47:12.663739 1 metrics.cc:160] found 3 GPUs supporting NVML metrics
I0415 02:47:12.669485 1 metrics.cc:169]   GPU 0: Tesla V100-PCIE-16GB
I0415 02:47:12.675331 1 metrics.cc:169]   GPU 1: Tesla V100-PCIE-16GB
I0415 02:47:12.681248 1 metrics.cc:169]   GPU 2: Tesla V100-PCIE-16GB
I0415 02:47:12.681421 1 server.cc:110] Initializing TensorRT Inference Server
E0415 02:47:12.795065 1 model_repository_manager.cc:1453] failed to open text file for read /models/.git/config.pbtxt: No such file or directory
E0415 02:47:12.795112 1 model_repository_manager.cc:1453] failed to open text file for read /models/.vscode/config.pbtxt: No such file or directory
E0415 02:47:12.795224 1 model_repository_manager.cc:1453] failed to open text file for read /models/models/config.pbtxt: No such file or directory
I0415 02:47:12.799405 1 server_status.cc:83] New status tracking for model 'resnet50'
I0415 02:47:12.799463 1 model_repository_manager.cc:663] loading: resnet50:1
I0415 02:47:12.802320 1 base_backend.cc:166] Creating instance resnet50_0_0_gpu0 on GPU 0 (7.0) using model.savedmodel
2020-04-15 02:47:12.916423: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:12.979608: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-04-15 02:47:13.096501: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2020-04-15 02:47:13.100001: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa836c7bd40 executing computations on platform Host. Devices:
2020-04-15 02:47:13.100032: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-15 02:47:13.100141: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-15 02:47:13.348829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2020-04-15 02:47:13.349927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2020-04-15 02:47:13.352179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
2020-04-15 02:47:13.352188: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-04-15 02:47:13.358074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2
2020-04-15 02:47:21.339782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 02:47:21.339820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2
2020-04-15 02:47:21.339827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y
2020-04-15 02:47:21.339831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y
2020-04-15 02:47:21.339835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N
2020-04-15 02:47:21.345171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14485 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2020-04-15 02:47:21.347898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14485 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:af:00.0, compute capability: 7.0)
2020-04-15 02:47:21.350249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14485 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
2020-04-15 02:47:21.354405: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa3f2795160 executing computations on platform CUDA. Devices:
2020-04-15 02:47:21.354421: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354427: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.354432: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): Tesla V100-PCIE-16GB, Compute Capability 7.0
2020-04-15 02:47:21.595657: I tensorflow/cc/saved_model/loader.cc:204] Restoring SavedModel bundle.
2020-04-15 02:47:22.342995: I tensorflow/cc/saved_model/loader.cc:153] Running initialization op on SavedModel bundle at path: /models/resnet50/1/model.savedmodel
2020-04-15 02:47:22.600629: I tensorflow/cc/saved_model/loader.cc:332] SavedModel load for tags { serve }; Status: success. Took 9684222 microseconds.
I0415 02:47:22.600987 1 model_repository_manager.cc:807] successfully loaded 'resnet50' version 1
I0415 02:47:22.801777 1 model_repository_manager.cc:793] successfully unloaded 'resnet50' version 1
E0415 02:47:22.801841 1 main.cc:1099] error: creating server: INTERNAL - failed to load all models


Expected behavior

Sorry, I cannot find any useful error messages.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 16

Most upvoted comments

That’s correct; all models in the directories will be checked when trtserver starts. The --exit-on-error=false option may help.
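
For example, the run command from above could be repeated with only that flag appended:

nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v /home/me/models:/models nvcr.io/nvidia/tensorrtserver:19.10-py3 trtserver --model-repository=/models --exit-on-error=false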

I agree that all the models in the directories should be checked, but maybe it should skip directories whose names start with a dot? For example, .git/ and .vscode/.

Sorry, using your zip file I cannot reproduce your results on the 20.02, 19.09, and 19.10 trtserver images. Everything seems OK and says “successfully loaded ‘resnet50’ version 1”.

I moved it to a clean directory and it works. Is it because the old folder contains several non-model directories? That doesn’t make sense to me.

E0415 02:47:12.795065 1 model_repository_manager.cc:1453] failed to open text file for read /models/.git/config.pbtxt: No such file or directory
E0415 02:47:12.795112 1 model_repository_manager.cc:1453] failed to open text file for read /models/.vscode/config.pbtxt: No such file or directory
E0415 02:47:12.795224 1 model_repository_manager.cc:1453] failed to open text file for read /models/models/config.pbtxt: No such file or directory

Then I tried git init. After that, it failed to load the models.

Is this model publicly available? If so, could you send me a link to the archived ResNet50 model so I can test it? If not, just forget it.

The model is downloaded with tensorflow.keras.applications.resnet50, following the first example in the Keras docs: https://keras.io/applications/.

The output in config.pbtxt is not valid: use ‘probs’ instead of ‘predictions’.

The original error message from trtserver-20.02 is: model_repository_manager.cc:840] failed to load 'resnet50' version 1: Invalid argument: unexpected inference output 'predictions', allowed outputs are: probs.

After fixing this error, everything works in my 20.02 trtserver image.
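
For reference, the corrected output section of config.pbtxt would then be (same settings as above, only the name changed):

output [
{
    name: "probs"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "resnet50_label.txt"
}
]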

You can use saved_model_cli to view your saved model info: saved_model_cli show --all --dir ./resnet50. Output:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_1'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 224, 224, 3)
        name: serving_default_input_1:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['probs'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict