onnxruntime_backend: Memory Leaks Cause Server OOMs (CPU, TF2/ONNX)

Description When using the latest Triton v22.01 with the ONNX Runtime backend (self-built) or with the TensorFlow backend (image from the NVIDIA Container Registry) and performing CPU inference, there appears to be a memory leak: memory usage increases with each request, eventually leading to an OOM for the Triton Inference Server instance.

Triton Information What version of Triton are you using? v22.01

Are you using the Triton container or did you build it yourself? Both the Triton container version and custom ONNX build are affected.

To Reproduce Steps to reproduce the behavior.

  1. Run the inference server hosting the CRAFT model, either in model.savedmodel format (TensorFlow backend) or in model.onnx format (ONNX Runtime backend).
  2. Perform multiple inferences using tritonclient[all]==2.18.0.
  3. Watch the memory usage of the Triton Inference Server increase (locally via docker stats; see the monitoring sketch right after this list).
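
A quick way to follow step 3 is to poll docker stats while the client loop runs. A minimal monitoring sketch; the container name "triton" is a placeholder for whatever your server container is actually called:

    # Hypothetical monitoring helper: polls `docker stats` once per second and
    # prints the memory usage of the Triton container. Container name is a placeholder.
    import subprocess
    import time

    CONTAINER = "triton"  # replace with the name of your Triton container

    while True:
        mem = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(time.strftime("%H:%M:%S"), mem)
        time.sleep(1)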

Model configuration file:

name: "craft"
backend: "onnxruntime"
max_batch_size: 1
input [
  {
    name: "input_2"
    data_type: TYPE_FP32
    dims: [-1, -1, 3]
  }
]
output [
  {
    name: "conv_cls.8"
    data_type: TYPE_FP32
    dims: [-1, -1, 2]
  }
]
instance_group {
  kind: KIND_CPU
  count: 1
}
model_warmup {
  name: "CRAFT Warmup"
  batch_size: 1
  inputs: {
    key: "input_2"
    value: {
      data_type: TYPE_FP32
      dims: [1024, 1024, 3]
      zero_data: false
    }
  }
}

Example tritonclient usage (Python 3.10):

    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.grpc import InferenceServerClient

    TRITON_GRPC_SERVICE = "localhost:8001"  # gRPC endpoint of the Triton server (default gRPC port)
    TRITON_GRPC_TIMEOUT = 60.0              # example client timeout in seconds
    TRITON_CRAFT_MODEL_NAME = "craft"
    TRITON_CRAFT_MODEL_VERSION = "1"
    TRITON_CRAFT_MODEL_INPUT_NAME = "input_2"
    TRITON_CRAFT_MODEL_OUTPUT_NAME = "conv_cls.8"

    client = InferenceServerClient(url=TRITON_GRPC_SERVICE)

    # input_image is a preprocessed HxWx3 float32 numpy array
    infer_input = grpcclient.InferInput(
        name=TRITON_CRAFT_MODEL_INPUT_NAME,
        shape=[1, input_image.shape[0], input_image.shape[1], input_image.shape[2]],
        datatype="FP32",
    )
    infer_input.set_data_from_numpy(np.array([input_image], dtype=np.float32))
    requested_output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)

    response = client.infer(
        model_name=TRITON_CRAFT_MODEL_NAME,
        model_version=TRITON_CRAFT_MODEL_VERSION,
        inputs=[infer_input],
        outputs=[requested_output],
        client_timeout=TRITON_GRPC_TIMEOUT,
    )
    result = response.as_numpy(TRITON_CRAFT_MODEL_OUTPUT_NAME)
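
To actually see the memory growth, it helps to vary the input resolution between requests (the model accepts dynamic height and width). A minimal repro loop on top of the snippet above, assuming the client and constants are already defined; the resolutions are arbitrary example values:

    # Repro sketch: send requests with different input resolutions so that
    # ONNX Runtime has to keep extending its CPU arena. Sizes are examples.
    sizes = [(640, 640), (1024, 1024), (1280, 1280), (1600, 1600)]

    for step in range(100):
        h, w = sizes[step % len(sizes)]
        image = np.random.rand(h, w, 3).astype(np.float32)

        infer_input = grpcclient.InferInput(TRITON_CRAFT_MODEL_INPUT_NAME, [1, h, w, 3], "FP32")
        infer_input.set_data_from_numpy(image[np.newaxis, ...])
        requested_output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)

        response = client.infer(
            model_name=TRITON_CRAFT_MODEL_NAME,
            inputs=[infer_input],
            outputs=[requested_output],
            client_timeout=TRITON_GRPC_TIMEOUT,
        )
        heatmap = response.as_numpy(TRITON_CRAFT_MODEL_OUTPUT_NAME)
        print(f"step {step}: sent {h}x{w}, got output of shape {heatmap.shape}")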

Example output of Triton Inference Server with --log-verbose=1 flag:

I0217 15:49:51.941706 1 grpc_server.cc:3206] New request handler for ModelInferHandler, 89
I0217 15:49:51.941713 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941723 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941744 1 infer_request.cc:566] prepared: [0x0x7f43c401c490] request id: , model: craft, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
override inputs:
inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
original requested outputs:
conv_cls.8
requested outputs:
conv_cls.8

I0217 15:49:51.941788 1 onnxruntime.cc:2427] model craft, instance craft_0, executing 1 requests
I0217 15:49:51.941796 1 onnxruntime.cc:1334] TRITONBACKEND_ModelExecute: Running craft_0 with 1 requests
2022-02-17 15:49:51.941856332 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:20 (requested) num_bytes: 549453824 (actual) rounded_bytes:549453824
2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution

The last part is especially important: it shows ONNX Runtime (in this case) allocating an additional 1 GiB+ of memory:

2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution

OOM happens after the last log.
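
A quick sanity check of the numbers in that excerpt (my interpretation: the 1 GiB step is the BFC arena's power-of-two extension strategy):

    # Sanity check of the sizes reported by bfc_arena.cc in the log above.
    requested = 549_453_824    # bytes requested for a single intermediate tensor
    extension = 1_073_741_824  # bytes the arena was extended by
    total     = 2_281_701_376  # total bytes held by the arena afterwards

    print(requested / 2**20)   # 524.0  -> exactly 524 MiB for one allocation
    print(extension == 2**30)  # True   -> the arena grows in 1 GiB chunks here
    print(total / 2**30)       # 2.125  -> 2.125 GiB held for a single 1024x1024x3 request

So a single 1024x1024 request already drives the CPU arena past 2 GiB, and since the arena is apparently never shrunk (see the comments below), any later, larger request grows it further until the container limit is hit.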

Expected behavior

NVIDIA Triton Inference Server should not run into an OOM here. Memory should be freed after the inference request has been served.

How can I fix this issue?

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

Thank you for providing the models and doing an investigation on this. After reading the further details you have provided, this does not look like a memory leak to me. Some frameworks do not release their memory after inference is done, and they keep growing their memory pool whenever they need a larger chunk of memory. For example, if you send larger inputs (because of the variable dimensions), the framework may need to allocate additional memory, and if that memory is not available it will cause an OOM.
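
The same behaviour should be reproducible with ONNX Runtime alone, outside of Triton. A minimal sketch, assuming the same model.onnx is available locally and psutil is installed:

    # Standalone ONNX Runtime: run the CRAFT model with growing input sizes and
    # watch the process RSS; with the default CPU arena it is not expected to
    # shrink back between runs.
    import numpy as np
    import onnxruntime as ort
    import psutil

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    proc = psutil.Process()

    for side in (512, 1024, 1536):
        image = np.random.rand(1, side, side, 3).astype(np.float32)
        sess.run(["conv_cls.8"], {"input_2": image})
        print(f"{side}x{side}: RSS = {proc.memory_info().rss / 2**20:.0f} MiB")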

Maybe there’s a way to pass the strategy to the ONNXRuntime for the CPU config, as well?

Looking at the readme, I couldn’t find any options that correspond to this. @tanmayv25 are you aware of any options in this regard?

Maybe we could pass a config/flag to Triton Inference Server on startup so that it knows about the memory limits under which it has to operate?

Unfortunately, this is not an option right now. Some frameworks do not provide the appropriate APIs to control memory usage, and it is out of Triton’s control how these frameworks manage their memory internally.