onnxruntime_backend: Memory Leaks Cause Server OOMs (CPU, TF2/ONNX)
Description
When using the latest Triton v22.01 with the ONNX Runtime backend (self-built) or with the TensorFlow backend (image from the NVIDIA container registry) and performing CPU inference, there appears to be a memory leak: memory usage increases with each request, eventually leading to an OOM for the Triton Inference Server instance.
Triton Information
What version of Triton are you using? v22.01
Are you using the Triton container or did you build it yourself? Both the Triton container version and the custom ONNX Runtime build are affected.
To Reproduce
Steps to reproduce the behavior:
- Run the inference server to host the CRAFT model, either in model.savedmodel format (for the TensorFlow backend) or in model.onnx format.
- Perform multiple inferences using tritonclient[all]==2.18.0.
- Watch the memory usage of the Triton Inference Server increase (locally via docker stats; see the monitoring sketch after this list).
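The memory growth can also be observed programmatically; below is a minimal monitoring sketch (not part of the original report) that polls docker stats from Python. The container name triton-server is an assumption, substitute the name of your Triton container:
# Polls `docker stats` for the Triton container so the per-request memory
# growth is easy to see. The container name is an assumption.
import subprocess
import time

CONTAINER_NAME = "triton-server"  # adjust to match your setup

def memory_usage() -> str:
    # Return the memory usage string reported by `docker stats` for the container.
    return subprocess.check_output(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER_NAME],
        text=True,
    ).strip()

if __name__ == "__main__":
    # Print memory usage every few seconds while inference requests are sent
    # from another process; with the reported behavior, the value keeps climbing.
    while True:
        print(memory_usage())
        time.sleep(5)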
Model configuration file:
name: "craft"
backend: "onnxruntime"
max_batch_size: 1
input [
{
name: "input_2"
data_type: TYPE_FP32
dims: [-1, -1, 3]
}
]
output [
{
name: "conv_cls.8"
data_type: TYPE_FP32
dims: [-1, -1, 2]
}
]
instance_group {
kind: KIND_CPU
count: 1
}
model_warmup {
name: "CRAFT Warmup"
batch_size: 1
inputs: {
key: "input_2"
value: {
data_type: TYPE_FP32
dims: [1024, 1024, 3]
zero_data: false
}
}
}
Example tritonclient usage (Python 3.10):
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc import InferenceServerClient, InferenceServerException

# TRITON_GRPC_SERVICE, TRITON_GRPC_TIMEOUT and input_image (an HxWx3 float32
# array) are defined elsewhere in the application.
client = InferenceServerClient(
    url=TRITON_GRPC_SERVICE,
)

TRITON_CRAFT_MODEL_NAME = "craft"
TRITON_CRAFT_MODEL_VERSION = "1"
TRITON_CRAFT_MODEL_INPUT_NAME = "input_2"
TRITON_CRAFT_MODEL_OUTPUT_NAME = "conv_cls.8"

# Describe the input tensor: batch dimension of 1 plus the HxWx3 image shape.
input = grpcclient.InferInput(
    name=TRITON_CRAFT_MODEL_INPUT_NAME,
    shape=[1, input_image.shape[0], input_image.shape[1], input_image.shape[2]],
    datatype="FP32",
)
output = grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)
input.set_data_from_numpy(np.array([input_image]))

response = client.infer(
    model_name=TRITON_CRAFT_MODEL_NAME, inputs=[input], outputs=[output],
    client_timeout=TRITON_GRPC_TIMEOUT,
)
result = response.as_numpy(name=TRITON_CRAFT_MODEL_OUTPUT_NAME)
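To make the growth easy to trigger, the single-request snippet above can be wrapped in a loop that sends images of varying sizes. This is a sketch, not code from the original report; it reuses the client and constants defined above, and load_test_image is a hypothetical helper that returns HxWx3 float32 arrays of different sizes:
# Repro loop sketch: repeatedly send images with different H and W so the
# backend has to handle variable input shapes; server memory keeps climbing.
# `load_test_image` is a hypothetical helper returning an HxWx3 float32 array.
for step in range(1000):
    input_image = load_test_image(step)  # varying height/width per request
    infer_input = grpcclient.InferInput(
        name=TRITON_CRAFT_MODEL_INPUT_NAME,
        shape=[1, *input_image.shape],
        datatype="FP32",
    )
    infer_input.set_data_from_numpy(np.expand_dims(input_image, axis=0))
    response = client.infer(
        model_name=TRITON_CRAFT_MODEL_NAME,
        inputs=[infer_input],
        outputs=[grpcclient.InferRequestedOutput(TRITON_CRAFT_MODEL_OUTPUT_NAME)],
        client_timeout=TRITON_GRPC_TIMEOUT,
    )
    _ = response.as_numpy(TRITON_CRAFT_MODEL_OUTPUT_NAME)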
Example output of Triton Inference Server with --log-verbose=1 flag:
I0217 15:49:51.941706 1 grpc_server.cc:3206] New request handler for ModelInferHandler, 89
I0217 15:49:51.941713 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941723 1 model_repository_manager.cc:590] GetModel() 'craft' version -1
I0217 15:49:51.941744 1 infer_request.cc:566] prepared: [0x0x7f43c401c490] request id: , model: craft, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
override inputs:
inputs:
[0x0x7f43c401c748] input: input_2, type: FP32, original shape: [1,1024,1024,3], batch + shape: [1,1024,1024,3], shape: [1024,1024,3]
original requested outputs:
conv_cls.8
requested outputs:
conv_cls.8
I0217 15:49:51.941788 1 onnxruntime.cc:2427] model craft, instance craft_0, executing 1 requests
I0217 15:49:51.941796 1 onnxruntime.cc:1334] TRITONBACKEND_ModelExecute: Running craft_0 with 1 requests
2022-02-17 15:49:51.941856332 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:20 (requested) num_bytes: 549453824 (actual) rounded_bytes:549453824
2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
The last part is especially important, since it seems that ONNX Runtime (in this case) allocates more than 1 GB of additional memory:
2022-02-17 15:49:51.941888464 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1073741824 bytes.
2022-02-17 15:49:51.941898397 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 2281701376
2022-02-17 15:49:51.941904942 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7f435bffe040 to 0x7f439bffe040
2022-02-17 15:49:51.959883920 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
OOM happens after the last log.
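For reference, the byte counts in those log lines convert as follows (plain unit conversion of the numbers already shown above):
# Convert the BFC arena byte counts from the log into MiB/GiB.
requested_bytes = 549_453_824    # rounded_bytes requested for this input
extension_bytes = 1_073_741_824  # "Extended allocation by ..." = exactly 1 GiB
total_bytes = 2_281_701_376      # "Total allocated bytes" after the extension

MIB, GIB = 1024 ** 2, 1024 ** 3
print(f"requested: {requested_bytes / MIB:.0f} MiB")  # 524 MiB for one request
print(f"extension: {extension_bytes / GIB:.0f} GiB")  # the arena grows by a full 1 GiB
print(f"total:     {total_bytes / GIB:.3f} GiB")      # 2.125 GiB now held by the arena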
Expected behavior
The NVIDIA Triton Inference Server should not run out of memory; memory should be freed after each inference request is served.
How can I fix this issue?
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 18 (8 by maintainers)
Thank you for providing the models and doing an investigation on this. After reading the further details you have provided, this does not look like a memory leak to me. Some frameworks do not release their memory after inference is done, and they keep growing their memory pool whenever they need a larger chunk of memory. For example, if you send larger inputs (because of the variable dimensions), the framework may need to allocate additional memory, and if that memory is not available it will cause an OOM.
Looking at the README, I couldn't find any option that corresponds to this. @tanmayv25, are you aware of any options in this regard?
Unfortunately, this is not an option right now. Some frameworks do not provide the appropriate APIs to control memory usage, and how these frameworks internally manage their memory is out of Triton's control.
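Given that explanation (the arena grows to fit the largest input it has ever seen), one client-side workaround is to bound the input shape, for example by padding or resizing every image to a fixed maximum size before sending it. The sketch below assumes the CRAFT model tolerates zero padding and reuses the 1024x1024 warmup shape; it is not a fix suggested in this thread:
# Client-side mitigation sketch: pad every image to one fixed shape so the
# backend's memory arena only ever has to grow to a single, bounded size.
# MAX_H and MAX_W are assumptions; pick them to cover your largest input.
import numpy as np

MAX_H, MAX_W = 1024, 1024

def pad_to_fixed_shape(image: np.ndarray) -> np.ndarray:
    # Zero-pad an HxWx3 float32 image to MAX_H x MAX_W x 3.
    # Assumes h <= MAX_H and w <= MAX_W (resize larger images first).
    h, w, c = image.shape
    padded = np.zeros((MAX_H, MAX_W, c), dtype=np.float32)
    padded[:h, :w, :] = image
    return padded

# The padded image is then sent exactly as in the client example above, so
# every request arrives with the same [1, 1024, 1024, 3] shape.
If the model tolerates it, the input dims in config.pbtxt could likewise be pinned to that fixed shape instead of [-1, -1, 3].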