server: CPU memory slowly increases when reusing an InferContext object across many requests

Description I noticed that after a few hours of sending 4 * 500k requests to a Triton server (deployed with 4 TensorRT models), CPU memory had increased by about 2% of 32 GB. I let it run for 1 day and the memory increased further. However, if I create a new InferContext object for every request, memory usage does not go up after sending the same number of requests.

I used the HTTP protocol and synchronous API calls.

Triton Information Server: nvcr.io/nvidia/tritonserver:20.03-py3 Client: nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk

Are you using the Triton container or did you build it yourself? Triton container

To Reproduce Here is the class I use for the Triton client to send requests to the server. Note that I used 4 TritonClient objects, one per model (a sketch of the driving loop is shown after the class).

import sys
import logging
import tensorrtserver.api as triton


def get_model_info(url, protocol, model_name, verbose=False):
    """Return the model's input and output node configs from the server status."""
    ctx = triton.ServerStatusContext(url, protocol, model_name, verbose)
    server_status = ctx.get_server_status()

    if model_name not in server_status.model_status:
        raise Exception("unable to get status for {}".format(model_name))

    status = server_status.model_status[model_name]
    config = status.config

    input_nodes = config.input
    output_nodes = config.output

    return input_nodes, output_nodes


class TritonClient:
    def __init__(self, url, protocol, model_name, model_version, verbose=False):
        self.url = url
        self.protocol = triton.ProtocolType.from_str(protocol)
        self.model_name = model_name
        self.model_version = model_version
        input_nodes, output_nodes = get_model_info(self.url, self.protocol, self.model_name)
        self.input_name = input_nodes[0].name
        self.output_names = []
        for output in output_nodes:
            self.output_names.append(output.name)
        self.trt_ctx = triton.InferContext(self.url, self.protocol, self.model_name, self.model_version, verbose=verbose)
        # NOTE: moving this InferContext construction into do_inference resolves the
        # memory growth (see the sketch below the Expected behavior section).

        # Request every output tensor in RAW format.
        self.output_dict = {}
        for name in self.output_names:
            self.output_dict[name] = triton.InferContext.ResultFormat.RAW

    def do_inference(self, x: list, keep_name=False):
        batch_size = len(x)
        try:
            output = self.trt_ctx.run({self.input_name: x}, self.output_dict, batch_size)
        except triton.InferenceServerException as e:
            logging.info(e)
            sys.exit()

        if not keep_name:
            return [output[self.output_names[i]] for i in range(len(self.output_names))]
        return output
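
For reference, the class above was driven by a loop along these lines. This is a reconstruction rather than the original driver: the URL, model names, and input shape are placeholders, and each long-lived TritonClient (and therefore its single InferContext) handles every request, which is the pattern under which the memory growth was observed.

import numpy as np

# Placeholder values; the real deployment served 4 TensorRT models from one server.
URL = "localhost:8000"
MODEL_NAMES = ["model_a", "model_b", "model_c", "model_d"]

# One long-lived client (and InferContext) per model, created once up front.
clients = [TritonClient(URL, "http", name, None) for name in MODEL_NAMES]

for _ in range(500_000):
    # Hypothetical single-item batch; the shape depends on the actual model inputs.
    batch = [np.random.rand(3, 224, 224).astype(np.float32)]
    for client in clients:
        client.do_inference(batch)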

Expected behavior CPU memory should not increase when reusing the same InferContext object across requests.
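
For concreteness, the workaround referenced in the code comment above looks roughly like this. It is a sketch of an alternative do_inference, not verbatim code from the report: the InferContext is constructed per request instead of being cached in __init__, which avoids the observed growth at the cost of re-establishing the context on every call.

    def do_inference(self, x: list, keep_name=False):
        batch_size = len(x)
        # Build a fresh InferContext for every request instead of reusing self.trt_ctx;
        # with this variant the memory growth was not observed.
        trt_ctx = triton.InferContext(self.url, self.protocol, self.model_name,
                                      self.model_version)
        try:
            output = trt_ctx.run({self.input_name: x}, self.output_dict, batch_size)
        except triton.InferenceServerException as e:
            logging.info(e)
            sys.exit()

        if not keep_name:
            return [output[name] for name in self.output_names]
        return output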

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

The growth is mostly due to the underlying frameworks growing. We are moving Triton to an architecture where it will be easier to remove unwanted frameworks from the container. You can actually do that now by using a multistage build and pulling over only the parts you want, but it can be tricky unless you are familiar with Docker.

20.03 supports only the V1 API, so you could use the 20.06-v1 client with it. Once V2 matures a little more we will take it out of beta and will then have some backwards-compatibility guarantees for V2, but for now you should use the V2 client and server from the same release.