server: Dynamic Batching not creating batches correctly and incorrect inference results

Description I am deploying a Triton server to GKE via the gke-marketplace-app documentation. When I try to use dynamic batching, requests are not batched; each is sent with a batch size of 1. Additionally, inference returns only one detection when it should return multiple.

Triton Information The version is 2.17, as this is what the marketplace app deploys.

Are you using the Triton container or did you build it yourself? Deployed via the GCP Marketplace.

To Reproduce I create the inference server with the following config:

name: "sample"
platform: "pytorch_libtorch"
max_batch_size : 16
input [
  {
    name: "INPUT__0"
    data_type: TYPE_UINT8
    format: FORMAT_NCHW
    dims: [ 3, 512, 512 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ -1, 4 ]
  },
  {
    name: "OUTPUT__1"
    data_type: TYPE_INT64
    dims: [ -1 ]
    label_filename: "sample.txt"
  },
  {
    name: "OUTPUT__2"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
dynamic_batching {
    max_queue_delay_microseconds: 50000
}
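
For reference, dynamic_batching can also list preferred batch sizes; the stanza below is only a sketch of that variant (the sizes are placeholders, not part of what I actually deployed):

dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 50000
}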

I am calling inference as follows:

model = "sample"
client = httpclient.InferenceServerClient( url = url )

input_1 = httpclient.InferInput(name = "INPUT__0", shape = list(data.shape), datatype = "UINT8")
input_2 = httpclient.InferInput(name = "INPUT__0", shape = list(data.shape), datatype = "UINT8")

input_1.set_data_from_numpy(data, binary_data = True)
input_2.set_data_from_numpy(data, binary_data = True)

output_00 = httpclient.InferRequestedOutput(name = "OUTPUT__0", binary_data = False)
output_01 = httpclient.InferRequestedOutput(name = "OUTPUT__1", binary_data = False)
output_02 = httpclient.InferRequestedOutput(name = "OUTPUT__2", binary_data = False)

output_10 = httpclient.InferRequestedOutput(name = "OUTPUT__0", binary_data = False)
output_11 = httpclient.InferRequestedOutput(name = "OUTPUT__1", binary_data = False)
output_12 = httpclient.InferRequestedOutput(name = "OUTPUT__2", binary_data = False)

# Is this correct? I tried using reshape in the config, but it did not work. Without this I get errors about data shape.
input_1.set_shape([1, 3, 512, 512])
input_2.set_shape([1, 3, 512, 512])

response_1 = client.async_infer(model_name = model, inputs = [input_1], outputs = [output_00, output_01, output_02])
response_2 = client.async_infer(model_name = model, inputs = [input_2], outputs = [output_10, output_11, output_12])
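
As a side note, since max_batch_size is non-zero, Triton treats the first dimension as the batch dimension, which is why a single image has to be sent as [1, 3, 512, 512]. The lines below are only a sketch of an alternative that bakes the batch dimension into the numpy array instead of calling set_shape afterwards (it reuses data from the snippet above):

import numpy as np

# (3, 512, 512) -> (1, 3, 512, 512), so the declared shape and the data already match
batched = np.expand_dims(data, axis = 0)
input_1 = httpclient.InferInput(name = "INPUT__0", shape = list(batched.shape), datatype = "UINT8")
input_1.set_data_from_numpy(batched, binary_data = True)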

Expected behavior With the above code, when I run print(response_1.get_result().get_response()) I see only one detection, but I know the model detects multiple objects when I run inference directly on my local machine:

{... [{'name': 'OUTPUT__0', 'datatype': 'FP32', 'shape': [1, 4], 'data': [x_min, y_min, x_max, y_max]}, ...}

Additionally, when I run print(client.get_inference_statistics()) I see only a batch size of 1, when I expect 2 in this case:

{ ... 'batch_stats': [{'batch_size': 1, 'compute_input' : {'count': 2 ...}}] ... }

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

Hi @omrifried ,

Thanks for the reference. Do you mind sharing:

  1. a script I can run as-is to generate the torchscript model (or share the model itself)?
  2. the corresponding Triton config.pbtxt to serve the model
  3. client script with sample inputs to run

(I saw some pieces of these above, but having complete versions would save a lot of time when looking into this. Thanks.)


Ticket ref: DLIS-3633

Also note that for the HTTP Python client, you will need to set the concurrency option so that requests are actually sent concurrently: https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_async_infer_client.py#L55-L58
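
For illustration, here is a minimal sketch of that, reusing the inputs/outputs from your snippet above (the URL and model name are placeholders):

import tritonclient.http as httpclient

# concurrency > 1 gives the client a pool of connections, so the two async_infer
# calls below can actually be in flight at the same time and reach the server
# within the dynamic-batching queue delay window.
client = httpclient.InferenceServerClient(url = "localhost:8000", concurrency = 2)

request_1 = client.async_infer(model_name = "sample", inputs = [input_1], outputs = [output_00, output_01, output_02])
request_2 = client.async_infer(model_name = "sample", inputs = [input_2], outputs = [output_10, output_11, output_12])

result_1 = request_1.get_result()
result_2 = request_2.get_result()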