server: Dynamic batching does not work on server side
Description Dynamic batching not working in a BERT-based model
Triton Information
nvcr.io/nvidia/tritonserver:21.10-py3
Are you using the Triton container or did you build it yourself? Container
To Reproduce Steps to reproduce the behavior.
- Create docker network, launch triton and triton-sdk container within the same network
# with one terminal
docker run -it --name triton-dev --gpus all -v /home:/home --network triton nvcr.io/nvidia/tritonserver:21.10-py3 bash
# with new terminal
docker run -it --name triton-perf --gpus all -v /home:/home --network triton nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash
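The triton network referenced by --network triton is assumed to have been created beforehand, presumably with something like:
docker network create triton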
Serve the model with the following configuration (baseline setting) in container triton-dev:
platform: "onnxruntime_onnx"
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "token_type_ids"
data_type: TYPE_INT64
dims: [-1, -1]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [-1, -1, 768]
},
{
name: "1525"
data_type: TYPE_FP32
dims: [-1, 768]
}
]
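For completeness, the server inside triton-dev would then be launched with something along these lines (the model repository path is an assumption, not taken from the original report):
tritonserver --model-repository=/path/to/model_repository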
Run perf_analyzer:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:1,256 --shape token_type_ids:1,256 --shape input_ids:1,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 488
Throughput: 97.6 infer/sec
p50 latency: 10094 usec
p90 latency: 11349 usec
p95 latency: 11823 usec
p99 latency: 16075 usec
Avg HTTP time: 10205 usec (send/recv 349 usec + response wait 9856 usec)
Server:
Inference count: 587
Execution count: 587
Successful request count: 587
Avg request latency: 9578 usec (overhead 44 usec + queue 43 usec + compute input 207 usec + compute infer 9191 usec + compute output 93 usec)
Request concurrency: 2
Client:
Request count: 475
Throughput: 95 infer/sec
p50 latency: 20970 usec
p90 latency: 23418 usec
p95 latency: 24444 usec
p99 latency: 25711 usec
Avg HTTP time: 21034 usec (send/recv 1124 usec + response wait 19910 usec)
Server:
Inference count: 570
Execution count: 570
Successful request count: 570
Avg request latency: 19533 usec (overhead 44 usec + queue 9024 usec + compute input 207 usec + compute infer 10164 usec + compute output 94 usec)
Request concurrency: 3
Client:
Request count: 463
Throughput: 92.6 infer/sec
p50 latency: 32371 usec
p90 latency: 34651 usec
p95 latency: 35557 usec
p99 latency: 36871 usec
Avg HTTP time: 32406 usec (send/recv 888 usec + response wait 31518 usec)
Server:
Inference count: 556
Execution count: 556
Successful request count: 556
Avg request latency: 31166 usec (overhead 42 usec + queue 20376 usec + compute input 209 usec + compute infer 10449 usec + compute output 90 usec)
Request concurrency: 4
Client:
Request count: 454
Throughput: 90.8 infer/sec
p50 latency: 44212 usec
p90 latency: 46953 usec
p95 latency: 48125 usec
p99 latency: 54882 usec
Avg HTTP time: 44282 usec (send/recv 831 usec + response wait 43451 usec)
Server:
Inference count: 542
Execution count: 542
Successful request count: 542
Avg request latency: 43100 usec (overhead 43 usec + queue 32041 usec + compute input 209 usec + compute infer 10718 usec + compute output 89 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 97.6 infer/sec, latency 11823 usec
Concurrency: 2, throughput: 95 infer/sec, latency 24444 usec
Concurrency: 3, throughput: 92.6 infer/sec, latency 35557 usec
Concurrency: 4, throughput: 90.8 infer/sec, latency 48125 usec
Now serve the model with dynamic batching enabled:
platform: "onnxruntime_onnx"
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "token_type_ids"
data_type: TYPE_INT64
dims: [-1, -1]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [-1, -1, 768]
},
{
name: "1525"
data_type: TYPE_FP32
dims: [-1, 768]
}
]
dynamic_batching {
}
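As an aside, the dynamic batcher also accepts tuning options; a hypothetical example (values are illustrative, not part of the original config) would be:
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
Leaving the block empty, as above, simply uses Triton's defaults.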
Test with perf_analyzer using the same arguments:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:1,256 --shape token_type_ids:1,256 --shape input_ids:1,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 457
Throughput: 91.4 infer/sec
p50 latency: 10964 usec
p90 latency: 12173 usec
p95 latency: 12670 usec
p99 latency: 16599 usec
Avg HTTP time: 10918 usec (send/recv 337 usec + response wait 10581 usec)
Server:
Inference count: 549
Execution count: 549
Successful request count: 549
Avg request latency: 10297 usec (overhead 44 usec + queue 111 usec + compute input 213 usec + compute infer 9840 usec + compute output 89 usec)
Request concurrency: 2
Client:
Request count: 447
Throughput: 89.4 infer/sec
p50 latency: 22338 usec
p90 latency: 24242 usec
p95 latency: 25239 usec
p99 latency: 29408 usec
Avg HTTP time: 22347 usec (send/recv 672 usec + response wait 21675 usec)
Server:
Inference count: 537
Execution count: 537
Successful request count: 537
Avg request latency: 21337 usec (overhead 47 usec + queue 10190 usec + compute input 206 usec + compute infer 10803 usec + compute output 91 usec)
Request concurrency: 3
Client:
Request count: 438
Throughput: 87.6 infer/sec
p50 latency: 34307 usec
p90 latency: 36334 usec
p95 latency: 37163 usec
p99 latency: 39162 usec
Avg HTTP time: 34294 usec (send/recv 575 usec + response wait 33719 usec)
Server:
Inference count: 525
Execution count: 525
Successful request count: 525
Avg request latency: 33392 usec (overhead 48 usec + queue 21994 usec + compute input 231 usec + compute infer 11024 usec + compute output 95 usec)
Request concurrency: 4
Client:
Request count: 438
Throughput: 87.6 infer/sec
p50 latency: 45677 usec
p90 latency: 47769 usec
p95 latency: 48829 usec
p99 latency: 51770 usec
Avg HTTP time: 45697 usec (send/recv 641 usec + response wait 45056 usec)
Server:
Inference count: 526
Execution count: 526
Successful request count: 526
Avg request latency: 44707 usec (overhead 48 usec + queue 33334 usec + compute input 209 usec + compute infer 11020 usec + compute output 96 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 91.4 infer/sec, latency 12670 usec
Concurrency: 2, throughput: 89.4 infer/sec, latency 25239 usec
Concurrency: 3, throughput: 87.6 infer/sec, latency 37163 usec
Concurrency: 4, throughput: 87.6 infer/sec, latency 48829 usec
Test the “client-side batching” scenario:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 69336 usec
p90 latency: 71017 usec
p95 latency: 71381 usec
p99 latency: 72006 usec
Avg HTTP time: 69057 usec (send/recv 2670 usec + response wait 66387 usec)
Server:
Inference count: 87
Execution count: 87
Successful request count: 87
Avg request latency: 65992 usec (overhead 58 usec + queue 156 usec + compute input 373 usec + compute infer 64844 usec + compute output 561 usec)
Request concurrency: 2
Client:
Request count: 73
Throughput: 14.6 infer/sec
p50 latency: 139080 usec
p90 latency: 140973 usec
p95 latency: 142347 usec
p99 latency: 143299 usec
Avg HTTP time: 138764 usec (send/recv 3458 usec + response wait 135306 usec)
Server:
Inference count: 87
Execution count: 87
Successful request count: 87
Avg request latency: 134881 usec (overhead 56 usec + queue 65587 usec + compute input 316 usec + compute infer 68354 usec + compute output 568 usec)
Request concurrency: 3
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 208849 usec
p90 latency: 212004 usec
p95 latency: 213297 usec
p99 latency: 215298 usec
Avg HTTP time: 208996 usec (send/recv 3543 usec + response wait 205453 usec)
Server:
Inference count: 86
Execution count: 86
Successful request count: 86
Avg request latency: 205010 usec (overhead 59 usec + queue 135397 usec + compute input 326 usec + compute infer 68653 usec + compute output 575 usec)
Request concurrency: 4
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 280456 usec
p90 latency: 284061 usec
p95 latency: 284732 usec
p99 latency: 287123 usec
Avg HTTP time: 280509 usec (send/recv 3952 usec + response wait 276557 usec)
Server:
Inference count: 86
Execution count: 86
Successful request count: 86
Avg request latency: 276127 usec (overhead 59 usec + queue 206034 usec + compute input 323 usec + compute infer 69150 usec + compute output 561 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 14.4 infer/sec, latency 71381 usec
Concurrency: 2, throughput: 14.6 infer/sec, latency 142347 usec
Concurrency: 3, throughput: 14.4 infer/sec, latency 213297 usec
Concurrency: 4, throughput: 14.4 infer/sec, latency 284732 usec
Testing performance with the batch size explicitly specified in perf_analyzer failed:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -b 8 -u triton-dev:8000
can not specify batch size > 1 as the model does not support batching
To sum up, enabling the dynamic batching feature does not seem to truly activate dynamic batching behavior on the server side, as:
- no throughput gain was observed with respect to the baseline setting; throughput drops from 90.8~97.6 to 87.6~91.4 infer/sec
- enforcing client-side batching logic does give a throughput gain (about 14.5 infer/sec * 8 samples/batch ≈ 116 samples/sec) compared with the baseline setting (around 90~97 infer/sec * 1 sample/batch)
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
I used bert-base-chinese from Hugging Face and exported the model to ONNX format with the optimum CLI (a sketch of the export command follows the table below). Here’s what the model looks like in onnxexplorer:
onnxexp glance -m models/bert-base-chinese/1/model.onnx
13:19:53 04.30 WARNING onnxexplorer.py:43]: disable TensorRT since it was not found.
Exploring on onnx model: models/bert-base-chinese/1/model.onnx
╭────────────────────────────────── model.onnx Summary ──────────────────────────────────╮
│ IR Version: 6 │
│ Opset Version: 11, │
│ Doc: │
│ Producer Name: pytorch │
│ All Ops: │
│ Constant,Unsqueeze,Cast,Sub,Mul,Shape,Gather,Slice,Add,ReduceMean,Pow,Sqrt,Div,MatMul, │
│ Concat,Reshape,Transpose,Softmax,Erf,Gemm,Tanh │
╰────────────────────────────────────────────────────────────────────────────────────────╯
model.onnx Detail
╭─────────────────────────────┬───────────────────────┬─────────────────────┬────────────╮
│ Name │ Shape │ Input/Output │ Dtype │
├─────────────────────────────┼───────────────────────┼─────────────────────┼────────────┤
│ input_ids │ [-1, -1] │ input │ int64 │
│ attention_mask │ [-1, -1] │ input │ int64 │
│ token_type_ids │ [-1, -1] │ input │ int64 │
│ last_hidden_state │ [-1, -1, 768] │ output │ float32 │
│ 1525 │ [-1, 768] │ output │ float32 │
╰─────────────────────────────┴───────────────────────┴─────────────────────┴────────────╯
Table generated by onnxexplorer
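As mentioned above, the export itself was presumably done with something along the lines of (the exact arguments are an assumption):
optimum-cli export onnx --model bert-base-chinese models/bert-base-chinese/1/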
It’s clear that the first two dimensions of the inputs are organized as [BATCH_SIZE, SEQ_LENGTH].
Expected behavior
A gain in throughput should be observed when dynamic batching is enabled during inference.
About this issue
- State: closed
- Created a year ago
- Comments: 18 (9 by maintainers)
I’ve been busy lately. I’ve saved this post and I’ll validate this solution later.
@BorisPolonsky I have a different model but had a similar problem. I had to specify the batch size & shape for the requests to be sent:
perf_analyzer -m bert-base-chinese --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -b 8
and enable batching on the model config side.
Then also use --log-verbose 2 --log-file triton_logs.log when launching the server, and grep:
tail -f triton_logs.log | grep --line-buffered executing
to check if requests are being grouped. You should see something like:
I0517 11:41:01.923802 98 tensorrt.cc:334] model bert-base-chinese, instance bert-base-chinese_0_1, executing 4 requests
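The model configuration the commenter refers to is not included above. A minimal sketch of a batching-enabled config, assuming the first dimension is the batch dimension (an illustration, not the commenter's exact file):
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [-1]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [-1, 768]
  },
  {
    name: "1525"
    data_type: TYPE_FP32
    dims: [768]
  }
]
dynamic_batching {
}
With max_batch_size > 0, Triton prepends the batch dimension implicitly, so the dims entries describe a single request; this is also what lets perf_analyzer's -b 8 be accepted instead of failing with "can not specify batch size > 1 as the model does not support batching".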