server: Dynamic batching does not work on server side
Description Dynamic batching not working in a BERT-based model
Triton Information
nvcr.io/nvidia/tritonserver:21.10-py3
Are you using the Triton container or did you build it yourself? Container
To Reproduce Steps to reproduce the behavior.
- Create docker network, launch triton and triton-sdk container within the same network
# with one terminal
docker run -it --name triton-dev --gpus all -v /home:/home --network triton nvcr.io/nvidia/tritonserver:21.10-py3 bash
# with new terminal
docker run -it --name triton-perf --gpus all -v /home:/home --network triton nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash
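The triton network referenced by --network triton is assumed to have been created beforehand, presumably with something like:
docker network create triton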
Serve the model with the following configuration (baseline setting) in container triton-dev:
platform: "onnxruntime_onnx"
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "token_type_ids"
data_type: TYPE_INT64
dims: [-1, -1]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [-1, -1, 768]
},
{
name: "1525"
data_type: TYPE_FP32
dims: [-1, 768]
}
]
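For completeness, the server inside triton-dev would then be launched with something along these lines (the model repository path is an assumption, not taken from the original report):
tritonserver --model-repository=/path/to/model_repository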
Run perf_analyzer:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:1,256 --shape token_type_ids:1,256 --shape input_ids:1,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 488
Throughput: 97.6 infer/sec
p50 latency: 10094 usec
p90 latency: 11349 usec
p95 latency: 11823 usec
p99 latency: 16075 usec
Avg HTTP time: 10205 usec (send/recv 349 usec + response wait 9856 usec)
Server:
Inference count: 587
Execution count: 587
Successful request count: 587
Avg request latency: 9578 usec (overhead 44 usec + queue 43 usec + compute input 207 usec + compute infer 9191 usec + compute output 93 usec)
Request concurrency: 2
Client:
Request count: 475
Throughput: 95 infer/sec
p50 latency: 20970 usec
p90 latency: 23418 usec
p95 latency: 24444 usec
p99 latency: 25711 usec
Avg HTTP time: 21034 usec (send/recv 1124 usec + response wait 19910 usec)
Server:
Inference count: 570
Execution count: 570
Successful request count: 570
Avg request latency: 19533 usec (overhead 44 usec + queue 9024 usec + compute input 207 usec + compute infer 10164 usec + compute output 94 usec)
Request concurrency: 3
Client:
Request count: 463
Throughput: 92.6 infer/sec
p50 latency: 32371 usec
p90 latency: 34651 usec
p95 latency: 35557 usec
p99 latency: 36871 usec
Avg HTTP time: 32406 usec (send/recv 888 usec + response wait 31518 usec)
Server:
Inference count: 556
Execution count: 556
Successful request count: 556
Avg request latency: 31166 usec (overhead 42 usec + queue 20376 usec + compute input 209 usec + compute infer 10449 usec + compute output 90 usec)
Request concurrency: 4
Client:
Request count: 454
Throughput: 90.8 infer/sec
p50 latency: 44212 usec
p90 latency: 46953 usec
p95 latency: 48125 usec
p99 latency: 54882 usec
Avg HTTP time: 44282 usec (send/recv 831 usec + response wait 43451 usec)
Server:
Inference count: 542
Execution count: 542
Successful request count: 542
Avg request latency: 43100 usec (overhead 43 usec + queue 32041 usec + compute input 209 usec + compute infer 10718 usec + compute output 89 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 97.6 infer/sec, latency 11823 usec
Concurrency: 2, throughput: 95 infer/sec, latency 24444 usec
Concurrency: 3, throughput: 92.6 infer/sec, latency 35557 usec
Concurrency: 4, throughput: 90.8 infer/sec, latency 48125 usec
Now serve the model with dynamic batching enabled:
platform: "onnxruntime_onnx"
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [-1, -1]
},
{
name: "token_type_ids"
data_type: TYPE_INT64
dims: [-1, -1]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [-1, -1, 768]
},
{
name: "1525"
data_type: TYPE_FP32
dims: [-1, 768]
}
]
dynamic_batching {
}
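As an aside, the dynamic batcher also accepts tuning options; a hypothetical example (values are illustrative, not part of the original config) would be:
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 100
}
Leaving the block empty, as above, simply uses Triton's defaults.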
Test with perf_analyzer using the same arguments:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:1,256 --shape token_type_ids:1,256 --shape input_ids:1,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 457
Throughput: 91.4 infer/sec
p50 latency: 10964 usec
p90 latency: 12173 usec
p95 latency: 12670 usec
p99 latency: 16599 usec
Avg HTTP time: 10918 usec (send/recv 337 usec + response wait 10581 usec)
Server:
Inference count: 549
Execution count: 549
Successful request count: 549
Avg request latency: 10297 usec (overhead 44 usec + queue 111 usec + compute input 213 usec + compute infer 9840 usec + compute output 89 usec)
Request concurrency: 2
Client:
Request count: 447
Throughput: 89.4 infer/sec
p50 latency: 22338 usec
p90 latency: 24242 usec
p95 latency: 25239 usec
p99 latency: 29408 usec
Avg HTTP time: 22347 usec (send/recv 672 usec + response wait 21675 usec)
Server:
Inference count: 537
Execution count: 537
Successful request count: 537
Avg request latency: 21337 usec (overhead 47 usec + queue 10190 usec + compute input 206 usec + compute infer 10803 usec + compute output 91 usec)
Request concurrency: 3
Client:
Request count: 438
Throughput: 87.6 infer/sec
p50 latency: 34307 usec
p90 latency: 36334 usec
p95 latency: 37163 usec
p99 latency: 39162 usec
Avg HTTP time: 34294 usec (send/recv 575 usec + response wait 33719 usec)
Server:
Inference count: 525
Execution count: 525
Successful request count: 525
Avg request latency: 33392 usec (overhead 48 usec + queue 21994 usec + compute input 231 usec + compute infer 11024 usec + compute output 95 usec)
Request concurrency: 4
Client:
Request count: 438
Throughput: 87.6 infer/sec
p50 latency: 45677 usec
p90 latency: 47769 usec
p95 latency: 48829 usec
p99 latency: 51770 usec
Avg HTTP time: 45697 usec (send/recv 641 usec + response wait 45056 usec)
Server:
Inference count: 526
Execution count: 526
Successful request count: 526
Avg request latency: 44707 usec (overhead 48 usec + queue 33334 usec + compute input 209 usec + compute infer 11020 usec + compute output 96 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 91.4 infer/sec, latency 12670 usec
Concurrency: 2, throughput: 89.4 infer/sec, latency 25239 usec
Concurrency: 3, throughput: 87.6 infer/sec, latency 37163 usec
Concurrency: 4, throughput: 87.6 infer/sec, latency 48829 usec
Test the “client-side batching” scenario:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -u triton-dev:8000
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Latency limit: 0 msec
Concurrency limit: 4 concurrent requests
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 69336 usec
p90 latency: 71017 usec
p95 latency: 71381 usec
p99 latency: 72006 usec
Avg HTTP time: 69057 usec (send/recv 2670 usec + response wait 66387 usec)
Server:
Inference count: 87
Execution count: 87
Successful request count: 87
Avg request latency: 65992 usec (overhead 58 usec + queue 156 usec + compute input 373 usec + compute infer 64844 usec + compute output 561 usec)
Request concurrency: 2
Client:
Request count: 73
Throughput: 14.6 infer/sec
p50 latency: 139080 usec
p90 latency: 140973 usec
p95 latency: 142347 usec
p99 latency: 143299 usec
Avg HTTP time: 138764 usec (send/recv 3458 usec + response wait 135306 usec)
Server:
Inference count: 87
Execution count: 87
Successful request count: 87
Avg request latency: 134881 usec (overhead 56 usec + queue 65587 usec + compute input 316 usec + compute infer 68354 usec + compute output 568 usec)
Request concurrency: 3
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 208849 usec
p90 latency: 212004 usec
p95 latency: 213297 usec
p99 latency: 215298 usec
Avg HTTP time: 208996 usec (send/recv 3543 usec + response wait 205453 usec)
Server:
Inference count: 86
Execution count: 86
Successful request count: 86
Avg request latency: 205010 usec (overhead 59 usec + queue 135397 usec + compute input 326 usec + compute infer 68653 usec + compute output 575 usec)
Request concurrency: 4
Client:
Request count: 72
Throughput: 14.4 infer/sec
p50 latency: 280456 usec
p90 latency: 284061 usec
p95 latency: 284732 usec
p99 latency: 287123 usec
Avg HTTP time: 280509 usec (send/recv 3952 usec + response wait 276557 usec)
Server:
Inference count: 86
Execution count: 86
Successful request count: 86
Avg request latency: 276127 usec (overhead 59 usec + queue 206034 usec + compute input 323 usec + compute infer 69150 usec + compute output 561 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 14.4 infer/sec, latency 71381 usec
Concurrency: 2, throughput: 14.6 infer/sec, latency 142347 usec
Concurrency: 3, throughput: 14.4 infer/sec, latency 213297 usec
Concurrency: 4, throughput: 14.4 infer/sec, latency 284732 usec
Testing performance with the batch size explicitly specified in perf_analyzer failed:
root@60c8579e0920:/workspace# perf_analyzer -m bert-base-chinese --percentile=95 --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -b 8 -u triton-dev:8000
can not specify batch size > 1 as the model does not support batching
To sum up, enabling the dynamic batching feature does not seem to truly activate dynamic batching behavior on the server side, as:
- no throughput gain was observed with respect to the baseline setting; throughput drops from 90.8~97.6 to 87.6~91.4 infer/sec
- enforcing client-side batching logic does give a throughput gain (about 14.5 infer/sec * 8 samples/batch ≈ 116 samples/sec) compared with the baseline setting (around 90~97 infer/sec * 1 sample/batch)
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
I used bert-base-chinese from Hugging Face and exported the model to ONNX format with the optimum CLI (a sketch of the export command follows the table below). Here’s what the model looks like in onnxexplorer:
onnxexp glance -m models/bert-base-chinese/1/model.onnx
13:19:53 04.30 WARNING onnxexplorer.py:43]: disable TensorRT since it was not found.
Exploring on onnx model: models/bert-base-chinese/1/model.onnx
╭────────────────────────────────── model.onnx Summary ──────────────────────────────────╮
│ IR Version: 6 │
│ Opset Version: 11, │
│ Doc: │
│ Producer Name: pytorch │
│ All Ops: │
│ Constant,Unsqueeze,Cast,Sub,Mul,Shape,Gather,Slice,Add,ReduceMean,Pow,Sqrt,Div,MatMul, │
│ Concat,Reshape,Transpose,Softmax,Erf,Gemm,Tanh │
╰────────────────────────────────────────────────────────────────────────────────────────╯
model.onnx Detail
╭─────────────────────────────┬───────────────────────┬─────────────────────┬────────────╮
│ Name │ Shape │ Input/Output │ Dtype │
├─────────────────────────────┼───────────────────────┼─────────────────────┼────────────┤
│ input_ids │ [-1, -1] │ input │ int64 │
│ attention_mask │ [-1, -1] │ input │ int64 │
│ token_type_ids │ [-1, -1] │ input │ int64 │
│ last_hidden_state │ [-1, -1, 768] │ output │ float32 │
│ 1525 │ [-1, 768] │ output │ float32 │
╰─────────────────────────────┴───────────────────────┴─────────────────────┴────────────╯
Table generated by onnxexplorer
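As mentioned above, the export itself was presumably done with something along the lines of (the exact arguments are an assumption):
optimum-cli export onnx --model bert-base-chinese models/bert-base-chinese/1/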
It’s clear that the first two dimensions of the inputs are organized as [BATCH_SIZE, SEQ_LENGTH].
Expected behavior
A gain in throughput should be observed when dynamic batching is enabled during inference.
About this issue
- State: closed
- Created a year ago
- Comments: 18 (9 by maintainers)
I’ve been busy lately. I’ve saved this post and I’ll validate this solution later.
@BorisPolonsky I have a different model but had a similar problem. I had to specify the batch size & shape for the requests to be sent:
perf_analyzer -m bert-base-chinese --concurrency-range 1:4 --shape attention_mask:8,256 --shape token_type_ids:8,256 --shape input_ids:8,256 -b 8
and enable batching on the model config side.
Then also use --log-verbose 2 --log-file triton_logs.log when launching the server, and grep:
tail -f triton_logs.log | grep --line-buffered executing
to check if requests are being grouped. You should see something like:
I0517 11:41:01.923802 98 tensorrt.cc:334] model bert-base-chinese, instance bert-base-chinese_0_1, executing 4 requests
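The model configuration the commenter refers to is not included above. A minimal sketch of a batching-enabled config, assuming the first dimension is the batch dimension (an illustration, not the commenter's exact file):
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [-1]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [-1, 768]
  },
  {
    name: "1525"
    data_type: TYPE_FP32
    dims: [768]
  }
]
dynamic_batching {
}
With max_batch_size > 0, Triton prepends the batch dimension implicitly, so the dims entries describe a single request; this is also what lets perf_analyzer's -b 8 be accepted instead of failing with "can not specify batch size > 1 as the model does not support batching".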