TensorRT: [question] why unet-trt-int8’s inference is much slower than fp16

Description

I have implemented a Stable Diffusion img2img pipeline using TensorRT fp16. It works well, but we are looking for a faster deployment solution because the end-to-end latency is still too high.

I then tried int8, and according to my test results, running the UNet in int8 precision is much slower than in fp16.

The UNet ONNX model is exported from Stable Diffusion v2.1 and then converted into a TensorRT plan. Tested with an input image size of 512x512, the latency results are as follows:

fp16: 42.872 ms

int8: 111.368 ms

My guess is that TensorRT only provides fused attention implementations (like flash attention) for fp16. When running in int8, many attention layers are broken into plain matmul ops, which results in slow inference. Is that right?
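One way to check a guess like this is to export per-layer information when building the engine and count the precisions the builder actually chose. The sketch below uses trtexec 8.6's `--profilingVerbosity=detailed` and `--exportLayerInfo` flags; the JSON field names and the tiny faked `layers.json` are illustrative stand-ins for a real export, not the exact schema.

```shell
# Sketch: see which precision each layer actually runs in.
# Real build step (needs TensorRT, not run here):
#   trtexec --onnx=unet.onnx --int8 --fp16 --buildOnly \
#           --profilingVerbosity=detailed --exportLayerInfo=layers.json
# A faked minimal layers.json stands in for the real export so the
# counting step can be shown end to end (field names are assumptions):
cat > layers.json <<'EOF'
{"Layers": [
  {"Name": "attn_qk_matmul", "Precision": "INT8"},
  {"Name": "attn_softmax", "Precision": "FP16"},
  {"Name": "conv_in", "Precision": "INT8"}
]}
EOF
# Count how many layers landed in each precision
grep -o '"Precision": "[A-Z0-9]*"' layers.json | sort | uniq -c
```

If most attention matmuls show up unfused and in fp32/fp16 fallback rather than int8, that would support the hypothesis.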

Is it possible to accelerate Stable Diffusion with int8 quantization? I am wondering whether anybody has tried to quantize a Stable Diffusion model. Does it hurt the quality of the results?

Alternatively, it would be great if someone could share any viable acceleration solutions (with better performance, of course).

Environment

TensorRT Version: 8.6 EA
NVIDIA GPU: A10
NVIDIA Driver Version: 470.161.03
CUDA Version: 11.7
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

# build int8 plan
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 --buildOnly --saveEngine=unet-int8.plan --memPoolSize=workspace:13888 --device=1 --int8

# build fp16 plan
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 --buildOnly --saveEngine=unet.plan --memPoolSize=workspace:13888 --device=1 --fp16

# run inference with trtexec
./TensorRT-8.6.0.12/bin/trtexec --loadEngine=./unet-int8.plan --device=2 --iterations=100 --warmUp=500 --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024

./TensorRT-8.6.0.12/bin/trtexec --loadEngine=./unet.plan --device=2 --iterations=100 --warmUp=500 --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024
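To compare the two runs programmatically rather than by eye, the mean GPU latency can be pulled out of a saved trtexec log with a one-line sed. The log line format below matches the performance summaries printed by trtexec 8.6; in real use you would `tee` the trtexec output to a file and grep the `GPU Compute Time` line from it (the file name is a placeholder).

```shell
# Sketch: extract the mean GPU Compute Time from a trtexec log line.
# Here a literal line stands in for `grep 'GPU Compute Time' trt.log`:
line='[05/14/2023-04:31:44] [I] GPU Compute Time: min = 31.1654 ms, max = 33.0587 ms, mean = 32.0889 ms, median = 31.9817 ms'
# Capture the number following "mean = " and preceding " ms"
mean=$(printf '%s\n' "$line" | sed -n 's/.*mean = \([0-9.]*\) ms.*/\1/p')
echo "mean GPU compute: ${mean} ms"   # -> mean GPU compute: 32.0889 ms
```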

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16

Most upvoted comments

I can’t reproduce the issue. int8 + fp16

[05/14/2023-04:31:44] [I] === Performance summary ===
[05/14/2023-04:31:44] [I] Throughput: 30.8434 qps
[05/14/2023-04:31:44] [I] Latency: min = 31.198 ms, max = 33.0905 ms, mean = 32.1222 ms, median = 32.0154 ms, percentile(90%) = 32.916 ms, percentile(95%) = 32.9651 ms, percentile(99%) = 33.0905 ms
[05/14/2023-04:31:44] [I] Enqueue Time: min = 27.7755 ms, max = 29.4277 ms, mean = 28.5828 ms, median = 28.4602 ms, percentile(90%) = 29.3256 ms, percentile(95%) = 29.3436 ms, percentile(99%) = 29.4277 ms
[05/14/2023-04:31:44] [I] H2D Latency: min = 0.0251465 ms, max = 0.0323486 ms, mean = 0.026692 ms, median = 0.0265503 ms, percentile(90%) = 0.0273438 ms, percentile(95%) = 0.0282593 ms, percentile(99%) = 0.0323486 ms
[05/14/2023-04:31:44] [I] GPU Compute Time: min = 31.1654 ms, max = 33.0587 ms, mean = 32.0889 ms, median = 31.9817 ms, percentile(90%) = 32.8837 ms, percentile(95%) = 32.9319 ms, percentile(99%) = 33.0587 ms
[05/14/2023-04:31:44] [I] D2H Latency: min = 0.00537109 ms, max = 0.00866699 ms, mean = 0.00656883 ms, median = 0.00634766 ms, percentile(90%) = 0.00749207 ms, percentile(95%) = 0.00756836 ms, percentile(99%) = 0.00866699 ms
[05/14/2023-04:31:44] [I] Total Host Walltime: 3.08008 s
[05/14/2023-04:31:44] [I] Total GPU Compute Time: 3.04845 s
[05/14/2023-04:31:44] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/14/2023-04:31:44] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/14/2023-04:31:44] [W] * GPU compute time is unstable, with coefficient of variance = 1.81891%.
[05/14/2023-04:31:44] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/14/2023-04:31:44] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/14/2023-04:31:44] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --loadEngine=unet-best.plan --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024

fp16

[05/14/2023-04:32:20] [I] === Performance summary ===
[05/14/2023-04:32:20] [I] Throughput: 23.401 qps
[05/14/2023-04:32:20] [I] Latency: min = 41.2711 ms, max = 43.8199 ms, mean = 42.1853 ms, median = 42.0254 ms, percentile(90%) = 42.7623 ms, percentile(95%) = 43.0861 ms, percentile(99%) = 43.8199 ms
[05/14/2023-04:32:20] [I] Enqueue Time: min = 37.4642 ms, max = 39.7625 ms, mean = 38.2745 ms, median = 38.1261 ms, percentile(90%) = 38.8303 ms, percentile(95%) = 39.1207 ms, percentile(99%) = 39.7625 ms
[05/14/2023-04:32:20] [I] H2D Latency: min = 0.0253906 ms, max = 0.032959 ms, mean = 0.0278485 ms, median = 0.027832 ms, percentile(90%) = 0.0286865 ms, percentile(95%) = 0.0292969 ms, percentile(99%) = 0.032959 ms
[05/14/2023-04:32:20] [I] GPU Compute Time: min = 41.2354 ms, max = 43.7862 ms, mean = 42.1507 ms, median = 41.9922 ms, percentile(90%) = 42.7274 ms, percentile(95%) = 43.052 ms, percentile(99%) = 43.7862 ms
[05/14/2023-04:32:20] [I] D2H Latency: min = 0.00537109 ms, max = 0.00866699 ms, mean = 0.00670592 ms, median = 0.00646973 ms, percentile(90%) = 0.00805664 ms, percentile(95%) = 0.00817871 ms, percentile(99%) = 0.00866699 ms
[05/14/2023-04:32:20] [I] Total Host Walltime: 3.11953 s
[05/14/2023-04:32:20] [I] Total GPU Compute Time: 3.077 s
[05/14/2023-04:32:20] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/14/2023-04:32:20] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/14/2023-04:32:20] [W] * GPU compute time is unstable, with coefficient of variance = 1.16616%.
[05/14/2023-04:32:20] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/14/2023-04:32:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/14/2023-04:32:20] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --loadEngine=unet-fp16.plan --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024

Confirmed.

This performance issue was resolved after upgrading from TensorRT 8.6.0.12 to 8.6.1.6, so I am closing this issue.
