TensorRT: [question] Why is unet-trt-int8 inference much slower than fp16?
Description
I have implemented a Stable Diffusion img2img pipeline using TensorRT fp16. It works well, but we are looking for a faster deployment solution because the whole pipeline's latency is still somewhat too high.
I then tried int8, and according to my test results, running the UNet with int8 precision is much slower than with fp16.
The UNet ONNX model is exported from Stable Diffusion v2.1 and then converted into a TensorRT plan.
Tested with an image input size of 512x512.
Latency results are as follows:
- fp16: 42.872 ms
- int8: 111.368 ms
My guess is that TensorRT only provides fused attention implementations (such as flash attention) for fp16, so when building with int8, many attention layers are broken into plain MatMul ops, which results in slow inference. Is that right?
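One way to check this guess would be to compare per-layer timings of the two engines. trtexec can dump a per-layer profile at run time, roughly like this (the JSON file name is arbitrary; the layer info is more detailed if the engine was built with --profilingVerbosity=detailed):
# dump per-layer timings and layer info for the int8 engine; repeat for the fp16 engine
./TensorRT-8.6.0.12/bin/trtexec --loadEngine=./unet-int8.plan --device=2 \
    --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 \
    --dumpProfile --separateProfileRun --exportProfile=unet-int8-profile.json \
    --dumpLayerInfo
# if the attention blocks appear as many separate MatMul/Softmax layers here but as a few
# fused kernels in the fp16 engine, that would support the hypothesis above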
Is it possible to accelerate Stable Diffusion with int8 quantization? I am wondering whether anybody has tried quantizing a Stable Diffusion model, and whether it hurts the quality of the results.
Alternatively, it would be great if someone could share any viable acceleration solutions with better performance.
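(Side note on quality: the int8 build in Steps To Reproduce below does not supply any calibration data, and in that case trtexec falls back to placeholder dynamic ranges, so the resulting engine is only useful for timing, not for judging output quality. A calibration cache generated offline can be passed in with --calib; the cache file name below is a placeholder.)
# int8 build with a pre-generated calibration cache (unet-int8.cache is a placeholder name)
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx \
    --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 \
    --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 \
    --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 \
    --buildOnly --saveEngine=unet-int8.plan --memPoolSize=workspace:13888 \
    --device=1 --int8 --calib=unet-int8.cache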
Environment
TensorRT Version: TensorRT 8.6 EA (8.6.0.12)
NVIDIA GPU: A10
NVIDIA Driver Version: 470.161.03
CUDA Version: 11.7
Steps To Reproduce
# build int8 plan
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 --buildOnly --saveEngine=unet-int8.plan --memPoolSize=workspace:13888 --device=1 --int8
# build fp16 plan
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 --buildOnly --saveEngine=unet.plan --memPoolSize=workspace:13888 --device=1 --fp16
# run inference with trtexec
./TensorRT-8.6.0.12/bin/trtexec --loadEngine=./unet-int8.plan --device=2 --iterations=100 --warmUp=500 --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024
./TensorRT-8.6.0.12/bin/trtexec --loadEngine=./unet.plan --device=2 --iterations=100 --warmUp=500 --shapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024
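For comparison with the "int8 + fp16" configuration mentioned in the comments below, a variant build that enables both precisions (so that layers without a fast int8 implementation can fall back to fp16 kernels instead of fp32) would look roughly like this; the engine file name is arbitrary:
# build with both int8 and fp16 allowed
./TensorRT-8.6.0.12/bin/trtexec --onnx=unet.onnx \
    --minShapes='sample':2x4x8x8,'encoder_hidden_states':2x77x1024 \
    --optShapes='sample':2x4x64x64,'encoder_hidden_states':2x77x1024 \
    --maxShapes='sample':4x4x96x96,'encoder_hidden_states':4x77x1024 \
    --buildOnly --saveEngine=unet-int8-fp16.plan --memPoolSize=workspace:13888 \
    --device=1 --int8 --fp16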
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16
Confirmed.
This performance issue was resolved after upgrading from TensorRT 8.6.0.12 to 8.6.1.6, so I am closing this issue.
I can’t reproduce the issue.
int8 + fp16:
fp16: