TensorRT: FasterTransformer plugins for TensorRT | add_residual+LayerNorm not being matched by plugin
We are seeing LayerNorms that are not well fused in the TREx output; for example, the generated kernel does a two-pass mean/variance computation where a one-pass implementation would be possible (see the sketch below).
So I wondered whether FasterTransformer's individual kernels exist in the form of TensorRT plugins.
I noticed at https://github.com/nvidia/FasterTransformer/blob/dev/v5.0_beta/docs/bert_guide.md#standard-bert-and-effective-fastertransformer that FT can use TRT kernels for MHA, but I'm wondering whether, in the opposite direction, TRT can use FT's add_bias_input_layernorm.
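For reference, a rough sketch (illustrative only, not TRT kernel code) of the two-pass versus one-pass mean/variance computation I mean:

```python
# Illustrative only: two-pass vs. one-pass mean/variance for LayerNorm.
# Real fused kernels would typically use Welford's algorithm or accumulate
# sum and sum-of-squares in a single sweep over the hidden dimension.
import torch

x = torch.randn(4, 768)

# Two-pass: one pass for the mean, a second pass over the centred values.
mean = x.mean(dim=-1, keepdim=True)
var_two_pass = ((x - mean) ** 2).mean(dim=-1, keepdim=True)

# One-pass style: sum and sum of squares can be accumulated together,
# then combined as E[x^2] - E[x]^2.
var_one_pass = (x * x).mean(dim=-1, keepdim=True) - mean ** 2

print(torch.allclose(var_two_pass, var_one_pass, atol=1e-5))
```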
https://github.com/NVIDIA/FasterTransformer/#advanced suggests that `|--/tensorrt_plugin: encapluate FasterTransformer into TensorRT plugin.` is supported, but elsewhere in the README it says `Remove the TensorRT plugin supporting.` So I am confused whether "TensorRT plugin" there means FT kernels packaged as a TRT plugin, or being able to use the TRT-optimized code path from within FT.
Does this make sense at all? Or should TRT always be faster than FT?
Our graph is a regular PyTorch LayerNorm exported to ONNX, so it should be easy to pattern-match. Shouldn't https://github.com/NVIDIA/TensorRT/tree/release/8.6/plugin/skipLayerNormPlugin get matched? This is exactly our case: a residual add followed by a LayerNorm.
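For context, a minimal sketch of the pattern we export (module and file names are just illustrative):

```python
# Residual add followed by LayerNorm -- the pattern skipLayerNormPlugin targets.
import torch
import torch.nn as nn

class AddResidualLayerNorm(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.ln = nn.LayerNorm(hidden)

    def forward(self, x, residual):
        # Add -> LayerNorm, i.e. the "skip LayerNorm" pattern
        return self.ln(x + residual)

m = AddResidualLayerNorm().eval()
x = torch.randn(1, 128, 768)
torch.onnx.export(
    m, (x, x), "add_ln.onnx",
    input_names=["x", "residual"], output_names=["out"],
    opset_version=17,
)
```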
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 19 (1 by maintainers)
We recommend that users use opset >= 17 to ensure LayerNorm fusion.
This is in progress and will gradually be rolled out in the next couple of TRT versions.
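For anyone following along, a quick check (assuming the `add_ln.onnx` file from the sketch above) that an opset-17 export actually produced a single LayerNormalization op:

```python
# With opset >= 17, torch.onnx.export emits a LayerNormalization node;
# with older opsets you get the decomposed ReduceMean/Sub/Pow/Sqrt/Div chain,
# which TRT then has to pattern-match (or fails to).
import onnx

model = onnx.load("add_ln.onnx")  # file name from the earlier sketch
print(sorted({node.op_type for node in model.graph.node}))
# Expected (roughly): ['Add', 'LayerNormalization']
```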
Cool, thanks @zhenhuaw-me 😃 Feel free to close this issue when you roll it out!
The fix will be included in the next release.
Yes, if you see the ForeignNode, that means the LayerNorms have been fused.
The ForeignNode does not support INT8 implicit quantization, so it uses FP16. If you want to use INT8, you will need to insert Q/DQ ops into the network to make it explicitly quantized.
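A rough sketch of one way to get Q/DQ ops into the network, using NVIDIA's pytorch-quantization toolkit (API usage as I understand it; calibration is elided and the model constructor is hypothetical):

```python
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

# Monkey-patch torch.nn layers with quantized equivalents that carry
# fake-quant (Q/DQ) nodes on weights and activations.
quant_modules.initialize()

model = build_my_transformer_block().eval().cuda()  # hypothetical constructor
# ... run calibration here to set amax ranges on the TensorQuantizers ...

# Export fake-quant as ONNX QuantizeLinear/DequantizeLinear pairs.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 128, 768, device="cuda")
torch.onnx.export(model, (dummy,), "model_int8_qdq.onnx", opset_version=17)
# TensorRT then treats the network as explicitly quantized (INT8 wherever
# Q/DQ pairs appear), instead of relying on implicit INT8 calibration.
```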