TensorRT: FasterTransformer plugins for TensorRT | add_residual+LayerNorm not being matched by plugin

We are seeing LayerNorms that are not fully fused in the trex output; e.g., the mean and variance are computed in two passes where a one-pass implementation is possible (see the sketch below).
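For context, a minimal sketch of the difference (plain NumPy, purely illustrative): the two-pass version reads the data twice (mean, then variance), while a one-pass kernel accumulates the sum and sum-of-squares in a single read and derives both statistics from them.

```python
import numpy as np

def stats_two_pass(x):
    # pass 1: mean; pass 2: variance -- reads x twice
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return mean, var

def stats_one_pass(x):
    # single read: var = E[x^2] - E[x]^2, as a fused one-pass kernel would compute
    n = x.shape[-1]
    s = x.sum(axis=-1, keepdims=True)
    sq = (x * x).sum(axis=-1, keepdims=True)
    mean = s / n
    return mean, sq / n - mean * mean
```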

So I wondered: are FasterTransformer's standalone kernels available in the form of TensorRT plugins?

I noticed at https://github.com/nvidia/FasterTransformer/blob/dev/v5.0_beta/docs/bert_guide.md#standard-bert-and-effective-fastertransformer that FT can use TRT kernels for MHA; I'm wondering whether, in the other direction, TRT can use FT's add_bias_input_layernorm?
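For reference, here is my understanding of what that kernel fuses, inferred from its name (a PyTorch sketch of the semantics, not FT's actual implementation):

```python
import torch
import torch.nn.functional as F

def add_bias_input_layernorm(x, residual, bias, gamma, beta, eps=1e-5):
    # FT runs this as one fused kernel; an unfused graph runs the
    # residual add, bias add, and LayerNorm as separate ops.
    return F.layer_norm(x + residual + bias, (x.shape[-1],), gamma, beta, eps)
```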

https://github.com/NVIDIA/FasterTransformer/#advanced suggests that `|--/tensorrt_plugin: encapsulate FasterTransformer into TensorRT plugin` is supported, but elsewhere the README says `Remove the TensorRT plugin supporting`. So I am confused whether "TensorRT plugin" here means FT kernels packaged as a TRT plugin, or FT being able to use a TRT-optimized code path.

Does this make sense at all? Or should TRT always be faster than FT?

Our graph is a regular PyTorch LayerNorm exported to ONNX, so it should be easy to pattern-match. Shouldn't https://github.com/NVIDIA/TensorRT/tree/release/8.6/plugin/skipLayerNormPlugin get matched? This is exactly our case: a residual add + LayerNorm (minimal repro below).
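Minimal repro of the pattern (illustrative module; the hidden size is arbitrary):

```python
import torch
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    """Residual add followed by LayerNorm -- the skip + LayerNorm pattern."""
    def __init__(self, hidden=768):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x, residual):
        return self.norm(x + residual)
```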


About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 19 (1 by maintainers)

Most upvoted comments

We recommend using opset >= 17 to ensure LayerNorm fusion.
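For example, a sketch of exporting at opset 17 and verifying that LayerNorm comes out as a single `LayerNormalization` node rather than a ReduceMean/Sub/Pow/Sqrt/Div subgraph (reusing the illustrative `ResidualLayerNorm` module from the question above):

```python
import onnx
import torch

model = ResidualLayerNorm().eval()
x = torch.randn(1, 16, 768)
torch.onnx.export(model, (x, x), "skip_ln.onnx", opset_version=17,
                  input_names=["x", "residual"])

# At opset >= 17, LayerNorm should export as one LayerNormalization op.
ops = {node.op_type for node in onnx.load("skip_ln.onnx").graph.node}
print("LayerNormalization" in ops)  # expect: True
```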

Also, it would be nice if the ForeignNode included all the node names that it fused together.

This is in progress and will gradually be rolled out in the next couple of TRT versions.

Cool, thanks @zhenhuaw-me 😃 Feel free to close this issue when you roll it out!

The fix will be included in the next release.

Yes, if you see the ForeignNode, that means the LayerNorms have been fused.
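One way to check from Python (a sketch using TensorRT's engine-inspector API; `model.engine` is a placeholder path, and the engine should be built with detailed profiling verbosity to expose full layer names):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
info = inspector.get_engine_information(trt.LayerInformationFormat.ONELINE)
# Fused Myelin regions appear as layers named like "{ForeignNode[...]}".
print("ForeignNode" in info)
```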

The ForeignNode does not support INT8 implicit quantization, so it uses FP16. If you want to use INT8, you will need to insert Q/DQ ops into the network to make it explicitly quantized.
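A sketch of one way to get Q/DQ ops into the exported ONNX, via NVIDIA's pytorch-quantization toolkit (the model and calibration data below are placeholders; the flow follows the toolkit's documented usage, so double-check against its docs):

```python
import torch
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_modules.initialize()  # patch nn.Linear/nn.Conv2d etc. with quantized versions
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU()).eval()  # placeholder model

# Calibration: collect activation statistics on representative data.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.enable_calib(); m.disable_quant()
with torch.no_grad():
    model(torch.randn(64, 768))  # placeholder calibration data
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax(); m.enable_quant(); m.disable_calib()

# Export fake-quant as QuantizeLinear/DequantizeLinear nodes, which TensorRT
# reads as an explicitly quantized (INT8) network.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, torch.randn(1, 768), "model_qdq.onnx", opset_version=17)
```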