TensorRT: Significant Floating Point Errors in Container Versions 23.03 to 23.08 (TensorRT 8.6.x) Affecting Specific Models on All GPUs, Including T4 and A100
Description
Reference: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
Up to container version 23.02 (TensorRT 8.5.x), there were no issues running our company’s models. However, from version 23.03 up to 23.08 (TensorRT 8.6.x), we have consistently encountered large floating point errors in specific models.
Specifics:
When the TensorRT model is built with batch_size=1, the error does not occur.
The issue manifests consistently when the TensorRT model is built with batch_size=2 or higher, rendering the model unusable.
The TensorRT models are built from ONNX with the --fp16 flag, and we have verified that the issue is not related to the ONNX opset version.
Given that the error compromises model integrity, immediate attention is requested.
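For context, a minimal sketch of the kind of build under which the error shows up, using the TensorRT Python API with FP16 enabled and a fixed batch size of 2. The ONNX path, input tensor name, and shape below are placeholders, since the actual build commands were not shared in the report.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:          # hypothetical model path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # --fp16 build, as in the report

# The error is reported only when the engine is built for batch size >= 2.
profile = builder.create_optimization_profile()
profile.set_shape("input",                   # hypothetical input name and shape
                  min=(2, 3, 224, 224),
                  opt=(2, 3, 224, 224),
                  max=(2, 3, 224, 224))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open("model_fp16_bs2.engine", "wb") as f:
    f.write(serialized)
```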
Environment
TensorRT Version: all versions of 8.6.x (NGC containers 23.04 through 23.08).
NVIDIA GPU: T4, A100
NVIDIA Driver Version: 535.104.05
CUDA Version: 12.2
CUDNN Version: x
Operating System:
Container (if so, version): NGC containers 23.03 through 23.08. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt
Relevant Files
Can’t include any model files.
Steps To Reproduce
Build a TensorRT engine from an ONNX model with more than 200M parameters, using as large a batch size as possible.
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example, run the ONNX model with ONNX Runtime (polygraphy run <model.onnx> --onnxrt):
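For that last question, a hedged sketch of how such a cross-framework comparison could be driven from the Polygraphy Python API (roughly what `polygraphy run model.onnx --trt --fp16 --onnxrt` does on the command line); the model path and tolerances are placeholders.

```python
# Compare FP16 TensorRT output against an ONNX Runtime baseline with Polygraphy.
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("model.onnx"),            # hypothetical model path
    config=CreateConfig(fp16=True),               # mirrors the --fp16 build
)

runners = [
    TrtRunner(build_engine),                      # TensorRT result under test
    OnnxrtRunner(SessionFromOnnx("model.onnx")),  # ONNX Runtime reference
]

# Run both backends on the same generated inputs and compare their outputs.
run_results = Comparator.run(runners)
passed = bool(Comparator.compare_accuracy(
    run_results,
    compare_func=CompareFunc.simple(atol=1e-3, rtol=1e-3),  # placeholder tolerances
))
print("Outputs match within tolerance:", passed)
```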
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 19
Yes, LayerNorm is prone to overflow under FP16, so falling it back to FP32 is a good solution. You should be able to see the corresponding warning in the TRT log.
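As a quick illustration (not from the original thread) of why LayerNorm’s variance computation is fragile in FP16: squaring even moderately large activation values already exceeds FP16’s maximum representable value of about 65504.

```python
import numpy as np

x = np.float16(300.0)
print(x * x)                          # inf: 90000 overflows FP16 (max ~65504)
print(np.float32(x) * np.float32(x))  # 90000.0, fine in FP32
```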
Sometimes the differences simply accumulate and are unavoidable. Could you please try falling some layers back to FP32? This can be done by trial and error until you find a good balance between performance and accuracy, as in the sketch below.
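A hedged sketch of such a per-layer fallback with the TensorRT Python API; the keyword list is only a guess at which layers to pin, so inspect your own network’s layer names and adjust.

```python
import tensorrt as trt

def fall_back_to_fp32(network: trt.INetworkDefinition,
                      config: trt.IBuilderConfig,
                      name_keywords=("LayerNorm", "ReduceMean")):  # hypothetical keywords
    """Keep FP16 overall, but pin layers whose names match the keywords to FP32."""
    config.set_flag(trt.BuilderFlag.FP16)
    # Make the builder honor the per-layer precision requests set below.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if any(key in layer.name for key in name_keywords):
            layer.precision = trt.float32
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.float32)
```

Rebuilding the engine and re-measuring accuracy after each change to the keyword list is the try-and-test loop described above.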
Checked the ONNX model with Polygraphy; the outputs look fully matched.
Checked the reproduction you provided. Is it possible for you to provide the ONNX model? I want to confirm whether this issue comes from TRT or from Triton; in the latter case you would have to seek help from the Triton developers.
Send the private link here and I’ll request access 😃