TensorRT: Incorrect number of network outputs detected during INT8 calibration, leading to failure

Description

When trying to convert a PyTorch model to a TensorRT engine, int8 calibration fails with:

brett@brett-home:~/Work/Autosensor/NN$ python -- trt_builder.py "saved/RPN_ThunderNet2-activation:BN-ReLU-classes:2-input:3x512x896-complexity:0-statnett-0.5-2023-08-01" -q
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/__init__.py:1418: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert condition, message
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /tmp/build-via-sdist-9wtz2njt/torch-2.2.0a0+gitfaf3de3/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[09/20/2023-16:57:36] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 541, GPU 4118 (MiB)
[09/20/2023-16:57:39] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1444, GPU +266, now: CPU 2062, GPU 4384 (MiB)
[09/20/2023-16:57:39] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/20/2023-16:57:39] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
[09/20/2023-16:57:39] [TRT] [W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[09/20/2023-16:57:39] [TRT] [I] Graph optimization time: 0.00566967 seconds.
[09/20/2023-16:57:39] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2085, GPU 4392 (MiB)
[09/20/2023-16:57:39] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2086, GPU 4402 (MiB)
[09/20/2023-16:57:39] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[09/20/2023-16:57:42] [TRT] [I] Detected 1 inputs and 6 output network tensors.
[09/20/2023-16:57:43] [TRT] [I] Total Host Persistent Memory: 450704
[09/20/2023-16:57:43] [TRT] [I] Total Device Persistent Memory: 1162240
[09/20/2023-16:57:43] [TRT] [I] Total Scratch Memory: 1605632
[09/20/2023-16:57:43] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 42 MiB
[09/20/2023-16:57:43] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 571 steps to complete.
[09/20/2023-16:57:43] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 180.923ms to assign 69 blocks to 571 nodes requiring 78085120 bytes.
[09/20/2023-16:57:43] [TRT] [I] Total Activation Memory: 78085120
[09/20/2023-16:57:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2435, GPU 4428 (MiB)
[09/20/2023-16:57:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2435, GPU 4438 (MiB)
[09/20/2023-16:57:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2435, GPU 4414 (MiB)
[09/20/2023-16:57:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2435, GPU 4422 (MiB)
[09/20/2023-16:57:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +75, now: CPU 0, GPU 91 (MiB)
[09/20/2023-16:57:43] [TRT] [I] Starting Calibration.
[09/20/2023-16:57:43] [TRT] [E] 1: [softMaxV2Runner.cpp::execute::226] Error Code 1: Cask (shader run failed)
[09/20/2023-16:57:43] [TRT] [E] 3: [engine.cpp::~Engine::298] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/engine.cpp::~Engine::298, condition: mExecutionContextCounter.use_count() == 1. Destroying an engine object before destroying the IExecutionContext objects it created leads to undefined behavior.
)
[09/20/2023-16:57:43] [TRT] [E] 2: [calibrator.cpp::calibrateEngine::1181] Error Code 2: Internal Error (Assertion context->executeV2(&bindings[0]) failed. )

which I suspect is a result of the wrong number of output tensors being detected for the network: the log above reports 6 output network tensors, while the model has only 4 outputs.

Inspecting the ONNX model that the engine is being built from with polygraphy correctly shows 1 input and 4 outputs:

brett@brett-home:~/Work/Autosensor/NN$ polygraphy inspect model /tmp/model-with-shapes.onnx 
[I] Loading model: /tmp/model-with-shapes.onnx
[I] ==== ONNX Model ====
    Name: main_graph | ONNX Opset: 17
    
    ---- 1 Graph Input(s) ----
    {input [dtype=float32, shape=(1, 3, 512, 896)]}
    
    ---- 4 Graph Output(s) ----
    {scores [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 2)],
     boxes [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 4)],
     roi [dtype=float32, shape=('Min(200, NonMaxSuppression_940_o0__d0)', 5)],
     count [dtype=int64, shape=()]}
    
    ---- 163 Initializer(s) ----
    
    ---- 999 Node(s) ----

The model is proprietary, so I can’t share it, and I don’t currently have a minimal repro model. However, I can share the builder script (trt_builder.py), which is based on https://github.com/NVIDIA-AI-IOT/jetson_dla_tutorial#step-7. To verify that the DatasetCalibrator class works, I’ve also checked that calibration with the script succeeds on a toy model (with just a single convolution layer).

When building the TensorRT engine without quantization (i.e., without int8 calibration), the correct number of network outputs is detected and the engine is built successfully:

brett@brett-home:~/Work/Autosensor/NN$ python -- trt_builder.py "saved/RPN_ThunderNet2-activation:BN-ReLU-classes:2-input:3x512x896-complexity:0-statnett-0.5-2023-08-01"
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/__init__.py:1418: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert condition, message
/home/brett/Work/Autosensor/.direnv/python-venv-3.11.5/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /tmp/build-via-sdist-9wtz2njt/torch-2.2.0a0+gitfaf3de3/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[09/20/2023-16:59:23] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 537, GPU 4117 (MiB)
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1445, GPU +266, now: CPU 2058, GPU 4383 (MiB)
[09/20/2023-16:59:26] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/20/2023-16:59:26] [TRT] [W] Tensor DataType is determined at build time for tensors not marked as input or output.
[09/20/2023-16:59:26] [TRT] [W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[09/20/2023-16:59:26] [TRT] [I] Graph optimization time: 0.0266299 seconds.
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2080, GPU 4391 (MiB)
[09/20/2023-16:59:26] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2080, GPU 4401 (MiB)
[09/20/2023-16:59:26] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/20/2023-17:00:37] [TRT] [I] Detected 1 inputs and 4 output network tensors.
[09/20/2023-17:00:37] [TRT] [I] Total Host Persistent Memory: 453328
[09/20/2023-17:00:37] [TRT] [I] Total Device Persistent Memory: 18432
[09/20/2023-17:00:37] [TRT] [I] Total Scratch Memory: 1205248
[09/20/2023-17:00:37] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 7 MiB, GPU 105 MiB
[09/20/2023-17:00:37] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 178 steps to complete.
[09/20/2023-17:00:37] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 4.81692ms to assign 10 blocks to 178 nodes requiring 27747328 bytes.
[09/20/2023-17:00:37] [TRT] [I] Total Activation Memory: 27747328
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2330, GPU 4427 (MiB)
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2331, GPU 4437 (MiB)
[09/20/2023-17:00:37] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[09/20/2023-17:00:37] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[09/20/2023-17:00:37] [TRT] [W] Check verbose logs for the list of affected weights.
[09/20/2023-17:00:37] [TRT] [W] - 61 weights are affected by this issue: Detected subnormal FP16 values.
[09/20/2023-17:00:37] [TRT] [W] - 3 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[09/20/2023-17:00:37] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +6, GPU +10, now: CPU 6, GPU 10 (MiB)

and polygraphy gives sensible output for the resulting engine:

brett@brett-home:~/Work/Autosensor/NN$ polygraphy inspect model /tmp/model.engine 
[I] Loading bytes from /tmp/model.engine
[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine
    
    ---- 1 Engine Input(s) ----
    {input [dtype=float32, shape=(1, 3, 512, 896)]}
    
    ---- 4 Engine Output(s) ----
    {roi [dtype=float32, shape=(-1, 5)],
     scores [dtype=float32, shape=(-1, 2)],
     boxes [dtype=float32, shape=(-1, 4)],
     count [dtype=int32, shape=()]}
    
    ---- Memory ----
    Device Memory: 27747328 bytes
    
    ---- 1 Profile(s) (5 Tensor(s) Each) ----
    - Profile: 0
        Tensor: input           (Input), Index: 0 | Shapes: min=(1, 3, 512, 896), opt=(1, 3, 512, 896), max=(1, 3, 512, 896)
        Tensor: roi            (Output), Index: 1 | Shape: (-1, 5)
        Tensor: scores         (Output), Index: 2 | Shape: (-1, 2)
        Tensor: boxes          (Output), Index: 3 | Shape: (-1, 4)
        Tensor: count          (Output), Index: 4 | Shape: ()
    
    ---- 215 Layer(s) ----
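
For anyone comparing the two builds, the output counts can be pulled out of the build logs mechanically rather than by eye. The following is only a minimal sketch: the helper name `detected_io_counts` is hypothetical, and it assumes the builder keeps printing the summary line in exactly the form seen in the logs above ("Detected N inputs and M output network tensors."). The two sample lines are copied from those logs.

```python
import re

# Matches the builder's summary line as it appears in the logs in this report.
# Assumption: the wording of this line is stable across the builds being compared.
_DETECTED = re.compile(r"Detected (\d+) inputs and (\d+) output network tensors")


def detected_io_counts(log_text):
    """Return one (inputs, outputs) pair per matching line in log_text."""
    return [(int(n_in), int(n_out)) for n_in, n_out in _DETECTED.findall(log_text)]


# Lines copied from the int8 build log and the non-quantized build log above.
int8_log = "[09/20/2023-16:57:42] [TRT] [I] Detected 1 inputs and 6 output network tensors."
plain_log = "[09/20/2023-17:00:37] [TRT] [I] Detected 1 inputs and 4 output network tensors."

print(detected_io_counts(int8_log))   # [(1, 6)]
print(detected_io_counts(plain_log))  # [(1, 4)]
```

In practice the full captured output of each trt_builder.py run would be passed in instead of a single line; the helper returns an empty list if the summary line is absent.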

Environment

Output of PyTorch’s collect_env.py script, modified to include tensorrt in the pip packages:

brett@brett-home:~/Work/Autosensor/NN$ python collect_env.py 
Collecting environment information...
PyTorch version: 2.2.0a0
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A

OS: Ubuntu Mantic Minotaur (development branch) (x86_64)
GCC version: (Ubuntu 13.2.0-4ubuntu1) 13.2.0
Clang version: 16.0.6 (15)
CMake version: version 3.27.4
Libc version: glibc-2.38

Python version: 3.11.5 (main, Aug 29 2023, 15:31:31) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.5.1-060501-generic-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i9-13900K
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           1
CPU(s) scaling MHz:                 83%
CPU max MHz:                        5800.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5990.40
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          896 KiB (24 instances)
L1i cache:                          1.3 MiB (24 instances)
L2 cache:                           32 MiB (12 instances)
L3 cache:                           36 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] numpy-quaternion==2022.4.3
[pip3] tensorrt==8.6.1.post1
[pip3] tensorrt-bindings==8.6.1
[pip3] tensorrt-libs==8.6.1
[pip3] torch==2.2.0a0+a683bc5
[pip3] torchaudio==2.0.2
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.17.0a0+4cb3d80
[pip3] triton==2.0.0
[conda] Could not collect

About this issue

State: closed
Created 9 months ago
Comments: 16

Most upvoted comments

Thanks, I’ve filed internal bug 4340507 to track this. Sorry about the delayed response, I’ve been quite busy with other things these days 😃

zerollzeng on Oct 20, 2023