TensorRT: TensorRT 7.1 produces incorrect results when batch size > 1

Description

Environment

TensorRT Version: 7.1.3.0
GPU Type: Jetson TX2 (Jetpack 4.4)
Nvidia Driver Version:
CUDA Version: 10.1
CUDNN Version:
Operating System + Version: (LTS from Jetpack 4.4)
Python Version (if applicable): 3.6.9
PyTorch Version (if applicable): 1.6.0

Relevant Files

|----rfb.onnx
|----trt_utils.py
|----produce_bug.py
|____________________

These files can be downloaded from Google Drive.

produce_bug.py is reproduced below:

import os
import sys
import pickle
import argparse
import numpy as np
from pprint import pprint
import pdb
import torch
import torch.backends.cudnn as cudnn
cudnn.benchmark = True
import pycuda.autoinit

    
import tensorrt as trt
import trt_utils as tu  # helper module from the Relevant Files list (trt_utils.py)


TRT_LOGGER = trt.Logger()

def test_bs_1():
    #engine, context,  h_input, d_input, h_output, d_output, stream = onnx_2_tensorrt.main()
    onnx_file_path = 'rfb.onnx'
    fp16_mode = False#args.fp16
    int8_mode = False#args.int8
    batchsize =1 
    engine_file_path = "rfb_fp16_{}_int8_{}_bs_{}.trt".format(fp16_mode,int8_mode,batchsize)

    print("Building Engine")
  
    engine = tu.get_engine(batchsize,onnx_file_path,engine_file_path,fp16_mode=fp16_mode,int8_mode=int8_mode)
    #pdb.set_trace()
    context = engine.create_execution_context() 
    inputs, outputs, bindings, stream = tu.allocate_buffers(engine) # input, output: host # bindings 


    trt_outputs = []

    #img = np.random.randn(1,3,512,512) + 100
    #img = img.astype(np.uint8)
    global img
    x = img 
        
    inputs[0].host = x.reshape(-1)
    trt_outputs = tu.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        # 
    locs, scores = tu.process_outputs(trt_outputs,batchsize, 6)  # The locs is corr
    #print(np.mean(locs))
    #print(np.mean(scores))
    pdb.set_trace()


def test_bs_2():
    onnx_file_path = 'rfb.onnx'
    fp16_mode = False#args.fp16
    int8_mode = False#args.int8
    batchsize = 2 
    engine_file_path = "rfb_fp16_{}_int8_{}_bs_{}.trt".format(fp16_mode,int8_mode,batchsize)

    print("Building Engine")
  
    engine = tu.get_engine(batchsize,onnx_file_path,engine_file_path,fp16_mode=fp16_mode,int8_mode=int8_mode)
    #pdb.set_trace()
    context = engine.create_execution_context() 
    inputs, outputs, bindings, stream = tu.allocate_buffers(engine) # input, output: host # bindings 

    trt_outputs = []

    #img = np.random.randn(1,3,512,512) + 100
    #img = img.astype(np.uint8)

    global img
    x = np.concatenate([img,img]) 
        
    inputs[0].host = x.reshape(-1)
    trt_outputs = tu.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        # 
    locs, scores = tu.process_outputs(trt_outputs, batchsize, 6)  # The locs is c       
    #print(np.mean(locs))
    #print(np.mean(scores))
    pdb.set_trace()

if __name__ == '__main__':
    img = np.random.randn(1,3,512,512) + 100
    img = img.astype(np.uint8)
    test_bs_1()
    test_bs_2()

Steps To Reproduce

python produce_bug.py

produce_bug.py contains two test functions.

In test_bs_1(), the code builds an engine whose maxBatchSize is 1. I generate a random input img and feed it as x = img (batch size = 1).

The shape of locs is [1, 32756, 4] and the mean of locs is -7.8930e+18 (non-zero).

In test_bs_2(), the code builds an engine whose maxBatchSize is 2, and the input is img concatenated with itself: x = np.concatenate([img, img]) (batch size = 2).

If the engine worked correctly, the mean of locs would be the same as in test_bs_1(), because both batch entries are the same image.

In the output, locs contains many zero values. The first entry in the batch, locs[0, :, :], looks normal, but the second entry, locs[1, :, :], is all zeros. The other output, boxes, also contains many zeros when batch size > 1.

Therefore, the TensorRT Engine does not work properly.

If I run the same code with TensorRT 6.0, locs[1, :, :] is normal (non-zero).
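
Because test_bs_2() feeds the same image twice, one way to make the symptom concrete is to compare every batch entry against the first one. The following is a minimal sketch, assuming locs is a NumPy array of shape (batch, num_priors, 4) as returned by tu.process_outputs. The helper name summarize_batch_outputs and the tolerance are illustrative and are not part of trt_utils.py.

```python
import numpy as np


def summarize_batch_outputs(locs, atol=1e-4):
    """Report, per batch entry, the zero fraction and whether it matches entry 0.

    Assumes `locs` has shape (batch, num_priors, 4) and that every batch entry
    was produced from the same input image, as in test_bs_2().
    """
    locs = np.asarray(locs)
    report = []
    for i in range(locs.shape[0]):
        entry = locs[i]
        report.append({
            "index": i,
            "zero_fraction": float(np.mean(entry == 0)),
            "matches_first": bool(np.allclose(entry, locs[0], atol=atol)),
        })
    return report


if __name__ == "__main__":
    # Synthetic stand-in for the reported behavior: entry 1 is all zeros.
    rng = np.random.default_rng(0)
    good = rng.standard_normal((32756, 4)).astype(np.float32)
    broken = np.stack([good, np.zeros_like(good)])
    for row in summarize_batch_outputs(broken):
        print(row)
```

With a healthy engine and identical inputs, every entry would be expected to report matches_first as True. An entry with zero_fraction close to 1.0 corresponds to the all-zero locs[1, :, :] described above.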

This problem is very urgent for me. Please help! I would appreciate any suggestions.

About this issue

State: closed
Created 4 years ago
Comments: 22

Most upvoted comments

Hi @RizhaoCai, I have the same problem when running with batch_size = 2.

Inputs: an array of images, set as inputs[0].host = images batch

Output: a 1-dimensional array. After reshaping to (batch, , ), the results for the first batch entry are correct, but those for the following entries are incorrect. Do you have a solution?

@NgTuong Same issue. I have tried many methods, but all of them failed.

RizhaoCai on Nov 22, 2020

@ttyio I have a question about batch size (sorry for the mention). This is not a case that uses dynamic batch size.

My question is: when my original ONNX model has batch size = 1, is there a problem with creating a TRT engine with batchsize=2?

Is the only acceptable scenario the one where onnx.batch_size == trt.batch_size?

That is my scenario, and TRT gives wrong output.

Hi @dongyi-kim, yes, there will be a problem. You have to export an ONNX model with batch size = -1 or 2 if you want to create a TRT engine with batch size 2. Thanks!

ttyio on Nov 23, 2021
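
For a PyTorch model, the advice about exporting with batch size -1 can be sketched as follows. This is a minimal sketch, assuming the model is exported with torch.onnx.export and its dynamic_axes argument. The tensor names "input", "locs" and "scores" and both helper names are placeholders chosen for illustration, not names taken from rfb.onnx.

```python
def batch_dynamic_axes(input_names, output_names, axis_name="batch"):
    """Build a dynamic_axes mapping that marks dimension 0 of every tensor as dynamic."""
    return {name: {0: axis_name} for name in list(input_names) + list(output_names)}


def export_with_dynamic_batch(model, dummy_input, onnx_path,
                              input_names=("input",),
                              output_names=("locs", "scores")):
    """Export `model` to ONNX with a dynamic batch dimension.

    The tensor names are placeholders; use the names your model actually has.
    """
    import torch  # imported lazily so the helper above works without PyTorch

    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=list(input_names),
        output_names=list(output_names),
        dynamic_axes=batch_dynamic_axes(input_names, output_names),
        opset_version=11,
    )
```

The fixed-size alternative mentioned in the reply is to export with a dummy input whose first dimension is already 2 and to leave dynamic_axes out, which yields an ONNX model with batch size 2.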

@NgTuong BTW, besides reading the code, you can also run trtexec, following https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec#example-4-running-an-onnx-model-with-full-dimensions-and-dynamic-shapes, to first verify that batchsize = 2 works when using trtexec.

ttyio on Nov 30, 2020
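
The trtexec check suggested above could be assembled along these lines. This is a sketch, assuming the ONNX model has been re-exported with a dynamic batch dimension and that its input tensor is called "input" (a placeholder, since the real name in rfb.onnx is not given in the thread). The flags --onnx, --minShapes, --optShapes, --maxShapes and --shapes are the ones used in the linked trtexec example. The helper names are illustrative.

```python
import shlex


def shape_arg(tensor_name, dims):
    """Format a shape the way trtexec expects it, e.g. input:2x3x512x512."""
    return "{}:{}".format(tensor_name, "x".join(str(d) for d in dims))


def trtexec_command(onnx_path, tensor_name, min_dims, opt_dims, max_dims, run_dims):
    """Build a trtexec command line for an ONNX model with dynamic shapes."""
    return [
        "trtexec",
        "--onnx={}".format(onnx_path),
        "--minShapes={}".format(shape_arg(tensor_name, min_dims)),
        "--optShapes={}".format(shape_arg(tensor_name, opt_dims)),
        "--maxShapes={}".format(shape_arg(tensor_name, max_dims)),
        "--shapes={}".format(shape_arg(tensor_name, run_dims)),
    ]


if __name__ == "__main__":
    cmd = trtexec_command(
        "rfb.onnx", "input",          # "input" is a placeholder tensor name
        min_dims=(1, 3, 512, 512),
        opt_dims=(2, 3, 512, 512),
        max_dims=(2, 3, 512, 512),
        run_dims=(2, 3, 512, 512),
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # To execute it on a machine with TensorRT installed:
    # import subprocess; subprocess.run(cmd, check=True)
```

If trtexec runs cleanly at batch size 2, the remaining suspects are the engine-building and buffer-handling code in the Python script.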