TensorRT: trtexec failure with multiple streams (TensorRT 8.0.1): "mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()"

Description

I’m trying to run benchmarks with trtexec on TensorRT 8.0.1, and I receive the following error whenever I set more than one stream.

Command: trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2

Error: Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()

I can email the model files if needed.

Environment

TensorRT Version: 8.0.1-1+cuda11.3
NVIDIA GPU: NVIDIA T4
NVIDIA Driver Version: 450.80.02
CUDA Version: 11.3
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:21.06-py3

Steps To Reproduce

The pipeline converts the model from ONNX to TensorRT and then benchmarks the resulting engine file.

Step 1: Run Docker

docker run --rm -it nvcr.io/nvidia/tensorrt:21.06-py3

Step 2: Upgrade to TensorRT 8.0.1
A. Download from https://developer.nvidia.com/nvidia-tensorrt-8x-download
B. Run installation

dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626_1-1_amd64.deb
apt-get update
apt-get install tensorrt libcudnn8

Step 3: Convert the ONNX model to a TensorRT 8.0.1 engine:

trtexec --onnx=model.onnx --saveEngine=model-fp32.engine \
	--workspace=4096 \
	--minShapes=input_tensor:0:1x300x300x3 \
	--maxShapes=input_tensor:0:32x300x300x3 \
	--optShapes=input_tensor:0:8x300x300x3 \
	--buildOnly

Step 4: Run benchmarking

trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
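For completeness, one workaround I considered (assuming the batch size can be fixed for the benchmark) is to build a static-shape engine instead of a dynamic one. With no dynamic dimensions, the execution contexts do not need to bind optimization profiles, so the per-stream contexts should not hit this check. This is only a sketch, not an officially documented fix:

```
# Build with a single fixed input shape instead of min/opt/max profiles.
trtexec --onnx=model.onnx --saveEngine=model-fp32-static.engine \
	--workspace=4096 \
	--shapes=input_tensor:0:1x300x300x3 \
	--buildOnly

# Benchmark with two streams against the static engine.
trtexec --loadEngine=model-fp32-static.engine --streams=2
```

Of course this gives up the ability to vary the batch size at run time, so it only helps if a fixed-batch benchmark is acceptable.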

Output:

&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:16] [I] === Model Options ===
[07/02/2021-15:05:16] [I] Format: *
[07/02/2021-15:05:16] [I] Model: 
[07/02/2021-15:05:16] [I] Output:
[07/02/2021-15:05:16] [I] === Build Options ===
[07/02/2021-15:05:16] [I] Max batch: explicit
[07/02/2021-15:05:16] [I] Workspace: 16 MiB
[07/02/2021-15:05:16] [I] minTiming: 1
[07/02/2021-15:05:16] [I] avgTiming: 8
[07/02/2021-15:05:16] [I] Precision: FP32
[07/02/2021-15:05:16] [I] Calibration: 
[07/02/2021-15:05:16] [I] Refit: Disabled
[07/02/2021-15:05:16] [I] Sparsity: Disabled
[07/02/2021-15:05:16] [I] Safe mode: Disabled
[07/02/2021-15:05:16] [I] Restricted mode: Disabled
[07/02/2021-15:05:16] [I] Save engine: 
[07/02/2021-15:05:16] [I] Load engine: model-fp32.engine
[07/02/2021-15:05:16] [I] NVTX verbosity: 0
[07/02/2021-15:05:16] [I] Tactic sources: Using default tactic sources
[07/02/2021-15:05:16] [I] timingCacheMode: local
[07/02/2021-15:05:16] [I] timingCacheFile: 
[07/02/2021-15:05:16] [I] Input(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Output(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Input build shape: input_tensor:0=1x300x300x3+1x300x300x3+1x300x300x3
[07/02/2021-15:05:16] [I] Input calibration shapes: model
[07/02/2021-15:05:16] [I] === System Options ===
[07/02/2021-15:05:16] [I] Device: 0
[07/02/2021-15:05:16] [I] DLACore: 
[07/02/2021-15:05:16] [I] Plugins:
[07/02/2021-15:05:16] [I] === Inference Options ===
[07/02/2021-15:05:16] [I] Batch: Explicit
[07/02/2021-15:05:16] [I] Input inference shape: input_tensor:0=1x300x300x3
[07/02/2021-15:05:16] [I] Iterations: 10
[07/02/2021-15:05:16] [I] Duration: 3s (+ 200ms warm up)
[07/02/2021-15:05:16] [I] Sleep time: 0ms
[07/02/2021-15:05:16] [I] Streams: 2
[07/02/2021-15:05:16] [I] ExposeDMA: Disabled
[07/02/2021-15:05:16] [I] Data transfers: Enabled
[07/02/2021-15:05:16] [I] Spin-wait: Disabled
[07/02/2021-15:05:16] [I] Multithreading: Disabled
[07/02/2021-15:05:16] [I] CUDA Graph: Disabled
[07/02/2021-15:05:16] [I] Separate profiling: Disabled
[07/02/2021-15:05:16] [I] Time Deserialize: Disabled
[07/02/2021-15:05:16] [I] Time Refit: Disabled
[07/02/2021-15:05:16] [I] Skip inference: Disabled
[07/02/2021-15:05:16] [I] Inputs:
[07/02/2021-15:05:16] [I] === Reporting Options ===
[07/02/2021-15:05:16] [I] Verbose: Enabled
[07/02/2021-15:05:16] [I] Averages: 10 inferences
[07/02/2021-15:05:16] [I] Percentile: 99
[07/02/2021-15:05:16] [I] Dump refittable layers:Disabled
[07/02/2021-15:05:16] [I] Dump output: Disabled
[07/02/2021-15:05:16] [I] Profile: Disabled
[07/02/2021-15:05:16] [I] Export timing to JSON file: 
[07/02/2021-15:05:16] [I] Export output to JSON file: 
[07/02/2021-15:05:16] [I] Export profile to JSON file: 
[07/02/2021-15:05:16] [I] 
[07/02/2021-15:05:16] [I] === Device Information ===
[07/02/2021-15:05:16] [I] Selected Device: Tesla T4
[07/02/2021-15:05:16] [I] Compute Capability: 7.5
[07/02/2021-15:05:16] [I] SMs: 40
[07/02/2021-15:05:16] [I] Compute Clock Rate: 1.59 GHz
[07/02/2021-15:05:16] [I] Device Global Memory: 15109 MiB
[07/02/2021-15:05:16] [I] Shared Memory per SM: 64 KiB
[07/02/2021-15:05:16] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/02/2021-15:05:16] [I] Memory Clock Rate: 5.001 GHz
[07/02/2021-15:05:16] [I] 
[07/02/2021-15:05:16] [I] TensorRT version: 8001
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Proposal version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Split version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[07/02/2021-15:05:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 355, GPU 250 (MiB)
[07/02/2021-15:05:17] [I] [TRT] Loaded engine size: 19 MB
[07/02/2021-15:05:17] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 355 MiB, GPU 250 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +482, GPU +206, now: CPU 838, GPU 476 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +394, GPU +172, now: CPU 1232, GPU 648 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1232, GPU 630 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Deserialization required 1204936 microseconds.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1232 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [I] Engine loaded in 1.74508 sec.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1212 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1213, GPU 640 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1213, GPU 648 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1219, GPU 1098 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1219, GPU 1108 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1550 MiB
[07/02/2021-15:05:18] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[07/02/2021-15:05:18] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1518 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1058 (MiB)

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15

Most upvoted comments

Hello @gabrielibagon, currently --streams in TRT does not work with dynamic shapes. It is not a TRT limitation; we just have not polished trtexec yet.

I was having the same “Assertion mOptimizationProfile >= 0 failed” problem in TensorRT 8.6 and 9.1 using the TensorRT Python API, with dynamic shapes, multiple streams, multiple CPU threads, and multiple execution contexts. Can we reopen this issue?

I have noticed that in such cases each context apparently has to be assigned a different optimization profile explicitly:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform-inference
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes

I tried calling set_optimization_profile_async (https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html#tensorrt.IExecutionContext.set_optimization_profile_async) after each thread created its execution context, but it then threw another error saying the profile was being used by another context. Finally I tried enabling the preview feature kPROFILE_SHARING_0806, and that solved the problem. (It is not enabled by default in TRT 9.1, as it should be: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/BuilderConfig.html#tensorrt.PreviewFeature)

My question is: what should we do with TensorRT versions prior to 8.6 to use multi-stream and dynamic shapes? @ttyio
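For what it’s worth, the failing assertion is easy to reproduce in miniature: the engine in the original report was built with a single optimization profile, but trtexec creates one execution context per stream, and a context that cannot claim a free profile is left with profile index -1, which trips mOptimizationProfile >= 0. Here is a small illustrative sketch of that bookkeeping (plain Python modeling the rule, not the TensorRT API; the helper names are hypothetical):

```python
def assign_profiles(num_contexts: int, num_profiles: int) -> list:
    """Mimic the pre-8.6 rule (before kPROFILE_SHARING_0806) that each
    concurrently used execution context must own a distinct optimization
    profile. Returns one profile index per context; -1 means unbound."""
    return [ctx if ctx < num_profiles else -1 for ctx in range(num_contexts)]

def set_binding_dimensions_ok(profile_index: int, num_profiles: int) -> bool:
    # The check that fails in executionContext.cpp::setBindingDimensions:
    # mOptimizationProfile >= 0 && mOptimizationProfile < getNbOptimizationProfiles()
    return 0 <= profile_index < num_profiles

# One profile, two streams (hence two contexts): the second context is
# left unbound, so its setBindingDimensions call fails -- the error above.
profiles = assign_profiles(num_contexts=2, num_profiles=1)
print(profiles)                                             # [0, -1]
print([set_binding_dimensions_ok(p, 1) for p in profiles])  # [True, False]
```

The sketch also suggests the fix for pre-8.6 API code: build the engine with at least as many optimization profiles as concurrent contexts, and bind one profile per context.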

Is there an ETA on trtexec support for --streams > 1 with dynamically shaped inputs?