TensorRT: trtexec failure with multiple streams (TensorRT 8.0.1): "mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()"
Description
I’m trying to run benchmarking with TensorRT 8.0.1 using trtexec, and I receive the following error when setting more than one stream.
Command:
trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2
Error:
Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
I can email the model files if needed.
Environment
TensorRT Version: 8.0.1-1+cuda11.3
NVIDIA GPU: NVIDIA T4
NVIDIA Driver Version: 450.80.02
CUDA Version: 11.3
CUDNN Version:
Operating System: Ubuntu 20.04
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:21.06-py3
Steps To Reproduce
The pipeline involves converting from ONNX to TensorRT and then benchmarking the engine file.
Step 1: Run Docker
docker run --rm -it nvcr.io/nvidia/tensorrt:21.06-py3
Step 2: Upgrade to TensorRT 8.0.1
A. Download from https://developer.nvidia.com/nvidia-tensorrt-8x-download
B. Run the installation:
dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626_1-1_amd64.deb
apt-get update
apt-get install tensorrt libcudnn8
Step 3: Convert the ONNX model to a TensorRT 8.0.1 engine:
trtexec --onnx=model.onnx --saveEngine=model-fp32.engine \
--workspace=4096 \
--minShapes=input_tensor:0:1x300x300x3 \
--maxShapes=input_tensor:0:32x300x300x3 \
--optShapes=input_tensor:0:8x300x300x3 \
--buildOnly
Step 4: Run Benchmarking
trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
Output:
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:16] [I] === Model Options ===
[07/02/2021-15:05:16] [I] Format: *
[07/02/2021-15:05:16] [I] Model:
[07/02/2021-15:05:16] [I] Output:
[07/02/2021-15:05:16] [I] === Build Options ===
[07/02/2021-15:05:16] [I] Max batch: explicit
[07/02/2021-15:05:16] [I] Workspace: 16 MiB
[07/02/2021-15:05:16] [I] minTiming: 1
[07/02/2021-15:05:16] [I] avgTiming: 8
[07/02/2021-15:05:16] [I] Precision: FP32
[07/02/2021-15:05:16] [I] Calibration:
[07/02/2021-15:05:16] [I] Refit: Disabled
[07/02/2021-15:05:16] [I] Sparsity: Disabled
[07/02/2021-15:05:16] [I] Safe mode: Disabled
[07/02/2021-15:05:16] [I] Restricted mode: Disabled
[07/02/2021-15:05:16] [I] Save engine:
[07/02/2021-15:05:16] [I] Load engine: model-fp32.engine
[07/02/2021-15:05:16] [I] NVTX verbosity: 0
[07/02/2021-15:05:16] [I] Tactic sources: Using default tactic sources
[07/02/2021-15:05:16] [I] timingCacheMode: local
[07/02/2021-15:05:16] [I] timingCacheFile:
[07/02/2021-15:05:16] [I] Input(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Output(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Input build shape: input_tensor:0=1x300x300x3+1x300x300x3+1x300x300x3
[07/02/2021-15:05:16] [I] Input calibration shapes: model
[07/02/2021-15:05:16] [I] === System Options ===
[07/02/2021-15:05:16] [I] Device: 0
[07/02/2021-15:05:16] [I] DLACore:
[07/02/2021-15:05:16] [I] Plugins:
[07/02/2021-15:05:16] [I] === Inference Options ===
[07/02/2021-15:05:16] [I] Batch: Explicit
[07/02/2021-15:05:16] [I] Input inference shape: input_tensor:0=1x300x300x3
[07/02/2021-15:05:16] [I] Iterations: 10
[07/02/2021-15:05:16] [I] Duration: 3s (+ 200ms warm up)
[07/02/2021-15:05:16] [I] Sleep time: 0ms
[07/02/2021-15:05:16] [I] Streams: 2
[07/02/2021-15:05:16] [I] ExposeDMA: Disabled
[07/02/2021-15:05:16] [I] Data transfers: Enabled
[07/02/2021-15:05:16] [I] Spin-wait: Disabled
[07/02/2021-15:05:16] [I] Multithreading: Disabled
[07/02/2021-15:05:16] [I] CUDA Graph: Disabled
[07/02/2021-15:05:16] [I] Separate profiling: Disabled
[07/02/2021-15:05:16] [I] Time Deserialize: Disabled
[07/02/2021-15:05:16] [I] Time Refit: Disabled
[07/02/2021-15:05:16] [I] Skip inference: Disabled
[07/02/2021-15:05:16] [I] Inputs:
[07/02/2021-15:05:16] [I] === Reporting Options ===
[07/02/2021-15:05:16] [I] Verbose: Enabled
[07/02/2021-15:05:16] [I] Averages: 10 inferences
[07/02/2021-15:05:16] [I] Percentile: 99
[07/02/2021-15:05:16] [I] Dump refittable layers:Disabled
[07/02/2021-15:05:16] [I] Dump output: Disabled
[07/02/2021-15:05:16] [I] Profile: Disabled
[07/02/2021-15:05:16] [I] Export timing to JSON file:
[07/02/2021-15:05:16] [I] Export output to JSON file:
[07/02/2021-15:05:16] [I] Export profile to JSON file:
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] === Device Information ===
[07/02/2021-15:05:16] [I] Selected Device: Tesla T4
[07/02/2021-15:05:16] [I] Compute Capability: 7.5
[07/02/2021-15:05:16] [I] SMs: 40
[07/02/2021-15:05:16] [I] Compute Clock Rate: 1.59 GHz
[07/02/2021-15:05:16] [I] Device Global Memory: 15109 MiB
[07/02/2021-15:05:16] [I] Shared Memory per SM: 64 KiB
[07/02/2021-15:05:16] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/02/2021-15:05:16] [I] Memory Clock Rate: 5.001 GHz
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] TensorRT version: 8001
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Proposal version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Split version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[07/02/2021-15:05:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 355, GPU 250 (MiB)
[07/02/2021-15:05:17] [I] [TRT] Loaded engine size: 19 MB
[07/02/2021-15:05:17] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 355 MiB, GPU 250 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +482, GPU +206, now: CPU 838, GPU 476 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +394, GPU +172, now: CPU 1232, GPU 648 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1232, GPU 630 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Deserialization required 1204936 microseconds.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1232 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [I] Engine loaded in 1.74508 sec.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1212 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1213, GPU 640 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1213, GPU 648 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1219, GPU 1098 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1219, GPU 1108 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1550 MiB
[07/02/2021-15:05:18] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[07/02/2021-15:05:18] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1518 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1058 (MiB)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15
Hello @gabrielibagon, currently --streams in TRT does not work with dynamic shapes. This is not a TRT limitation; we just have not polished trtexec yet.

I was having the same "Assertion mOptimizationProfile >= 0 failed" problem in TensorRT 8.6 and 9.1 using the TensorRT Python API, with dynamic shapes, multiple streams, multiple CPU threads, and multiple execution contexts. Can we reopen this issue?

I have noticed that in such cases, it seems each context must manually be assigned a different optimization profile:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform-inference
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes

I tried calling set_optimization_profile_async (https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/ExecutionContext.html#tensorrt.IExecutionContext.set_optimization_profile_async) after each thread created its execution context, but it then threw another error saying that the profile was being used by another context. Finally, I tried setting the preview feature kPROFILE_SHARING_0806, which solved the problem. (It is not enabled by default in TRT 9.1, even though it should be: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Core/BuilderConfig.html#tensorrt.PreviewFeature)

My question is: what should we do with TensorRT versions prior to 8.6 to use multi-stream and dynamic shapes? @ttyio
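The per-context profile assignment described in the comment above can be sketched with the TensorRT Python API. This is a minimal illustration, assuming an engine that was built with one optimization profile per planned context and a list of CUDA stream handles (plain integers); make_contexts and check_profile_assignment are hypothetical helper names, not part of TensorRT.

```python
# Sketch: bind each execution context to its own optimization profile.
# On TRT < 8.6 this is required; on TRT >= 8.6 the PROFILE_SHARING_0806
# preview feature lets contexts share profile 0 instead.

def check_profile_assignment(assignments, num_profiles):
    """Pure helper: each context must use a distinct, valid profile index."""
    if len(set(assignments)) != len(assignments):
        raise ValueError("two contexts share an optimization profile")
    if any(i < 0 or i >= num_profiles for i in assignments):
        raise ValueError("profile index out of range")
    return True

def make_contexts(engine, stream_handles):
    """Requires the `tensorrt` package and a GPU; illustrative only."""
    indices = list(range(len(stream_handles)))
    check_profile_assignment(indices, engine.num_optimization_profiles)
    contexts = []
    for i, stream in zip(indices, stream_handles):
        ctx = engine.create_execution_context()
        # Each context claims a distinct profile; reusing one across
        # contexts triggers the "profile is being used by another
        # context" error the commenter saw.
        ctx.set_optimization_profile_async(i, stream)
        contexts.append(ctx)
    return contexts
```

On TRT 8.6+, the alternative the commenter found is enabling profile sharing at build time, e.g. config.set_preview_feature(trt.PreviewFeature.PROFILE_SHARING_0806, True), after which all contexts can use profile 0.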
Is there an ETA on trtexec support for --streams > 1 with dynamic-shaped inputs?
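Until trtexec supports this, one workaround with dynamic shapes (on any TRT version) is to build the engine with one optimization profile per planned execution context, so each context can claim its own. A hedged sketch with the TensorRT builder Python API, reusing the input name and shapes from this issue; build_engine_with_profiles and profile_for_context are illustrative names, not an official recipe.

```python
# Sketch: build an engine carrying N identical dynamic-shape profiles,
# one for each execution context we intend to create later.

def profile_for_context(context_index, num_profiles):
    """Pure helper: pick a distinct profile per context."""
    if context_index >= num_profiles:
        raise ValueError("need one optimization profile per context")
    return context_index

def build_engine_with_profiles(onnx_path, num_contexts):
    """Requires the `tensorrt` package and a GPU; illustrative only."""
    import tensorrt as trt
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())
    config = builder.create_builder_config()
    # One identical profile per planned context, using the shapes
    # from the trtexec command in this issue.
    for _ in range(num_contexts):
        profile = builder.create_optimization_profile()
        profile.set_shape("input_tensor:0",
                          (1, 300, 300, 3),   # min
                          (8, 300, 300, 3),   # opt
                          (32, 300, 300, 3))  # max
        config.add_optimization_profile(profile)
    return builder.build_serialized_network(network, config)
```

Context i then selects profile profile_for_context(i, num_contexts) via set_optimization_profile_async before setting its binding dimensions.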