TensorRT: 🐛 [Bug] Encountered bug when using Torch-TensorRT with TorchScript model Conformer Transducer
Bug Description
I get an error when converting a Conformer Transducer encoder to TensorRT (ASR task).
To Reproduce
CODE:
import nemo.collections.asr as nemo_asr
import torch
import torch_tensorrt as torchtrt

nemo_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="stt_en_conformer_transducer_large")
nemo_model.freeze()
nemo_model.export(output="temp_rnnt.ts", check_trace=True)

with torchtrt.logging.debug():
    variant = "encoder-temp_rnnt.ts"
    precisions = [torch.float, torch.half]
    batch_size = 1
    model = torch.jit.load(variant)
    inputs = [
        torchtrt.Input(shape=[batch_size, 80, 8269]),  # 8269 from mel spectr for 1min wav with resample
        torchtrt.Input(shape=[1]),
    ]
    for precision in precisions:
        compile_settings = {
            "inputs": inputs,
            "enabled_precisions": {precision},
            "workspace_size": 2000000000,
            "truncate_long_and_double": True,
        }
        print(f"Generating Torchscript-TensorRT module for batchsize {batch_size} precision {precision}")
        trt_ts_module = torchtrt.compile(model, **compile_settings)
        torch.jit.save(trt_ts_module, f"{variant.replace('.ts','')}_bs{batch_size}_{precision}.torch-tensorrt")
CONSOLE:
Generating Torchscript-TensorRT module for batchsize 1 precision torch.float32
WARNING: [Torch-TensorRT] - Data types for input tensors have been modified by inserting aten::to operations which cast INT64 inputs to INT32. To disable this, please recompile using INT32 inputs
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT TorchScript Conversion Context] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Trying to record the value lengths1.1 with the ITensor (Unnamed Layer* 13) [Unary]_output again.
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Truncating aten::to output type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Trying to record the value padding_length.1 with the ITensor (Unnamed Layer* 26) [Identity]_output again.
WARNING: [Torch-TensorRT] - Truncating aten::to output type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Trying to record the value 28 with the ITensor (Unnamed Layer* 26) [Identity]_output again.
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Unable to process input type of at::kLong, truncate type to at::kInt in scalar_to_tensor_util
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT TorchScript Conversion Context] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
Segmentation fault (core dumped)
Expected behavior
I'm expecting a TensorRT file as the output.
Environment
- Torch-TensorRT Version (e.g. 1.0.0): 1.4.0
- PyTorch Version (e.g. 1.0): 2.0.1+cu118
- CPU Architecture: AMD EPYC 7763 64-Core Processor
- OS (e.g., Linux): Ubuntu 20.04.5 LTS
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Python version: 3.8.10
- CUDA version: release 11.8, V11.8.89
- GPU models and configuration: NVIDIA A100 80GB
- image: nvcr.io/nvidia/tensorrt:22.12-py3
Additional context
I want to export the encoder and decoder of the Conformer Transducer model from TorchScript to TensorRT.
About this issue
- State: open
- Created a year ago
- Comments: 16
Yes, this is very helpful, thank you. It looks like we are missing the torch.ops.aten.glu.default operator here, which is causing some of the segmentation. It is possible that the encoder/decoder separation is contributing, but I also think the converter support is important to reduce the number of TRT engines generated. I have filed a converter request here: #2663, for this operator.
After further investigation on this issue, we may be able to compile this model via the ir="dynamo" path, which also allows model saving and loading. Currently, we will need #2195, and possibly #2249, to fully compile, save, and load this model. Additionally, I used a wrapper class to ensure the inputs are a list of tensors instead of named arguments. I will follow up on this issue again as these PRs and improvements are merged.
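The wrapper class mentioned in the comment above is not preserved in this thread. As a rough, hypothetical sketch of the idea only (not the maintainer's code): a thin torch.nn.Module can accept the tensors positionally and pass them on to the wrapped module by keyword. The names PositionalInputWrapper and DummyEncoder are made up for illustration, and the keyword names audio_signal and length are an assumption about the encoder's forward signature.

```python
import torch


class PositionalInputWrapper(torch.nn.Module):
    """Hypothetical wrapper: takes tensors positionally, forwards them by keyword."""

    def __init__(self, encoder: torch.nn.Module):
        super().__init__()
        self.encoder = encoder

    def forward(self, audio_signal: torch.Tensor, length: torch.Tensor):
        # Assumed keyword names; adjust to the wrapped module's real signature.
        return self.encoder(audio_signal=audio_signal, length=length)


class DummyEncoder(torch.nn.Module):
    """Stand-in for the real encoder so the sketch runs without NeMo."""

    def forward(self, audio_signal: torch.Tensor, length: torch.Tensor):
        # Average over the 80 feature bins; pass the length through unchanged.
        return audio_signal.mean(dim=1), length


wrapped = PositionalInputWrapper(DummyEncoder())
features = torch.randn(1, 80, 100)   # (batch, mel bins, frames)
lengths = torch.tensor([100])
encoded, encoded_len = wrapped(features, lengths)
```

With a wrapper of this shape, the compile call can be given a plain list of input tensors in the same order as the forward arguments.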
Regarding the TorchScript path, the bug occurs on this line, where the shapes of matrix_ac and matrix_bd disagree. Specifically, the issue is that this line attempts to drop extra elements in matrix_bd to match that of matrix_ac, but matrix_bd already has fewer elements than matrix_ac, so the truncation has no effect. This behavior is not reflected in Torch, however, so the issue is likely not with the NeMo code or Torch.
The issue does not seem to be with the Conformer architecture itself, since inference in plain PyTorch is working, and it is scripting to TorchScript successfully. There is a possibility that PyTorch --> ONNX --> TensorRT might work, yes. I have verified that with #2228 and #2234, we are able to compile this model with torchtrt.compile(model, ir="torch_compile", ...). I am still investigating the TorchScript path for the model.
When tracing this model with the torch_compile IR option on main, we encounter the errors from #2183 and #2227, for which a fix is in progress. I will post an update on this model once that fix is ready. In the meantime, I am looking further into the TorchScript broadcasting issue.
I am able to reproduce this error in the TorchScript path on the latest main and with NeMo toolkit 1.20.0. It seems to stem from tensor addition operators which are not broadcastable (looks like [1, 4, 6204, 6204] and [1, 4, 6204, 4135] are being added). I'm not yet sure what is causing this mismatch, but it seems to be either a converter or lowering pass.
@gs-olive can you take a look at this NeMo model?
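The shape mismatch discussed in the comments above can be checked by hand. The sketch below is illustrative only and works on plain shape lists, with no Torch or TensorRT involved. The helper names truncate_last_dim and broadcastable are hypothetical, and the truncation is modelled as a slice of the last dimension, which is an assumption about what the referenced line does. It shows that slicing the smaller operand to the size of the larger one changes nothing, so the addition stays non-broadcastable.

```python
def truncate_last_dim(shape, limit):
    """Shape after slicing x[..., :limit]; a slice can shrink a dim but never grow it."""
    return shape[:-1] + [min(shape[-1], limit)]


def broadcastable(shape_a, shape_b):
    """Right-aligned dims must be equal, or one of them must be 1."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


# Shapes reported in the thread for the failing aten::add.
matrix_ac = [1, 4, 6204, 6204]
matrix_bd = [1, 4, 6204, 4135]

# Truncating matrix_bd to matrix_ac's last dim is a no-op: 4135 < 6204.
truncated_bd = truncate_last_dim(matrix_bd, matrix_ac[-1])
print(truncated_bd)
print(broadcastable(matrix_ac, truncated_bd))
```

The truncation only helps when matrix_bd is the longer operand; here the last dimensions remain 6204 and 4135, which matches the "broadcast dimensions must be conformable" error in the log.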