TensorRT: 🐛 [Bug] Encountered bug when using Torch-TensorRT with torchscript model Conformer Transducer

Bug Description

I get an error when converting a Conformer Transducer encoder to TensorRT (ASR task).

To Reproduce

requirenments.txt

CODE:

import nemo.collections.asr as nemo_asr
import torch
import torch_tensorrt as torchtrt


nemo_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="stt_en_conformer_transducer_large")
nemo_model.freeze()
nemo_model.export(output="temp_rnnt.ts", check_trace=True)


with torchtrt.logging.debug():
    variant = "encoder-temp_rnnt.ts"
    precisions = [torch.float, torch.half]
    batch_size = 1

    model = torch.jit.load(variant)

    inputs = [
        # 8269 frames: mel spectrogram of a 1-minute resampled WAV file
        torchtrt.Input(shape=[batch_size, 80, 8269]),
        torchtrt.Input(shape=[1]),
    ]

    for precision in precisions:
        compile_settings = {
            "inputs": inputs, 
            "enabled_precisions": {precision},
            "workspace_size": 2000000000,
            "truncate_long_and_double": True,
        }
        print(f"Generating Torchscript-TensorRT module for batchsize {batch_size} precision {precision}")
        trt_ts_module = torchtrt.compile(model, **compile_settings)
        torch.jit.save(trt_ts_module, f"{variant.replace('.ts','')}_bs{batch_size}_{precision}.torch-tensorrt")

CONSOLE:

Generating Torchscript-TensorRT module for batchsize 1 precision torch.float32
WARNING: [Torch-TensorRT] - Data types for input tensors have been modified by inserting aten::to operations which cast INT64 inputs to INT32. To disable this, please recompile using INT32 inputs
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Truncating intermediate graph input type from at::kLong to at::kInt
WARNING: [Torch-TensorRT TorchScript Conversion Context] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Trying to record the value lengths1.1 with the ITensor (Unnamed Layer* 13) [Unary]_output again.
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Truncating aten::to output type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Trying to record the value padding_length.1 with the ITensor (Unnamed Layer* 26) [Identity]_output again.
WARNING: [Torch-TensorRT] - Truncating aten::to output type from at::kLong to at::kInt
WARNING: [Torch-TensorRT] - Trying to record the value 28 with the ITensor (Unnamed Layer* 26) [Identity]_output again.
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
WARNING: [Torch-TensorRT] - Unable to process input type of at::kLong, truncate type to at::kInt in scalar_to_tensor_util 
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Int64 to Int32
WARNING: [Torch-TensorRT] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT TorchScript Conversion Context] - CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
WARNING: [Torch-TensorRT] - Truncating weight (constant in the graph) from Float64 to Float32
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
ERROR: [Torch-TensorRT TorchScript Conversion Context] - 4: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer %103 : Tensor = aten::add(%matrix_ac.1, %matrix_bd0.1, %124) # /usr/local/lib/python3.8/dist-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py:243:0: broadcast dimensions must be conformable)
Segmentation fault (core dumped)

Expected behavior

I expect a compiled TensorRT file as output.

Environment

  • Torch-TensorRT Version (e.g. 1.0.0): 1.4.0
  • PyTorch Version (e.g. 1.0): 2.0.1+cu118
  • CPU Architecture: AMD EPYC 7763 64-Core Processor
  • OS (e.g., Linux): Ubuntu 20.04.5 LTS
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Python version: 3.8.10
  • CUDA version: release 11.8, V11.8.89
  • GPU models and configuration: NVIDIA A100 80GB
  • image: nvcr.io/nvidia/tensorrt:22.12-py3

Additional context

I want to export both the encoder and the decoder of the Conformer Transducer model from TorchScript to TensorRT.

Most upvoted comments

Yes, this is very helpful thank you - it looks like we are missing the torch.ops.aten.glu.default operator here, which is causing some of the segmentation. It is possible that the encoder/decoder separation is contributing, but I also think the converter support is important to reduce the number of TRT engines generated. I have filed a converter request here: #2663, for this operator.
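
For context, here is a minimal sketch of what a gated linear unit computes, written in plain Python rather than with the actual aten.glu kernel: the input is split in half along the chosen dimension, and the first half is multiplied elementwise by the sigmoid of the second half. The function name `glu_1d` is hypothetical and only handles a flat list; it is not part of PyTorch or Torch-TensorRT.

```python
import math


def glu_1d(values):
    """Gated linear unit over a flat list: first_half * sigmoid(second_half)."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of elements along the split axis")
    half = len(values) // 2
    first, gate = values[:half], values[half:]
    # Each output element is the matching first-half value scaled by its gate.
    return [a * (1.0 / (1.0 + math.exp(-b))) for a, b in zip(first, gate)]


# sigmoid(0) = 0.5, so each value in the first half is halved.
print(glu_1d([1.0, 2.0, 0.0, 0.0]))
```

Because the operation reduces to a split, a sigmoid, and a multiply, a converter for it would presumably map onto layers TensorRT already provides, though the actual converter design is tracked in the request linked above.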

After further investigation on this issue, we may be able to compile this model via the ir="dynamo" path, which also allows model saving and loading. Currently, we will need #2195, and possibly #2249 to fully compile, save, and load this model. Additionally, I used a wrapper class to ensure the inputs are a list of tensors instead of named arguments, as below:

import nemo.collections.asr as nemo_asr
import torch


class ModelWrapper(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.nemo_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="stt_en_conformer_transducer_small")
        self.nemo_model.freeze()
        self.nemo_model.eval().cuda()

    def forward(self, x, y):
        # Forward positional tensors to NeMo's keyword-only interface.
        return self.nemo_model(processed_signal=x,
                               processed_signal_length=y)

I will follow up on this issue again as these PRs and improvements are merged.

Regarding the TorchScript path, the bug occurs on this line, where the shapes of matrix_ac and matrix_bd disagree. Specifically, this line attempts to drop extra elements from matrix_bd so that it matches matrix_ac, but matrix_bd already has fewer elements than matrix_ac, so the truncation has no effect. This mismatch does not occur when running in plain Torch, however, so the issue is likely not with the NeMo code or Torch.
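
The no-op truncation can be illustrated with a plain Python slice, which treats an out-of-range end index the same way tensor slicing does: the end is clamped to the actual length. This is only a sketch; the last-axis sizes 4135 and 6204 are assumed from the shapes reported in this thread, and the variable names are illustrative.

```python
# Illustrative last-axis lengths (assumed from the shapes reported in this thread).
ac_len = 6204
bd_row = list(range(4135))  # stands in for one row of matrix_bd

# Slicing to a length larger than the row clamps to the row's own length,
# so nothing is dropped and the sizes still disagree.
truncated = bd_row[:ac_len]

print(len(truncated))            # same as len(bd_row)
print(len(truncated) == ac_len)  # the mismatch remains
```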

The issue does not seem to be with the Conformer architecture itself, since inference in plain PyTorch is working, and it is scripting to TorchScript successfully. There is a possibility that PyTorch --> ONNX --> TensorRT might work, yes. I have verified that with #2228 and #2234, we are able to compile this model with torchtrt.compile(model, ir="torch_compile", ...). I am still investigating the TorchScript path for the model.

When tracing this model with the torch_compile IR option on main, we encounter the errors from #2183 and #2227, for which a fix is in-progress. I will post an update on this model once that fix is ready. In the meantime, I am looking further into the TorchScript broadcasting issue.

I am able to reproduce this error in the TorchScript path on the latest main and with NeMo toolkit 1.20.0. It seems to stem from tensor addition operators which are not broadcastable (looks like [1, 4, 6204, 6204] and [1, 4, 6204, 4135] are being added). I’m not yet sure what is causing this mismatch, but it seems to be either a converter or lowering pass.
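
To show why these two shapes cannot be added, here is a minimal sketch of the standard right-aligned broadcasting rule in plain Python. The helper name `broadcastable` is hypothetical and not part of PyTorch or Torch-TensorRT; it only compares shapes and does not touch any tensors.

```python
from itertools import zip_longest


def broadcastable(shape_a, shape_b):
    """Return True if two shapes are compatible under right-aligned broadcasting."""
    # Compare dimensions from the trailing axis backwards; missing leading
    # dimensions are treated as size 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a != b and a != 1 and b != 1:
            return False
    return True


# Shapes reported in the comment above: the last axis is 6204 vs 4135.
print(broadcastable([1, 4, 6204, 6204], [1, 4, 6204, 4135]))

# For comparison, a size-1 trailing axis would be compatible.
print(broadcastable([1, 4, 6204, 6204], [1, 4, 6204, 1]))
```

Two dimensions are compatible only when they are equal or one of them is 1, so the trailing pair (6204, 4135) is what triggers the "broadcast dimensions must be conformable" error.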

@gs-olive can you take a look at this nemo model?