TensorRT: 🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck; bug caused by exception elimination

Bug Description

After calling

auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);

the process gets stuck in what appears to be an infinite loop. I can also observe that the GPU load drops back to 0% after about 1 s.

According to https://github.com/pytorch/TensorRT/pull/1409, this issue should already have been fixed.
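For reference, a minimal sketch of the compile call (not the original source; the model path, input shape, and precision are placeholder assumptions):

#include <torch/script.h>
#include <torch_tensorrt/torch_tensorrt.h>
#include <vector>

int main() {
  // "model.ts" and the input shape are placeholders, not taken from the report.
  torch::jit::Module module = torch::jit::load("model.ts");
  module.to(torch::kCUDA);
  module.eval();

  // One static FP32 input; adjust shape and precision to the actual model.
  std::vector<torch_tensorrt::Input> inputs{
      torch_tensorrt::Input(std::vector<int64_t>{1, 3, 224, 224})};
  torch_tensorrt::torchscript::CompileSpec compile_settings(inputs);
  compile_settings.enabled_precisions = {torch::kFloat};

  // This call never returns; the backtrace below ends in EliminateExceptions.
  auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
  trt_mod.save("model_trt.ts");
  return 0;
}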

Error message

1  __memmove_avx_unaligned 0x7fff79289cc1
2  std::vector<torch::jit::Use>::_M_erase(__gnu_cxx::__normal_iterator<torch::jit::Use *, std::vector<torch::jit::Use>>) 0x7fffab48412f
3  torch::jit::Value::replaceFirstUseWith(torch::jit::Value *) 0x7fffab46ff5d
4  torch::jit::Value::replaceAllUsesWith(torch::jit::Value *) 0x7fffab46ffcb
5  torch::jit::EliminateExceptions(torch::jit::Block *) 0x7fffab63c3c9
6  torch::jit::EliminateExceptions(std::shared_ptr<torch::jit::Graph>&) 0x7fffab63c999
7  torch_tensorrt::core::lowering::LowerGraph(std::shared_ptr<torch::jit::Graph>&, std::vector<c10::IValue>&, torch_tensorrt::core::lowering::LowerInfo) 0x7fffd7426b0d
8  torch_tensorrt::core::lowering::Lower(torch::jit::Module const&, std::string, torch_tensorrt::core::lowering::LowerInfo const&) 0x7fffd742a181
9  torch_tensorrt::core::CompileGraph(torch::jit::Module const&, torch_tensorrt::core::CompileSpec) 0x7fffd732b5a8
10 torch_tensorrt::torchscript::compile(torch::jit::Module const&, torch_tensorrt::torchscript::CompileSpec) 0x7fffd7313a04
11 ModelLoader::optimizeWithTensorRT modelloader.cpp 266 0x5ad43c
12 InferenceDisplay::<lambda()>::<lambda()>::operator() inferencedisplay.cpp 1330 0x58c996
13 std::_Function_handler<void(), InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>::<lambda()>>::_M_invoke(const std::_Any_data &) std_function.h 316 0x58c996
14 std::function<void ()>::operator()() const std_function.h 706 0x5cbcca
15 errorwrapper::loading(std::function<void ()>) errorwrapper.cpp 11 0x5cbcca
16 InferenceDisplay::<lambda()>::operator() inferencedisplay.cpp 1333 0x58e127
17 QtPrivate::FunctorCall<QtPrivate::IndexesList<>, QtPrivate::List<>, void, InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>>::call qobjectdefs_impl.h 146 0x58e127
18 QtPrivate::Functor<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0>::call<QtPrivate::List<>, void> qobjectdefs_impl.h 256 0x58e127
19 QtPrivate::QFunctorSlotObject<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0, QtPrivate::List<>, void>::impl(int, QtPrivate::QSlotObjectBase *, QObject *, void * *, bool *) qobjectdefs_impl.h 439 0x58e127
20 QMetaObject::activate(QObject *, int, int, void * *) 0x7fff7a163f8f …
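The top of the trace shows torch::jit::EliminateExceptions spinning inside Torch-TensorRT's lowering. A minimal sketch to check whether the hang reproduces in the JIT pass alone, outside of Torch-TensorRT (the header path and model filename are assumptions):

#include <iostream>
#include <torch/script.h>
// Header path is an assumption; this declares the pass seen in frames 5-6.
#include <torch/csrc/jit/passes/remove_exceptions.h>

int main() {
  // Placeholder path for the TorchScript module from the report.
  torch::jit::Module module = torch::jit::load("model.ts");
  std::shared_ptr<torch::jit::Graph> graph = module.get_method("forward").graph();
  // Run only the pass that the backtrace points at.
  torch::jit::EliminateExceptions(graph);  // with the bug, this never returns
  std::cout << "EliminateExceptions finished" << std::endl;
  return 0;
}

If this loops forever as well, the hang is in the upstream pass rather than in Torch-TensorRT itself.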

Expected behavior

Successful Torch-TensorRT optimization of a TorchScript model.

Environment

  • Torch-TensorRT Version: v1.3.0
  • PyTorch Version: 1.13.0 (libtorch 1.13+cu117)
  • OS: Linux
  • CUDA Version: 11.7
  • cuDNN Version: 8.5.0.96
  • TensorRT Version: 8.5.2.2

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 31 (1 by maintainers)

Most upvoted comments

Update: I just installed the libtorch nightly binary, which comes with CUDA 12.1.

I built a torch_tensorrt Docker image with CUDA 12.1 and then built the torch_tensorrt library against it. When compiling my minimal example, the build fails at the link step with three undefined references to versioned glibc symbols:

~/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api> ./build.sh
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-11 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-11 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TensorRT headers at /usr/local/TensorRT/include
-- Found TensorRT libs at /usr/local/TensorRT/lib/libnvinfer.so;/usr/local/TensorRT/lib/libnvinfer_plugin.so
-- Found TENSORRT: /usr/local/TensorRT/include
-- Found CUDA: /usr/local/cuda (found version "12.1")
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.1.105")
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is b51b459d
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s): 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: /usr/local/libtorch/lib/libtorch.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jrb/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api/build
[2/2] Linking CXX executable TensorRT-app
FAILED: TensorRT-app
: && /usr/bin/g++-11 -D_GLIBCXX_USE_CXX11_ABI=0 -g -rdynamic CMakeFiles/TensorRT-app.dir/TensorRT-app.cpp.o -o TensorRT-app -L/usr/local/torch_tensorrt/lib -Wl,-rpath,/usr/local/cudnn/lib:/usr/local/TensorRT/lib:/usr/local/torch_tensorrt/lib:/usr/local/libtorch/lib:/usr/local/cuda/lib64 /usr/local/cudnn/lib/libcudnn.so.8 /usr/local/TensorRT/lib/libnvinfer.so /usr/local/TensorRT/lib/libnvinfer_plugin.so -ltorchtrt /usr/local/libtorch/lib/libtorch.so /usr/local/libtorch/lib/libc10.so /usr/local/libtorch/lib/libkineto.a -lcuda /usr/local/cuda/lib64/libnvrtc.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/local/libtorch/lib/libc10_cuda.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cuda.so" -Wl,--as-needed /usr/local/libtorch/lib/libc10_cuda.so /usr/local/libtorch/lib/libc10.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublasLt.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch.so" -Wl,--as-needed && :

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to lstat@GLIBC_2.33

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to __libc_single_threaded@GLIBC_2.32

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to stat@GLIBC_2.33

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

The versioned symbols (lstat@GLIBC_2.33, __libc_single_threaded@GLIBC_2.32) suggest that libtorchtrt.so was built inside the container against a newer glibc than the one my host toolchain links against.

The docker build command works fine. In both cases described in my previous comment, the problem arises inside the running Docker container, either when running

bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi

or

/docker/dist-build.sh

I suspect this might be related to issue #1823.