TensorRT: 🐛 [Bug] torch_tensorrt::torchscript::compile gets stuck; bug caused by exception elimination

Bug Description

After calling

auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);

the process gets stuck in what appears to be an infinite loop. I can also observe that the GPU load drops back to 0% after about 1 s.

According to https://github.com/pytorch/TensorRT/pull/1409, this issue should already have been fixed.
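For reference, a minimal sketch of the compile call (not the original source; the model path, input shape, and precision are placeholder assumptions):

#include <torch/script.h>
#include <torch_tensorrt/torch_tensorrt.h>
#include <vector>

int main() {
  // "model.ts" and the input shape are placeholders, not taken from the report.
  torch::jit::Module module = torch::jit::load("model.ts");
  module.to(torch::kCUDA);
  module.eval();

  // One static FP32 input; adjust shape and precision to the actual model.
  std::vector<torch_tensorrt::Input> inputs{
      torch_tensorrt::Input(std::vector<int64_t>{1, 3, 224, 224})};
  torch_tensorrt::torchscript::CompileSpec compile_settings(inputs);
  compile_settings.enabled_precisions = {torch::kFloat};

  // This call never returns; the backtrace below ends in EliminateExceptions.
  auto trt_mod = torch_tensorrt::torchscript::compile(module, compile_settings);
  trt_mod.save("model_trt.ts");
  return 0;
}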

Error message

1  __memmove_avx_unaligned 0x7fff79289cc1
2  std::vector<torch::jit::Use>::_M_erase(__gnu_cxx::__normal_iterator<torch::jit::Use *, std::vector<torch::jit::Use>>) 0x7fffab48412f
3  torch::jit::Value::replaceFirstUseWith(torch::jit::Value *) 0x7fffab46ff5d
4  torch::jit::Value::replaceAllUsesWith(torch::jit::Value *) 0x7fffab46ffcb
5  torch::jit::EliminateExceptions(torch::jit::Block *) 0x7fffab63c3c9
6  torch::jit::EliminateExceptions(std::shared_ptr<torch::jit::Graph>&) 0x7fffab63c999
7  torch_tensorrt::core::lowering::LowerGraph(std::shared_ptr<torch::jit::Graph>&, std::vector<c10::IValue>&, torch_tensorrt::core::lowering::LowerInfo) 0x7fffd7426b0d
8  torch_tensorrt::core::lowering::Lower(torch::jit::Module const&, std::string, torch_tensorrt::core::lowering::LowerInfo const&) 0x7fffd742a181
9  torch_tensorrt::core::CompileGraph(torch::jit::Module const&, torch_tensorrt::core::CompileSpec) 0x7fffd732b5a8
10 torch_tensorrt::torchscript::compile(torch::jit::Module const&, torch_tensorrt::torchscript::CompileSpec) 0x7fffd7313a04
11 ModelLoader::optimizeWithTensorRT modelloader.cpp 266 0x5ad43c
12 InferenceDisplay::<lambda()>::<lambda()>::operator() inferencedisplay.cpp 1330 0x58c996
13 std::_Function_handler<void(), InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>::<lambda()>>::_M_invoke(const std::_Any_data &) std_function.h 316 0x58c996
14 std::function<void ()>::operator()() const std_function.h 706 0x5cbcca
15 errorwrapper::loading(std::function<void ()>) errorwrapper.cpp 11 0x5cbcca
16 InferenceDisplay::<lambda()>::operator() inferencedisplay.cpp 1333 0x58e127
17 QtPrivate::FunctorCall<QtPrivate::IndexesList<>, QtPrivate::List<>, void, InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>>::call qobjectdefs_impl.h 146 0x58e127
18 QtPrivate::Functor<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0>::call<QtPrivate::List<>, void> qobjectdefs_impl.h 256 0x58e127
19 QtPrivate::QFunctorSlotObject<InferenceDisplay::InferenceDisplay(QWidget *, DataController&)::<lambda()>, 0, QtPrivate::List<>, void>::impl(int, QtPrivate::QSlotObjectBase *, QObject *, void * *, bool *) qobjectdefs_impl.h 439 0x58e127
20 QMetaObject::activate(QObject *, int, int, void * *) 0x7fff7a163f8f …
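The top of the trace shows torch::jit::EliminateExceptions spinning inside Torch-TensorRT's lowering. A minimal sketch to check whether the hang reproduces in the JIT pass alone, outside of Torch-TensorRT (the header path and model filename are assumptions):

#include <iostream>
#include <torch/script.h>
// Header path is an assumption; this declares the pass seen in frames 5-6.
#include <torch/csrc/jit/passes/remove_exceptions.h>

int main() {
  // Placeholder path for the TorchScript module from the report.
  torch::jit::Module module = torch::jit::load("model.ts");
  std::shared_ptr<torch::jit::Graph> graph = module.get_method("forward").graph();
  // Run only the pass that the backtrace points at.
  torch::jit::EliminateExceptions(graph);  // with the bug, this never returns
  std::cout << "EliminateExceptions finished" << std::endl;
  return 0;
}

If this loops forever as well, the hang is in the upstream pass rather than in Torch-TensorRT itself.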

Expected behavior

Successful Torch-TensorRT optimization of a TorchScript model.

Environment

  • Torch-TensorRT Version: v1.3.0
  • PyTorch Version: 1.13.0 (libtorch 1.13+cu117)
  • OS: Linux
  • CUDA Version: 11.7
  • cuDNN Version: 8.5.0.96
  • TensorRT Version: 8.5.2.2

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 31 (1 by maintainers)

Most upvoted comments

Update: I just installed the libtorch nightly binary, which comes with CUDA 12.1.

I built a torch_tensorrt Docker image with CUDA 12.1 and then built the torch_tensorrt library against it. When compiling my minimal example, the build fails at the link step with three undefined references to versioned glibc symbols:

~/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api> ./build.sh
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-11 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-11 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found TensorRT headers at /usr/local/TensorRT/include
-- Found TensorRT libs at /usr/local/TensorRT/lib/libnvinfer.so;/usr/local/TensorRT/lib/libnvinfer_plugin.so
-- Found TENSORRT: /usr/local/TensorRT/include
-- Found CUDA: /usr/local/cuda (found version "12.1")
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.1.105")
-- Caffe2: CUDA detected: 12.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.1
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is b51b459d
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s): 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Found Torch: /usr/local/libtorch/lib/libtorch.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jrb/Dokumente/Projekte/Torch-TensorRT-Minimal-Example/tensorrt_api/build
[2/2] Linking CXX executable TensorRT-app
FAILED: TensorRT-app
: && /usr/bin/g++-11 -D_GLIBCXX_USE_CXX11_ABI=0 -g -rdynamic CMakeFiles/TensorRT-app.dir/TensorRT-app.cpp.o -o TensorRT-app -L/usr/local/torch_tensorrt/lib -Wl,-rpath,/usr/local/cudnn/lib:/usr/local/TensorRT/lib:/usr/local/torch_tensorrt/lib:/usr/local/libtorch/lib:/usr/local/cuda/lib64 /usr/local/cudnn/lib/libcudnn.so.8 /usr/local/TensorRT/lib/libnvinfer.so /usr/local/TensorRT/lib/libnvinfer_plugin.so -ltorchtrt /usr/local/libtorch/lib/libtorch.so /usr/local/libtorch/lib/libc10.so /usr/local/libtorch/lib/libkineto.a -lcuda /usr/local/cuda/lib64/libnvrtc.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/local/libtorch/lib/libc10_cuda.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch_cuda.so" -Wl,--as-needed /usr/local/libtorch/lib/libc10_cuda.so /usr/local/libtorch/lib/libc10.so /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcufft.so /usr/local/cuda/lib64/libcurand.so /usr/local/cuda/lib64/libcublas.so /usr/local/cuda/lib64/libcublasLt.so -Wl,--no-as-needed,"/usr/local/libtorch/lib/libtorch.so" -Wl,--as-needed && :

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to lstat@GLIBC_2.33

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to __libc_single_threaded@GLIBC_2.32

/usr/lib64/gcc/x86_64-suse-linux/11/../../../../x86_64-suse-linux/bin/ld: /usr/local/torch_tensorrt/lib/libtorchtrt.so: undefined reference to stat@GLIBC_2.33

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

The versioned symbols (lstat@GLIBC_2.33, __libc_single_threaded@GLIBC_2.32) suggest that libtorchtrt.so was built inside the container against a newer glibc than the one my host toolchain links against.

The docker build command works fine. In both cases described in my previous comment, the problem arises inside the running Docker container, either when running

bazel test //tests/core/conversion/converters:test_activation --compilation_mode=opt --test_output=summary --config pre_cxx11_abi

or

/docker/dist-build.sh

I suspect this might be related to issue #1823.