mmdetection: ImportError: ./mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd
Hi, I'm facing a problem with training:
```
alexander@alexander-desktop:~/Code/Projects/mmdetection$ ./tools/dist_train.sh configs/retinanet_r101_fpn_1x.py 1
Traceback (most recent call last):
  File "./tools/train.py", line 7, in <module>
    from mmdet.datasets import get_dataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/__init__.py", line 1, in <module>
    from .custom import CustomDataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/custom.py", line 11, in <module>
    from .extra_aug import ExtraAugmentation
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/extra_aug.py", line 5, in <module>
    from mmdet.core.evaluation.bbox_overlaps import bbox_overlaps
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/__init__.py", line 6, in <module>
    from .post_processing import *  # noqa: F401, F403
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/__init__.py", line 1, in <module>
    from .bbox_nms import multiclass_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/bbox_nms.py", line 3, in <module>
    from mmdet.ops.nms import nms_wrapper
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/__init__.py", line 5, in <module>
    from .nms import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/__init__.py", line 1, in <module>
    from .nms_wrapper import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 4, in <module>
    from .gpu_nms import gpu_nms
ImportError: /home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd
```
I'm using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04. Everything compiled fine during installation.
About this issue
- State: closed
- Created 5 years ago
- Comments: 24 (3 by maintainers)
Just solved the same issue. Check the compatibility of your PyTorch and CUDA versions.
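A quick way to compare the two, assuming a working Python environment (torch.version.cuda reports the CUDA version PyTorch was built against):

```bash
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was compiled with
nvcc --version                                        # CUDA toolkit installed on the machine
```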
Well, I've solved this issue on my machine using PyTorch 1.1.0 (the latest version on GitHub).
gcc 5.x doesn't help, because some compile options in the CMakeLists.txt of PyTorch 1.1.0 are not supported by gcc 5.x, while gcc 7.x compiles PyTorch fine.
It seems that CUDA 10.0 is slightly different from CUDA 10.1. This issue is caused by a mismatch between the CUDA version PyTorch was compiled against and the CUDA version available at run time. Different mismatched run-time CUDA versions produce different errors; for example, "undefined symbol: __cudaPopCallConfiguration" can occur with earlier CUDA versions. Thus, my solution is to recompile PyTorch to match the run-time CUDA version. Changing the run-time CUDA version might also work, but I didn't test that. Here is how I fixed it.
(Ubuntu 18.04 only)
1. Uninstall PyTorch if the current install doesn't work:
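The exact command depends on how PyTorch was installed; for a pip or conda install, something like:

```bash
# If PyTorch came from pip
pip uninstall torch torchvision
# or, if it came from conda
conda uninstall pytorch torchvision
```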
2. Install CUDA-10.0 (optional)
This step is optional; other versions of CUDA should be fine as long as the compile-time CUDA version matches the run-time version.
Follow the run-file installation instructions here:
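Roughly, it amounts to downloading the CUDA 10.0 runfile from NVIDIA and running it (the file name below is an example and may differ for your download):

```bash
# Run the installer; deselect the bundled driver if a newer one is already installed
sudo sh cuda_10.0.130_410.48_linux.run
```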
Then check the nvcc version:
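For example (use the full install path if nvcc is not yet on your PATH):

```bash
nvcc --version                            # if nvcc is already on PATH
/usr/local/cuda-10.0/bin/nvcc --version   # or call it through the install path
```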
The output should be something like (release 10.0):
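Something like this (build details and minor versions will vary):

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Cuda compilation tools, release 10.0, V10.0.130
```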
Note that a symlink is needed after the CUDA installation:
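Presumably the usual /usr/local/cuda link (adjust the target if your install path differs):

```bash
sudo ln -sf /usr/local/cuda-10.0 /usr/local/cuda
```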
Then add the CUDA paths to your ~/.bashrc file; they will be used when compiling PyTorch.
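Typical entries, assuming the default install location:

```bash
# Append to ~/.bashrc
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```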
Use source so the paths above are loaded into your shell.
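For example:

```bash
source ~/.bashrc
nvcc --version   # should now report release 10.0
```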
3. Compile PyTorch
The instructions can be found here, but some details might be different.
Note that mkl=2019.3 is required. Details can be found in this issue.
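A condensed sketch of the build, assuming a conda environment and the v1.0.1 tag (the instructions linked above are the reference; the dependency list may differ for other versions):

```bash
# Build dependencies, with MKL pinned as noted above
conda install numpy pyyaml mkl=2019.3 mkl-include setuptools cmake cffi typing
# Fetch the sources and check out the matching release tag
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.0.1
git submodule sync && git submodule update --init --recursive
# Compile and install against the CUDA toolkit found via PATH/LD_LIBRARY_PATH
python setup.py install
```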
After all the steps above, it finally works.
I've met the same issue. I'm using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04 too. Note that CUDA 8.0 is not available for Ubuntu 18.04. I've tried compiling and installing PyTorch 1.0.1.post2 from source with CUDA 10.1, and the error "undefined symbol: __cudaRegisterFatBinaryEnd" still occurred.
I've also tried CUDA 9.0 with PyTorch 1.0.1.post2 and got the error "undefined symbol: __cudaPopCallConfiguration". Any tips?
As a reference, we have tried mmdetection on CUDA 9.0/9.2/10.0 with PyTorch 1.0 and CUDA 9.0/9.2 with PyTorch 0.4.1.
I had a similar problem with my model, but I had the luxury of it mostly working on my local system with my puny GPU (4 GB) while it failed on the school's GPU cluster (4 x V100, 32 GB). The catch is that the school's cluster runs through a restricted nvidia-docker platform, so the choice of setup is not entirely yours.
The key is looking at the symbols exported by the dynamic libraries. CUDA locates its runtime libraries through the LD_LIBRARY_PATH environment variable, and the missing symbol should be exported by libcudart.so. Use readelf -sW to peruse the symbol tables. On my system it looks like this:

```
$ echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64
$ cd /usr/local/cuda-10.0/lib64
$ readelf -sW *.so | grep FatBinaryEnd
  390: 0000000000010ab0    37 FUNC    GLOBAL DEFAULT   11 __cudaRegisterFatBinaryEnd@@libcudart.so.10.0
```

Here you can see that the __cudaRegisterFatBinaryEnd symbol is exported by libcudart.so.10.0. So now just make sure this library is on LD_LIBRARY_PATH, either by editing the path definition or by copying/linking the library file into a directory already on the path.
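For example, assuming the library lives under /usr/local/cuda-10.0/lib64:

```bash
# Option 1: put the directory containing libcudart.so.10.0 on LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Option 2: link the library into a directory the loader already searches, then refresh the cache
sudo ln -s /usr/local/cuda-10.0/lib64/libcudart.so.10.0 /usr/local/lib/ && sudo ldconfig
```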