mmdetection: ImportError: ./mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd
Hi, I'm facing a problem with training:
```
alexander@alexander-desktop:~/Code/Projects/mmdetection$ ./tools/dist_train.sh configs/retinanet_r101_fpn_1x.py 1
Traceback (most recent call last):
  File "./tools/train.py", line 7, in <module>
    from mmdet.datasets import get_dataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/__init__.py", line 1, in <module>
    from .custom import CustomDataset
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/custom.py", line 11, in <module>
    from .extra_aug import ExtraAugmentation
  File "/home/alexander/Code/Projects/mmdetection/mmdet/datasets/extra_aug.py", line 5, in <module>
    from mmdet.core.evaluation.bbox_overlaps import bbox_overlaps
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/__init__.py", line 6, in <module>
    from .post_processing import *  # noqa: F401, F403
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/__init__.py", line 1, in <module>
    from .bbox_nms import multiclass_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/core/post_processing/bbox_nms.py", line 3, in <module>
    from mmdet.ops.nms import nms_wrapper
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/__init__.py", line 5, in <module>
    from .nms import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/__init__.py", line 1, in <module>
    from .nms_wrapper import nms, soft_nms
  File "/home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 4, in <module>
    from .gpu_nms import gpu_nms
ImportError: /home/alexander/Code/Projects/mmdetection/mmdet/ops/nms/gpu_nms.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cudaRegisterFatBinaryEnd
```
I'm using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04. Everything compiled fine during installation.
About this issue
- State: closed
- Created 5 years ago
- Comments: 24 (3 by maintainers)
Just solved the same issue. Check the compatibility of your PyTorch and CUDA versions.
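A quick way to compare the two, assuming a working Python environment (torch.version.cuda reports the CUDA version PyTorch was built against):

```bash
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was compiled with
nvcc --version                                        # CUDA toolkit installed on the machine
```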
Well, I've solved this issue on my machine using PyTorch 1.1.0 (the latest version on GitHub).
gcc 5.x doesn't help, because some compile options in the CMakeLists.txt of PyTorch 1.1.0 are not supported by gcc 5.x, while gcc 7.x compiles PyTorch fine.
It seems that CUDA 10.0 is slightly different from CUDA 10.1. This issue is caused by a mismatch between the CUDA version PyTorch was compiled against and the CUDA version available at run time. Different mismatched run-time CUDA versions produce different errors; for example, "undefined symbol: __cudaPopCallConfiguration" can occur with earlier CUDA versions. Thus, my solution is to recompile PyTorch to match the run-time CUDA version. Changing the run-time CUDA version might also work, but I didn't test that. Here is how I fixed it.
(Ubuntu 18.04 only)
1. Uninstall PyTorch if the current install doesn't work:
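The exact command depends on how PyTorch was installed; for a pip or conda install, something like:

```bash
# If PyTorch came from pip
pip uninstall torch torchvision
# or, if it came from conda
conda uninstall pytorch torchvision
```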
2. Install CUDA-10.0 (optional)
This step is optional; other versions of CUDA should be fine as long as the compile-time CUDA version matches the run-time version.
Follow the run-file installation instructions here:
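Roughly, it amounts to downloading the CUDA 10.0 runfile from NVIDIA and running it (the file name below is an example and may differ for your download):

```bash
# Run the installer; deselect the bundled driver if a newer one is already installed
sudo sh cuda_10.0.130_410.48_linux.run
```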
Then check the nvcc version:
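For example (use the full install path if nvcc is not yet on your PATH):

```bash
nvcc --version                            # if nvcc is already on PATH
/usr/local/cuda-10.0/bin/nvcc --version   # or call it through the install path
```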
The output should be something like (release 10.0):
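Something like this (build details and minor versions will vary):

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Cuda compilation tools, release 10.0, V10.0.130
```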
Note that a symlink is needed after the CUDA installation:
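Presumably the usual /usr/local/cuda link (adjust the target if your install path differs):

```bash
sudo ln -sf /usr/local/cuda-10.0 /usr/local/cuda
```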
Then add the CUDA paths to your ~/.bashrc file; they will be used when compiling PyTorch.
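Typical entries, assuming the default install location:

```bash
# Append to ~/.bashrc
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```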
Use source so the paths above are loaded into your shell.
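For example:

```bash
source ~/.bashrc
nvcc --version   # should now report release 10.0
```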
3. Compile PyTorch
The instructions can be found here, but some details might be different.
Note that mkl=2019.3 is required. Details can be found in this issue.
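A condensed sketch of the build, assuming a conda environment and the v1.0.1 tag (the instructions linked above are the reference; the dependency list may differ for other versions):

```bash
# Build dependencies, with MKL pinned as noted above
conda install numpy pyyaml mkl=2019.3 mkl-include setuptools cmake cffi typing
# Fetch the sources and check out the matching release tag
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.0.1
git submodule sync && git submodule update --init --recursive
# Compile and install against the CUDA toolkit found via PATH/LD_LIBRARY_PATH
python setup.py install
```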
After all the steps above, it finally works.
I've met the same issue. I'm using CUDA 10.1, PyTorch 1.0.1.post2, and Python 3.6 on Ubuntu 18.04 too. Note that CUDA 8.0 is not available for Ubuntu 18.04. I've tried compiling and installing PyTorch 1.0.1.post2 from source with CUDA 10.1, and the error "undefined symbol: __cudaRegisterFatBinaryEnd" still occurred.
I've also tried CUDA 9.0 with PyTorch 1.0.1.post2 and got the error "undefined symbol: __cudaPopCallConfiguration". Any tips?
As a reference, we have tried mmdetection on CUDA 9.0/9.2/10.0 with PyTorch 1.0 and CUDA 9.0/9.2 with PyTorch 0.4.1.
I had a similar problem with my model, but I had the luxury of it mostly working on my local system with my puny GPU (4 GB) while it failed on the school's GPU cluster (4 x V100, 32 GB). The catch is that the school's cluster runs through a restricted nvidia-docker platform, so the choice of setup is not entirely yours.
The key is looking at the symbols exported by the dynamic libraries. CUDA locates its runtime libraries through the LD_LIBRARY_PATH environment variable, and the missing symbol should be exported by libcudart.so. Use readelf -sW to peruse the symbol tables. On my system it looks like this:

```
$ echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64
$ cd /usr/local/cuda-10.0/lib64
$ readelf -sW *.so | grep FatBinaryEnd
  390: 0000000000010ab0    37 FUNC    GLOBAL DEFAULT   11 __cudaRegisterFatBinaryEnd@@libcudart.so.10.0
```

Here you can see that the __cudaRegisterFatBinaryEnd symbol is exported by libcudart.so.10.0. So now just make sure this library is on LD_LIBRARY_PATH, either by editing the path definition or by copying/linking the library file into a directory already on the path.
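For example, assuming the library lives under /usr/local/cuda-10.0/lib64:

```bash
# Option 1: put the directory containing libcudart.so.10.0 on LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# Option 2: link the library into a directory the loader already searches, then refresh the cache
sudo ln -s /usr/local/cuda-10.0/lib64/libcudart.so.10.0 /usr/local/lib/ && sudo ldconfig
```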