mmdetection3d: Getting "CUDA error: an illegal memory access was encountered" on MVXNet
Continuing #336
Describe the bug
Getting "CUDA error: an illegal memory access was encountered" during training.
Reproduction
- What command or script did you run?
python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
- What dataset did you use? A custom dataset converted to KITTI format. Training on the plain KITTI dataset works, but with the custom dataset I get this error. I have regenerated the infos for my dataset (a quick range check is sketched below).
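In case it helps with triage, here is a minimal sanity check I could run, verifying that the converted points actually fall inside the configured point_cloud_range; the .bin path and the range values are placeholders, not taken from my actual setup:

```python
# Hypothetical check: do the converted points fit the configured range?
# The file path and point_cloud_range below are placeholders.
import numpy as np

point_cloud_range = [0, -40, -3, 70.4, 40, 1]  # x_min, y_min, z_min, x_max, y_max, z_max

points = np.fromfile('data/custom_kitti/training/velodyne/000000.bin',
                     dtype=np.float32).reshape(-1, 4)
xyz = points[:, :3]
low, high = np.array(point_cloud_range[:3]), np.array(point_cloud_range[3:])
inside = np.all((xyz >= low) & (xyz < high), axis=1)
print(f'{inside.sum()} / {len(xyz)} points inside the configured range')
```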
Environment
sys.platform: linux
Python: 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
CUDA available: True
GPU 0: TITAN X (Pascal)
CUDA_HOME: /home/kirilly/cuda10.1
NVCC: Cuda compilation tools, release 10.1, V10.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0+cu101
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0+cu101
OpenCV: 4.5.1
MMCV: 1.2.6
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1
MMDetection: 2.9.0
MMDetection3D: 0.10.0+ac9a3e8
Error traceback
2021-03-04 08:35:26,752 - mmdet - INFO - Epoch [1][50/8195] lr: 4.323e-04, eta: 1 day, 22:41:33, time: 0.513, data_time: 0.052, memory: 4503, loss_cls: 1.1683, loss_bbox: 2.4333, loss_dir: 0.1463, loss: 3.7479, grad_norm: 113.1708
2021-03-04 08:35:49,141 - mmdet - INFO - Epoch [1][100/8195] lr: 5.673e-04, eta: 1 day, 19:43:21, time: 0.448, data_time: 0.006, memory: 4523, loss_cls: 0.9347, loss_bbox: 2.0219, loss_dir: 0.1392, loss: 3.0958, grad_norm: 19.0356
Traceback (most recent call last):
File "tools/train.py", line 166, in <module>
main()
File "tools/train.py", line 162, in main
meta=meta)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmdet/apis/train.py", line 150, in train_detector
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
losses = self(**data)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/base.py", line 59, in forward
return self.forward_train(**kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 274, in forward_train
points, img=img, img_metas=img_metas)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 54, in extract_pts_feat
voxels, coors, points, img_feats, img_metas)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
return old_func(*args, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/voxel_encoders/voxel_encoder.py", line 244, in forward
voxel_mean, mean_coors = self.cluster_scatter(features, coors)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 113, in forward
points[inds], coors[inds][:, 1:])
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 92, in forward_single
self.point_cloud_range)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 38, in forward
coors_range)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fc48ce74536 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7fc48d0b7fbe in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc48ce64abd in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x522fe2 (0x7fc4d4054fe2 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x523086 (0x7fc4d4055086 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x54f146]
frame #6: python() [0x588475]
frame #7: python() [0x572b20]
frame #8: python() [0x54ee4b]
frame #9: python() [0x54ee4b]
frame #10: python() [0x588948]
frame #11: python() [0x5ad418]
frame #12: python() [0x5ad42e]
frame #13: python() [0x5ad42e]
frame #14: python() [0x5ad42e]
frame #15: python() [0x5ad42e]
frame #16: python() [0x5ad42e]
frame #17: python() [0x5ad42e]
frame #18: python() [0x56b4c6]
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7fc4e5694bf7 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 27 (7 by maintainers)
Suppose the point cloud range is [0, -40, -3, 70.4, 40, 1] and the voxel size is [0.05, 0.05, 0.1]; then the shape of the intermediate feature map is [(1-(-3))/0.1+1, (40-(-40))/0.05, (70.4-0)/0.05] = [41, 1600, 1408]. You can check your sparse_shape in this way.
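For example, a small sketch of that arithmetic (the range and voxel size below are just the values from the example above; plug in your own config values to check your sparse_shape):

```python
# Sketch of the arithmetic above: derive sparse_shape from the point cloud
# range and the voxel size. Values mirror the example; substitute your own.
point_cloud_range = [0, -40, -3, 70.4, 40, 1]  # [x_min, y_min, z_min, x_max, y_max, z_max]
voxel_size = [0.05, 0.05, 0.1]                 # [dx, dy, dz]

nx = round((point_cloud_range[3] - point_cloud_range[0]) / voxel_size[0])  # 1408
ny = round((point_cloud_range[4] - point_cloud_range[1]) / voxel_size[1])  # 1600
nz = round((point_cloud_range[5] - point_cloud_range[2]) / voxel_size[2])  # 40

sparse_shape = [nz + 1, ny, nx]  # [41, 1600, 1408] for these values
print(sparse_shape)
```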
Hi @manonthegithub, thanks for your bug report. It seems that the error can be reproduced. Could you dump the input of the module that causes the error and provide it to us? That way we could create a tiny unit test that fully reproduces your error and add a check for it. Hopefully we can solve this bug with the input you provide.
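For instance, one possible way to capture those inputs (the helper name and save path are made up; the tensor names follow the traceback) would be to call something like the function below right before the failing call in mmdet3d/ops/voxel/scatter_points.py and attach the resulting file to this issue:

```python
# Hypothetical helper to dump the tensors that reach the failing op so they
# can be attached to the issue; the function name and output path are made up.
import torch

def dump_scatter_inputs(points, coors, point_cloud_range,
                        path='failing_scatter_inputs.pth'):
    # Move everything to CPU so the file can be loaded without a GPU.
    torch.save(
        {
            'points': points.detach().cpu(),
            'coors': coors.detach().cpu(),
            'point_cloud_range': point_cloud_range,
        },
        path,
    )
```

Loading that file with torch.load on our side should then be enough to build the unit test.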