mmdetection3d: Getting "CUDA error: an illegal memory access was encountered" on MVXNet
Continuing #336
Describe the bug
Getting "CUDA error: an illegal memory access was encountered" during training.
Reproduction
- What command or script did you run?
python tools/train.py configs/mvxnet/dv_mvx-fpn_second_secfpn_adamw_2x8_80e_kitti-3d-3class.py
- What dataset did you use? A custom dataset converted to KITTI format. Training on the plain KITTI dataset works, but with the custom dataset I get this error. I have regenerated the infos for my dataset (a quick range check is sketched below).
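In case it helps with triage, here is a minimal sanity check I could run, verifying that the converted points actually fall inside the configured point_cloud_range; the .bin path and the range values are placeholders, not taken from my actual setup:

```python
# Hypothetical check: do the converted points fit the configured range?
# The file path and point_cloud_range below are placeholders.
import numpy as np

point_cloud_range = [0, -40, -3, 70.4, 40, 1]  # x_min, y_min, z_min, x_max, y_max, z_max

points = np.fromfile('data/custom_kitti/training/velodyne/000000.bin',
                     dtype=np.float32).reshape(-1, 4)
xyz = points[:, :3]
low, high = np.array(point_cloud_range[:3]), np.array(point_cloud_range[3:])
inside = np.all((xyz >= low) & (xyz < high), axis=1)
print(f'{inside.sum()} / {len(xyz)} points inside the configured range')
```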
Environment
sys.platform: linux
Python: 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
CUDA available: True
GPU 0: TITAN X (Pascal)
CUDA_HOME: /home/kirilly/cuda10.1
NVCC: Cuda compilation tools, release 10.1, V10.1.105
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0+cu101
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0+cu101
OpenCV: 4.5.1
MMCV: 1.2.6
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1
MMDetection: 2.9.0
MMDetection3D: 0.10.0+ac9a3e8
Error traceback
2021-03-04 08:35:26,752 - mmdet - INFO - Epoch [1][50/8195] lr: 4.323e-04, eta: 1 day, 22:41:33, time: 0.513, data_time: 0.052, memory: 4503, loss_cls: 1.1683, loss_bbox: 2.4333, loss_dir: 0.1463, loss: 3.7479, grad_norm: 113.1708
2021-03-04 08:35:49,141 - mmdet - INFO - Epoch [1][100/8195] lr: 5.673e-04, eta: 1 day, 19:43:21, time: 0.448, data_time: 0.006, memory: 4523, loss_cls: 0.9347, loss_bbox: 2.0219, loss_dir: 0.1392, loss: 3.0958, grad_norm: 19.0356
Traceback (most recent call last):
File "tools/train.py", line 166, in <module>
main()
File "tools/train.py", line 162, in main
meta=meta)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmdet/apis/train.py", line 150, in train_detector
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
losses = self(**data)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/base.py", line 59, in forward
return self.forward_train(**kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 274, in forward_train
points, img=img, img_metas=img_metas)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/detectors/mvx_faster_rcnn.py", line 54, in extract_pts_feat
voxels, coors, points, img_feats, img_metas)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
return old_func(*args, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/models/voxel_encoders/voxel_encoder.py", line 244, in forward
voxel_mean, mean_coors = self.cluster_scatter(features, coors)
File "/home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 113, in forward
points[inds], coors[inds][:, 1:])
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 92, in forward_single
self.point_cloud_range)
File "/home/kirilly/git_repos/mmdetection3d/mmdet3d/ops/voxel/scatter_points.py", line 38, in forward
coors_range)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fc48ce74536 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7fc48d0b7fbe in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc48ce64abd in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x522fe2 (0x7fc4d4054fe2 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x523086 (0x7fc4d4055086 in /home/kirilly/v2pearl5p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x54f146]
frame #6: python() [0x588475]
frame #7: python() [0x572b20]
frame #8: python() [0x54ee4b]
frame #9: python() [0x54ee4b]
frame #10: python() [0x588948]
frame #11: python() [0x5ad418]
frame #12: python() [0x5ad42e]
frame #13: python() [0x5ad42e]
frame #14: python() [0x5ad42e]
frame #15: python() [0x5ad42e]
frame #16: python() [0x5ad42e]
frame #17: python() [0x5ad42e]
frame #18: python() [0x56b4c6]
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7fc4e5694bf7 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 27 (7 by maintainers)
Suppose the point cloud range is [0, -40, -3, 70.4, 40, 1] and the voxel size is [0.05, 0.05, 0.1]; then the shape of the intermediate feature map is [(1-(-3))/0.1+1, (40-(-40))/0.05, (70.4-0)/0.05] = [41, 1600, 1408]. You can check your sparse_shape in this way.
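For example, a small sketch of that arithmetic (the range and voxel size below are just the values from the example above; plug in your own config values to check your sparse_shape):

```python
# Sketch of the arithmetic above: derive sparse_shape from the point cloud
# range and the voxel size. Values mirror the example; substitute your own.
point_cloud_range = [0, -40, -3, 70.4, 40, 1]  # [x_min, y_min, z_min, x_max, y_max, z_max]
voxel_size = [0.05, 0.05, 0.1]                 # [dx, dy, dz]

nx = round((point_cloud_range[3] - point_cloud_range[0]) / voxel_size[0])  # 1408
ny = round((point_cloud_range[4] - point_cloud_range[1]) / voxel_size[1])  # 1600
nz = round((point_cloud_range[5] - point_cloud_range[2]) / voxel_size[2])  # 40

sparse_shape = [nz + 1, ny, nx]  # [41, 1600, 1408] for these values
print(sparse_shape)
```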
Hi @manonthegithub, thanks for your bug report. It seems that the error can be reproduced. Could you dump the input of the module that causes the error and provide it to us? That way we could create a tiny unit test that fully reproduces your error and add a check for it. Hopefully we can solve this bug with the input you provide.
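For instance, one possible way to capture those inputs (the helper name and save path are made up; the tensor names follow the traceback) would be to call something like the function below right before the failing call in mmdet3d/ops/voxel/scatter_points.py and attach the resulting file to this issue:

```python
# Hypothetical helper to dump the tensors that reach the failing op so they
# can be attached to the issue; the function name and output path are made up.
import torch

def dump_scatter_inputs(points, coors, point_cloud_range,
                        path='failing_scatter_inputs.pth'):
    # Move everything to CPU so the file can be loaded without a GPU.
    torch.save(
        {
            'points': points.detach().cpu(),
            'coors': coors.detach().cpu(),
            'point_cloud_range': point_cloud_range,
        },
        path,
    )
```

Loading that file with torch.load on our side should then be enough to build the unit test.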