mmrotate: RuntimeError: CUDA error: an illegal memory access was encountered when training cfa
Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version (master) or latest version (1.x).
Task
I’m using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
1.x branch https://github.com/open-mmlab/mmrotate/tree/1.x
Environment
sys.platform: linux Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] CUDA available: True numpy_random_seed: 2147483648 GPU 0,1,2,3: TITAN X (Pascal) CUDA_HOME: /usr/local/cuda-10.1 NVCC: Cuda compilation tools, release 10.1, V10.1.10 GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609 PyTorch: 1.7.1+cu101 PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel® Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel® 64 architecture applications
- Intel® MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.8.2+cu101 OpenCV: 4.6.0 MMEngine: 0.3.0 MMRotate: 1.0.0rc0+
Reproduces the problem - code sample
python tools/tain.py configs/cfa/cfa-qbox_r50_fpn_1x_dota.py
Reproduces the problem - command or script
CUDA_LAUNCH_BLOCKING=1 python tools/tain.py configs/cfa/cfa-qbox_r50_fpn_1x_dota.py
Reproduces the problem - error message
Result has been saved to /media/amax/partion2/lsy/work_dir/mmlabV2/mmrotate/cfa/cfa_qbox_r50_fpn_1x_dota1/modules_statistic_results.json 11/13 14:51:42 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used. 11/13 14:51:46 - mmengine - WARNING - Failed to search registry with scope “mmrotate” in the “optim_wrapper” registry tree. As a workaround, the current “optim_wrapper” registry in “mmengine” is used to build instance. This may cause unexpected failure when running the built modules. Please check whether “mmrotate” is a correct scope, or whether the registry is initialized. fatal: Not a git repository (or any parent up to mount point /media/amax/partion2) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set). 11/13 14:51:47 - mmengine - INFO - load model from: torchvision://resnet50 11/13 14:51:47 - mmengine - INFO - torchvision loads checkpoint from path: torchvision://resnet50 11/13 14:51:48 - mmengine - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
11/13 14:51:48 - mmengine - INFO - Checkpoints will be saved to /media/amax/partion2/lsy/work_dir/mmlabV2/mmrotate/cfa/cfa_qbox_r50_fpn_1x_dota1.
/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/structures/bbox/quadri_boxes.py:146: UserWarning: The clip function does nothing in QuadriBoxes.
warnings.warn(‘The clip function does nothing in QuadriBoxes.’)
/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/structures/bbox/quadri_boxes.py:146: UserWarning: The clip function does nothing in QuadriBoxes.
warnings.warn(‘The clip function does nothing in QuadriBoxes.’)
Traceback (most recent call last):
File “tools/train.py”, line 122, in <module>
main()
File “tools/train.py”, line 118, in main
runner.train()
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/runner/runner.py”, line 1661, in train
model = self.train_loop.run() # type: ignore
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/runner/loops.py”, line 90, in run
self.run_epoch()
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/runner/loops.py”, line 106, in run_epoch
self.run_iter(idx, data_batch)
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/runner/loops.py”, line 122, in run_iter
outputs = self.runner.model.train_step(
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py”, line 114, in train_step
losses = self._run_forward(data, mode=‘loss’) # type: ignore
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py”, line 320, in _run_forward
results = self(**data, mode=mode)
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, kwargs)
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmdetection3.x/mmdet/models/detectors/base.py”, line 92, in forward
return self.loss(inputs, data_samples)
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmdetection3.x/mmdet/models/detectors/single_stage.py”, line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmdetection3.x/mmdet/models/dense_heads/base_dense_head.py”, line 123, in loss
losses = self.loss_by_feat(loss_inputs)
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/models/dense_heads/cfa_head.py”, line 186, in loss_by_feat
pos_losses_list, = multi_apply(self.get_pos_loss, cls_scores,
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmdetection3.x/mmdet/models/utils/misc.py”, line 218, in multi_apply
return tuple(map(list, zip(map_results)))
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/models/dense_heads/cfa_head.py”, line 379, in get_pos_loss
loss_bbox = self.loss_bbox_refine(
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(input, **kwargs)
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/models/losses/convex_giou_loss.py”, line 111, in forward
loss = self.loss_weight * convex_giou_loss(
File “/media/amax/partion2/lsy/workspace/mmlab_V2/mmrotate/mmrotate/models/losses/convex_giou_loss.py”, line 37, in forward
convex_gious, grad = convex_giou(pred, target)
File “/home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/ops/convex_iou.py”, line 28, in convex_giou
ext_module.convex_giou(pointsets, polygons, output)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from ConvexGIoUCUDAKernelLauncher at /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/convex_iou.cu:40 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6faf2928b2 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: ConvexGIoUCUDAKernelLauncher(at::Tensor, at::Tensor, at::Tensor) + 0x1d6 (0x7f6f8ccb85c3 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #2: convex_giou_cuda(at::Tensor, at::Tensor, at::Tensor) + 0x67 (0x7f6f8ccc8417 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #3: auto Dispatch<DeviceRegistry<void ()(at::Tensor, at::Tensor, at::Tensor), &(convex_giou_impl(at::Tensor, at::Tensor, at::Tensor))>, at::Tensor const&, at::Tensor const&, at::Tensor&>(DeviceRegistry<void ()(at::Tensor, at::Tensor, at::Tensor), &(convex_giou_impl(at::Tensor, at::Tensor, at::Tensor))> const&, char const, at::Tensor const&, at::Tensor const&, at::Tensor&) + 0x802 (0x7f6f8cbfe182 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #4: convex_giou(at::Tensor, at::Tensor, at::Tensor) + 0x67 (0x7f6f8cbfbdd7 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x330513 (0x7f6f8ce8a513 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x350190 (0x7f6f8ceaa190 in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x34de4e (0x7f6f8cea7e4e in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #16: THPFunction_apply(_object, _object) + 0x93d (0x7f6ff9a732dd in /home/amax/anaconda3/envs/openmmlab2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
Additional information
I’m using DOTA V1.0 dataset, and this error occurs when training
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (2 by maintainers)
I also encountered this problem, It can be trained at the beginning, but this problem occurs during the training process or validation randomly.
According to https://github.com/open-mmlab/mmcv/issues/2407, adding small random noise to the input of mmcv.ops.min_area_polygons seems to be a feasible solution. As stated in the above link, this bug may be caused by numerical instability of min_area_polygons cuda op.
Meanwhile, I think cv2.minAreaRect is another way to fix this bug as cv2.minAreaRect can return the true min area polygons without adding any noises to the input. But the speed of training might be slowed down as cv2.minAreaRect is running on cpu.
FYI, @yangxue0827 Unfortunately, all the solutions you provided is proven to be unuseful in my case
In this case, the convex hull of
pred_ptare two same points. But when calculating the area of the convex hull of the union ofpred_ptandtarget, it will fall into an infinite loop in the same code snippets.I’ve tried reinstall mmcv using command above, but this erroe still exist. And it seems that all models using convex_giou_loss will occur this error(I’ve tested training cfa, rotated_reppoints and oriented_reppoints, they all have the same error). And when I train roi_transformer, this error doesn’t exist.
Update for this bug: After several testing, it seems that mmcv ops: min_area_polygon and convex_giou all have numerical instability problems for specific predictions. Here are the details.
For min_area_polygon function, I tested the case where the input convex is extremely small, for example, all points in convex are 0 or random numbers generated with mean of 0 and variance of 1e-30 are used as points of convex. However, it works normally in these cases, so I further tested the function and find a convex data that can reproduce this problem:
You can find that the point of this convex are distributed in the corner as well as the boundary of the image(all images in my dataset are 1024*1024), I don’t know the exact reason of this problem, but it seems that convex points distributed in the corner or border of the image may lead to the collapse of this function.
For convex_giou function, I also test extremely small convex input(the convex area is extremely small). In such situation, convex_iou works normally but convex_giou returns negative iou with 0 grad. However, it doesn’t occur errors in this situation and can still running. After further testing, I found that specific predicted convex with specific target quadrangle may lead to the collapse of convex_giou. Here is an example I found can reproduce the error of convex_giou:
In this example, the target quadrangle intersects the boundary of the image, and this error only occurs when convex points are all right bottom corner of the image, even convex with all 0 points will not lead to this error. I don’t know why such thing happens and it’s so wired. For this case, simply add some small random noise with std=1e-3 to every convex point can avoid the error.
Update for this bug: I found that when train with Adam, the feature map result from the forward function would produce extreme large value (1e32 or -1e32) unexpectedly in an iteration. Before this iteration, everything goes normally. It is the extreme big value cracked the function min_area_polygons when training the reppoints. But when I switch the optimizer from Adam to SGD, the model can be trained without any problem.
I encountered the same problem using oriented_reppoints. I did extra experiment, it shows that when using mmcv.ops.min_area_polygons, the same CUDA error arises. I can’t tell what the problem is, as I carefully inspect the input data, I can’t see anything abnormal and this happens randomly. I guess there are some problems in the source code of the function min_area_polygons.