faster-rcnn.pytorch: nms_gpu throws runtime error: illegal memory access

I'm trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum resolution is 1000 (most images are 600x900), and adjusted a few details, but I get the following error while training:

Called with args:
Namespace(batch_size=8, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='my_custom_ds', disp_interval=100, large_scale=False, lr=0.004, lr_decay_gamma=0.1, lr_decay_step=8, mGPUs=True, max_epochs=2, net='res101', num_workers=2, optimizer='sgd', resume=False, save_dir='saved_models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'res101',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 93,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
               'FIXED_LAYERS': 5,
               'REGU_DEPTH': False,
               'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_MIN_SIZE': 16,
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 128,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'BN_TRAIN': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': False,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_MIN_SIZE': 8,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
           'STEPSIZE': [30000],
           'SUMMARY_INTERVAL': 180,
           'TRIM_HEIGHT': 600,
           'TRIM_WIDTH': 600,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/cyb/user/pycharm/src/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 4224 images...
after filtering, there are 4224 images...
4224 roidb entries
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  cls_prob = F.softmax(cls_score)
[session 1][epoch  1][iter    0] loss: 233749.3594, lr: 4.00e-03
			fg/bg=(24/1000), time cost: 6.419112
			rpn_cls: 179158.4219, rpn_box: 41295.5859, rcnn_cls: 9535.8477, rcnn_box 3759.5171
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMath.cu line=267 error=77 : an illegal memory access was encountered
an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered, at line 147
CUDA Error: an illegal memory access was encountered, at line 154
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
    rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 50, in forward
    rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py", line 78, in forward
    im_info, cfg_key))
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_layer.py", line 148, in forward
    keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_wrapper.py", line 18, in nms
    return nms_gpu(dets, thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_gpu.py", line 11, in nms_gpu
    keep = keep[:num_out[0]]
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36

Process finished with exit code 1

I have two Titan K40 cards. However, this is an illegal memory access and not an out-of-memory error, so I wonder where it comes from.

About this issue

State: closed
Created 6 years ago
Comments: 19 (5 by maintainers)

Most upvoted comments

The NaN was fixed when I set the MAX_NUM_GT_BOXES to the correct value.
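
For anyone hitting the same thing, here is a minimal sketch of how the value could be derived from the annotations instead of guessed. It assumes standard Pascal VOC XML files with one <object> element per box. The helper names (count_objects, max_gt_boxes) and the annotation directory are hypothetical and not part of faster-rcnn.pytorch:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_objects(xml_text):
    """Count the <object> entries in one Pascal VOC annotation string."""
    root = ET.fromstring(xml_text)
    return len(root.findall("object"))


def max_gt_boxes(annotation_dir):
    """Return the largest per-image object count found in a directory of VOC XML files.

    annotation_dir is an assumed layout: a flat folder of *.xml files.
    """
    counts = (
        count_objects(path.read_text(encoding="utf-8"))
        for path in Path(annotation_dir).glob("*.xml")
    )
    # default=0 covers an empty directory
    return max(counts, default=0)
```

Calling max_gt_boxes on the Annotations folder gives a lower bound for MAX_NUM_GT_BOXES; how the setting is consumed inside the repo is not shown here.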

CodeJjang on Mar 24, 2018

@jwyang

[session 1][epoch 14][iter  500] loss: 0.0917, lr: 4.00e-04
			fg/bg=(96/416), time cost: 168.633036
			rpn_cls: 0.0002, rpn_box: 0.0006, rcnn_cls: 0.0357, rcnn_box 0.0256
[session 1][epoch 14][iter  600] loss: 0.0865, lr: 4.00e-04
			fg/bg=(90/422), time cost: 168.349854
			rpn_cls: 0.0015, rpn_box: 0.0009, rcnn_cls: 0.0078, rcnn_box 0.0114
[session 1][epoch 14][iter  700] loss: 0.0794, lr: 4.00e-04
			fg/bg=(118/394), time cost: 168.221013
			rpn_cls: 0.0028, rpn_box: 0.0021, rcnn_cls: 0.0474, rcnn_box 0.0575

However, the network seems to learn the small vehicle class better than the large vehicle class, and for some reason it cannot learn the solar panel class (whose objects are quite small) at all:

AP for large vehicle = 0.7531
AP for small vehicle = 0.8962
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:204: RuntimeWarning: invalid value encountered in true_divide
  rec = tp / float(npos)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:45: RuntimeWarning: invalid value encountered in greater_equal
  if np.sum(rec >= t) == 0:
AP for solar panel = 0.0000
Mean AP = 0.5498
~~~~~~~~
Results:
0.753
0.896
0.000
0.550
~~~~~~~~

Any idea how to improve this, or why it fails on the solar panel class? It also throws some warnings in the AP calculation.
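
One possible reading of those warnings, sketched under the assumption that npos (the number of ground-truth boxes for the class) is 0 while tp is a non-empty array of zeros: the expression from the log, rec = tp / float(npos), then evaluates 0/0 element-wise and produces NaN. The recall_curve helper is hypothetical and only mirrors that one expression, it is not the code from voc_eval.py:

```python
import numpy as np


def recall_curve(tp, npos):
    """Mirror of `rec = tp / float(npos)` with the NumPy warnings silenced."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return tp / float(npos)


# Three detections, none of them a true positive, and no ground truth at all.
tp = np.cumsum(np.array([0.0, 0.0, 0.0]))
rec = recall_curve(tp, npos=0)

# Every recall value is NaN, and NaN never satisfies `rec >= t`.
all_nan = bool(np.isnan(rec).all())
hits_above_half = int(np.sum(rec >= 0.5))
```

Under that assumption no recall threshold is ever reached, which would be consistent with an AP of 0.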

Edit: I just found out why it learns small vehicles better: they appear far more often than large vehicles. I also apparently filtered solar panels out of my training set by mistake, which is why its AP is 0.
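
A quick per-class count over the training annotations would show both the imbalance and the missing class. This is a minimal sketch, assuming Pascal VOC XML where each <object> has a <name> child. class_counts is a hypothetical helper, and the sample annotations are made up for illustration:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def class_counts(xml_texts):
    """Count object instances per class name across VOC annotation strings."""
    counts = Counter()
    for text in xml_texts:
        root = ET.fromstring(text)
        for obj in root.findall("object"):
            name = obj.findtext("name", default="").strip()
            counts[name] += 1
    return counts


# Made-up annotations, only to show the shape of the result.
samples = [
    "<annotation>"
    "<object><name>small vehicle</name></object>"
    "<object><name>small vehicle</name></object>"
    "</annotation>",
    "<annotation><object><name>large vehicle</name></object></annotation>",
]
counts = class_counts(samples)
```

A class that was filtered out simply never appears as a key, so counts["solar panel"] is 0 for the sample above.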

CodeJjang on Mar 23, 2018