mmdetection: Out of memory when training on custom dataset

I was trying to train a RetinaNet model on a custom dataset (e.g. WIDER FACE) and I’ve encountered a consistent out-of-memory issue after several hundred (or thousand) iterations. The error messages look like the following:

  File "/root/mmdetection/mmdet/core/bbox/assigners/max_iou_assigner.py", line 72, in assign
    overlaps = bbox_overlaps(bboxes, gt_bboxes)
  File "/root/mmdetection/mmdet/core/bbox/geometry.py", line 59, in bbox_overlaps
    ious = overlap / (area1[:, None] + area2 - overlap)
RuntimeError: CUDA error: out of memory

or

  File "/root/mmdetection/mmdet/core/bbox/assigners/max_iou_assigner.py", line 72, in assign
    overlaps = bbox_overlaps(bboxes, gt_bboxes)
  File "/root/mmdetection/mmdet/core/bbox/geometry.py", line 51, in bbox_overlaps
    wh = (rb - lt + 1).clamp(min=0)  # [rows, cols, 2]
RuntimeError: CUDA error: out of memory

The training machine has 2 x 1080 Ti and the OOM happens whether imgs_per_gpu is 2 or 1. In fact, even though the initial memory consumption drops to around 3~4G when imgs_per_gpu=1, it quickly grows to around 8G and fluctuates for a while until it finally runs out of memory. I’ve tried both distributed and non-distributed training, and both suffer from OOM.
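For reference, this is roughly where that setting lives in an mmdetection v0.x-style config; a hedged sketch only, not the exact config used here, and workers_per_gpu is an assumed value:

    # Sketch of the relevant part of an mmdetection v0.x config.
    # imgs_per_gpu is the per-GPU batch size; the train/val/test
    # entries of the data dict are omitted.
    data = dict(
        imgs_per_gpu=1,     # dropping from 2 to 1 lowers the initial footprint to ~3-4G
        workers_per_gpu=2,  # assumed value, not taken from the report above
    )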

BTW, the machine has the following software configurations:

  • PyTorch = 0.4.1
  • CUDA = 9
  • cuDNN = 7

Is this normal?

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 40 (13 by maintainers)

Most upvoted comments

Well, in fact I can train on the same dataset in TensorFlow and Detectron without any problems, even without filtering out those samples. So I suspect there are memory leak issues in PyTorch, and perhaps some parts of the implementation in this project make the problem worse.

@hellock I got the same problem when training WIDER FACE, which has many gt boxes per image. I see that you compute anchor_targets purely on the GPU. As the number of gt boxes grows, the anchor-to-gt matching operation puts a heavy load on the GPU. Here is the code…

        lt = torch.max(bboxes1[:, None, :2], bboxes2[:, :2])  # [rows, cols, 2]
        rb = torch.min(bboxes1[:, None, 2:], bboxes2[:, 2:])  # [rows, cols, 2]

        wh = (rb - lt + 1).clamp(min=0)  # [rows, cols, 2]
        overlap = wh[:, :, 0] * wh[:, :, 1]
        area1 = (bboxes1[:, 2] - bboxes1[:, 0] + 1) * (
            bboxes1[:, 3] - bboxes1[:, 1] + 1)

        if mode == 'iou':
            area2 = (bboxes2[:, 2] - bboxes2[:, 0] + 1) * (
                bboxes2[:, 3] - bboxes2[:, 1] + 1)
            ious = overlap / (area1[:, None] + area2 - overlap)
        else:
            ious = overlap / (area1[:, None])

Then I changed the implementation to loop over the gt boxes. It works well with non-distributed training: in my test it stably occupies about 8G of memory with no extra training time, whereas your code runs into CUDA out of memory. However, when I use distributed training, I get the unbalanced memory distribution problem mentioned by @hedes1992, and in the end it still comes up with CUDA out of memory… So… what’s the problem…?? When will you fix this…??
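A minimal sketch of that looping idea: compute the overlaps in chunks over the gt boxes so the broadcasted [rows, cols, 2] intermediates stay bounded. The function name, chunk size, and IoU-only behaviour are illustrative assumptions, not the commenter’s actual patch:

    import torch

    def bbox_overlaps_chunked(bboxes1, bboxes2, chunk_size=64):
        """IoU between bboxes1 (anchors) and bboxes2 (gt boxes),
        computed chunk-by-chunk over the gt boxes to cap peak memory."""
        rows, cols = bboxes1.size(0), bboxes2.size(0)
        ious = bboxes1.new_zeros((rows, cols))
        area1 = (bboxes1[:, 2] - bboxes1[:, 0] + 1) * (
            bboxes1[:, 3] - bboxes1[:, 1] + 1)
        for start in range(0, cols, chunk_size):
            gt = bboxes2[start:start + chunk_size]
            lt = torch.max(bboxes1[:, None, :2], gt[:, :2])  # [rows, chunk, 2]
            rb = torch.min(bboxes1[:, None, 2:], gt[:, 2:])  # [rows, chunk, 2]
            wh = (rb - lt + 1).clamp(min=0)
            overlap = wh[:, :, 0] * wh[:, :, 1]
            area2 = (gt[:, 2] - gt[:, 0] + 1) * (gt[:, 3] - gt[:, 1] + 1)
            ious[:, start:start + chunk_size] = overlap / (
                area1[:, None] + area2 - overlap)
        return ious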

@hellock So do you think it would be worth making the box operations configurable, so that people who want to train on custom datasets with a large number of gt bboxes can switch them to the CPU? It is likely that there are datasets where most samples have many gt bboxes, so simply removing those samples won’t work.

@hellock You’re right. After filtering out samples with a very large number of gt bboxes, the model can be trained successfully (20 epochs in total). I’ve limited the number of samples to 1 per GPU, as using 2 samples increases memory consumption to nearly 10G in the first few hundred iterations. Not sure whether this is also the case for the other custom datasets people have mentioned. I would like to keep this issue open for a few more days so that people in the loop can still post their results with the fixes we’ve talked about.
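A minimal sketch of that kind of filtering, assuming the annotations are a list of dicts with a 'bboxes' array as in a typical custom-dataset annotation file; the helper name and the threshold of 300 are arbitrary illustrative choices, not part of mmdetection:

    def filter_by_num_gts(annotations, max_gts=300):
        """Drop samples whose ground-truth box count exceeds max_gts."""
        kept = [ann for ann in annotations if len(ann['bboxes']) <= max_gts]
        print('kept {}/{} samples'.format(len(kept), len(annotations)))
        return kept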

@pkdogcom We conduct all box operations on the GPU, so memory usage will increase when the number of gt bboxes is large, though there may also be some hidden bugs. A workaround is to move some operations (IoU computation, etc.) to the CPU and then copy the results back to the GPU.
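A rough sketch of that workaround, assuming bbox_overlaps accepts CPU tensors; the wrapper name is made up for illustration:

    from mmdet.core.bbox.geometry import bbox_overlaps

    def bbox_overlaps_on_cpu(bboxes, gt_bboxes):
        """Compute IoUs on the CPU, then move the result back to the
        original device, trading speed for peak GPU memory."""
        device = bboxes.device
        ious = bbox_overlaps(bboxes.cpu(), gt_bboxes.cpu())
        return ious.to(device)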