mmdetection: Out of memory when training on custom dataset

I was trying to train a RetinaNet model on a custom dataset (e.g. WIDER FACE) and I’ve encountered a consistent out-of-memory issue after several hundred (or thousand) iterations. The error messages look like the following:

  File "/root/mmdetection/mmdet/core/bbox/assigners/max_iou_assigner.py", line 72, in assign
    overlaps = bbox_overlaps(bboxes, gt_bboxes)
  File "/root/mmdetection/mmdet/core/bbox/geometry.py", line 59, in bbox_overlaps
    ious = overlap / (area1[:, None] + area2 - overlap)
RuntimeError: CUDA error: out of memory

or

  File "/root/mmdetection/mmdet/core/bbox/assigners/max_iou_assigner.py", line 72, in assign
    overlaps = bbox_overlaps(bboxes, gt_bboxes)
  File "/root/mmdetection/mmdet/core/bbox/geometry.py", line 51, in bbox_overlaps
    wh = (rb - lt + 1).clamp(min=0)  # [rows, cols, 2]
RuntimeError: CUDA error: out of memory

The training machine has 2 x 1080 Ti and the OOM happens whether imgs_per_gpu is 2 or 1. In fact, even though the initial memory consumption drops to around 3~4G when imgs_per_gpu=1, it quickly grows to around 8G and fluctuates for a while until it finally runs out of memory. I’ve tried both distributed and non-distributed training, and both suffer from OOM.
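For reference, this is roughly where that setting lives in an mmdetection v0.x-style config; a hedged sketch only, not the exact config used here, and workers_per_gpu is an assumed value:

    # Sketch of the relevant part of an mmdetection v0.x config.
    # imgs_per_gpu is the per-GPU batch size; the train/val/test
    # entries of the data dict are omitted.
    data = dict(
        imgs_per_gpu=1,     # dropping from 2 to 1 lowers the initial footprint to ~3-4G
        workers_per_gpu=2,  # assumed value, not taken from the report above
    )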

BTW, the machine has the following software configurations:

  • PyTorch = 0.4.1
  • CUDA = 9
  • cuDNN = 7

Is this normal?

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 7
  • Comments: 40 (13 by maintainers)

Most upvoted comments

Well, in fact I can train on the same dataset in TensorFlow and Detectron without any problems, even without filtering out those samples. So I suspect there are memory leak issues in PyTorch, and perhaps some parts of the implementation in this project make the problem worse.

@hellock I got the same problem when training WIDER FACE, which has many gt boxes per image. I see that you compute anchor_targets purely on the GPU. As the number of gt boxes grows, the anchor-to-gt matching operation puts a heavy load on the GPU. Here is the code…

        lt = torch.max(bboxes1[:, None, :2], bboxes2[:, :2])  # [rows, cols, 2]
        rb = torch.min(bboxes1[:, None, 2:], bboxes2[:, 2:])  # [rows, cols, 2]

        wh = (rb - lt + 1).clamp(min=0)  # [rows, cols, 2]
        overlap = wh[:, :, 0] * wh[:, :, 1]
        area1 = (bboxes1[:, 2] - bboxes1[:, 0] + 1) * (
            bboxes1[:, 3] - bboxes1[:, 1] + 1)

        if mode == 'iou':
            area2 = (bboxes2[:, 2] - bboxes2[:, 0] + 1) * (
                bboxes2[:, 3] - bboxes2[:, 1] + 1)
            ious = overlap / (area1[:, None] + area2 - overlap)
        else:
            ious = overlap / (area1[:, None])

Then I changed the implementation to loop over the gt boxes. It works well with non-distributed training: in my test it stably occupies about 8G of memory with no extra training time, whereas your code runs into CUDA out of memory. However, when I use distributed training, I get the unbalanced memory distribution problem mentioned by @hedes1992, and in the end it still comes up with CUDA out of memory… So… what’s the problem…?? When will you fix this…??
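A minimal sketch of that looping idea: compute the overlaps in chunks over the gt boxes so the broadcasted [rows, cols, 2] intermediates stay bounded. The function name, chunk size, and IoU-only behaviour are illustrative assumptions, not the commenter’s actual patch:

    import torch

    def bbox_overlaps_chunked(bboxes1, bboxes2, chunk_size=64):
        """IoU between bboxes1 (anchors) and bboxes2 (gt boxes),
        computed chunk-by-chunk over the gt boxes to cap peak memory."""
        rows, cols = bboxes1.size(0), bboxes2.size(0)
        ious = bboxes1.new_zeros((rows, cols))
        area1 = (bboxes1[:, 2] - bboxes1[:, 0] + 1) * (
            bboxes1[:, 3] - bboxes1[:, 1] + 1)
        for start in range(0, cols, chunk_size):
            gt = bboxes2[start:start + chunk_size]
            lt = torch.max(bboxes1[:, None, :2], gt[:, :2])  # [rows, chunk, 2]
            rb = torch.min(bboxes1[:, None, 2:], gt[:, 2:])  # [rows, chunk, 2]
            wh = (rb - lt + 1).clamp(min=0)
            overlap = wh[:, :, 0] * wh[:, :, 1]
            area2 = (gt[:, 2] - gt[:, 0] + 1) * (gt[:, 3] - gt[:, 1] + 1)
            ious[:, start:start + chunk_size] = overlap / (
                area1[:, None] + area2 - overlap)
        return ious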

@hellock So do you think it would be worth making the box operations configurable, so that people who want to train on custom datasets with a large number of gt bboxes can switch them to the CPU? It is likely that there are datasets where most samples have many gt bboxes, so simply removing those samples won’t work.

@hellock You’re right. After filtering out samples with a very large number of gt bboxes, the model can be trained successfully (20 epochs in total). I’ve limited the number of samples to 1 per GPU, as using 2 samples increases memory consumption to nearly 10G in the first few hundred iterations. Not sure whether this is also the case for the other custom datasets people have mentioned. I would like to keep this issue open for a few more days so that people in the loop can still post their results with the fixes we’ve talked about.
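A minimal sketch of that kind of filtering, assuming the annotations are a list of dicts with a 'bboxes' array as in a typical custom-dataset annotation file; the helper name and the threshold of 300 are arbitrary illustrative choices, not part of mmdetection:

    def filter_by_num_gts(annotations, max_gts=300):
        """Drop samples whose ground-truth box count exceeds max_gts."""
        kept = [ann for ann in annotations if len(ann['bboxes']) <= max_gts]
        print('kept {}/{} samples'.format(len(kept), len(annotations)))
        return kept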

@pkdogcom We conduct all box operations on the GPU, so memory usage will increase when the number of gt bboxes is large, though there may also be some hidden bugs. A workaround is to move some operations (IoU computation, etc.) to the CPU and then copy the results back to the GPU.
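A rough sketch of that workaround, assuming bbox_overlaps accepts CPU tensors; the wrapper name is made up for illustration:

    from mmdet.core.bbox.geometry import bbox_overlaps

    def bbox_overlaps_on_cpu(bboxes, gt_bboxes):
        """Compute IoUs on the CPU, then move the result back to the
        original device, trading speed for peak GPU memory."""
        device = bboxes.device
        ious = bbox_overlaps(bboxes.cpu(), gt_bboxes.cpu())
        return ious.to(device)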