mmdetection: [Bug] CUDA out of memory in RTMDet-Ins on custom dataset with >100 ground truths per image

Prerequisite

Task

I have modified the scripts/configs, or I’m working on my own tasks/models/datasets.

Branch

3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x

Environment

mira@Dell-Precision:/mmdetection$ python3 mmdet/utils/collect_env.py
sys.platform: linux
Python: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2060
GPU 1: NVIDIA T400
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.14.1+cu116
OpenCV: 4.6.0
MMEngine: 0.3.0
MMDetection: 3.0.0rc4+7185b5a

Additional installation/environment information

Installed inside a Docker container based on the example Dockerfile, except that it pulls dev-3.x because I started working on this before it was merged into 3.x. I verified that there have not been any changes to the specific code paths relevant to the OOM error.

RUN apt-get update \
    && apt-get install --no-install-recommends -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install MMEngine and MMCV
RUN pip install openmim && \
    mim install "mmengine==0.3.0" "mmcv>=2.0.0rc1"

# Install MMDetection
RUN git clone https://github.com/open-mmlab/mmdetection.git -b dev-3.x /mmdetection \
    && cd /mmdetection \
    && pip install --no-cache-dir -e .

Reproduces the problem - code sample

Config file used for training (classes and metainfo hidden for privacy):

# rtmdet-ins_tiny_1xb2-200e.py
_base_ = "/mmdetection/configs/rtmdet/rtmdet-ins_tiny_8xb32-300e_coco.py"

checkpoint = (
    "https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth"  # noqa
)

data_root = "/home/mira/RTMDet/rtmdet_ins_data/"

model = dict(bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96))

train_pipeline_stage2 = [
    dict(
        type='LoadImageFromFile',
        file_client_args={{_base_.file_client_args}}),
    dict(
        type='LoadAnnotations',
        with_bbox=True,
        with_mask=True,
        poly2mask=False),
    dict(
        type='RandomResize',
        scale=(1280, 720),
        ratio_range=(0.5, 2.0),
        keep_ratio=True),
    dict(
        type='RandomCrop',
        crop_size=(640, 480),
        recompute_bbox=True,
        allow_negative_crop=True),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='Pad', size=(640, 480), pad_val=dict(img=(114, 114, 114))),
    dict(type='PackDetInputs')
]

log_interval = 20
val_epoch_interval = 10
max_epochs = 200
stage2_num_epochs = 10
base_lr = 0.004

train_cfg = dict(
    max_epochs=max_epochs, val_interval=val_epoch_interval, dynamic_intervals=[(max_epochs - stage2_num_epochs, 1)]
)

train_dataloader = dict(
    batch_size=2,
    dataset=dict(
        metainfo=metainfo,
        data_root=data_root,
        ann_file="coco_labels/train_annotations2023.json",
        data_prefix=dict(img="train_images/"),
    ),
)

val_dataloader = dict(
    dataset=dict(
        ann_file="coco_labels/val_annotations2023.json",
        metainfo=metainfo,
        data_root=data_root,
        data_prefix=dict(img="val_images/"),
    )
)

test_dataloader = dict(
    dataset=dict(
        ann_file="coco_labels/test_annotations2023.json",
        metainfo=metainfo,
        data_root=data_root,
        data_prefix=dict(img="test_images/"),
    )
)

val_evaluator = dict(ann_file=data_root + "coco_labels/val_annotations2023.json")
test_evaluator = dict(ann_file=data_root + "coco_labels/test_annotations2023.json")

param_scheduler = [
    dict(
        # use cosine lr from max_epochs // 2 to max_epochs (epoch 100 to 200 here)
        type="CosineAnnealingLR",
        eta_min=base_lr * 0.05,
        begin=max_epochs // 2,
        end=max_epochs,
        T_max=max_epochs // 2,
        by_epoch=True,
        convert_to_iter_based=True,
    )
]
default_hooks = dict(
    logger=dict(type="LoggerHook", interval=log_interval),
    checkpoint=dict(interval=val_epoch_interval, max_keep_ckpts=3),
)  # only keep latest 3 checkpoints
custom_hooks = [
    dict(type="EMAHook", ema_type="ExpMomentumEMA", momentum=0.0002, update_buffers=True, priority=49),
    dict(type="PipelineSwitchHook", switch_epoch=max_epochs - stage2_num_epochs, switch_pipeline=train_pipeline_stage2),
]



Reproduces the problem - command or script

mira@Dell-Precision:/mmdetection$ python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs --resume ~/RTMDet/Exp1_logs/epoch_30.pth

Reproduces the problem - error message

01/11 17:31:41 - mmengine - INFO - Epoch(train) [31][20/41]  lr: 4.0000e-03  eta: 2:26:33  time: 1.3264  data_time: 0.0113  memory: 4257  loss: 1.4234  loss_cls: 0.3436  loss_bbox: 0.5839  loss_mask: 0.4959
01/11 17:32:28 - mmengine - INFO - Epoch(train) [31][40/41]  lr: 4.0000e-03  eta: 3:29:35  time: 1.7126  data_time: 0.0127  memory: 7018  loss: 1.4461  loss_cls: 0.3816  loss_bbox: 0.5821  loss_mask: 0.4823
01/11 17:32:30 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-200e_20230111_173110
01/11 17:33:31 - mmengine - INFO - Epoch(train) [32][20/41]  lr: 4.0000e-03  eta: 4:15:49  time: 2.4690  data_time: 0.0107  memory: 7136  loss: 1.4545  loss_cls: 0.3468  loss_bbox: 0.6041  loss_mask: 0.5036
Traceback (most recent call last):
  File "tools/train.py", line 130, in <module>
    main()
  File "tools/train.py", line 126, in main
    runner.train()
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/runner.py", line 1661, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 90, in run
    self.run_epoch()
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 106, in run_epoch
    self.run_iter(idx, data_batch)
  File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 122, in run_iter
    outputs = self.runner.model.train_step(
  File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 320, in _run_forward
    results = self(**data, mode=mode)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mmdetection/mmdet/models/detectors/base.py", line 92, in forward
    return self.loss(inputs, data_samples)
  File "/mmdetection/mmdet/models/detectors/single_stage.py", line 78, in loss
    losses = self.bbox_head.loss(x, batch_data_samples)
  File "/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 748, in loss_by_feat
    loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
  File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 653, in loss_mask_by_feat
    loss_mask = self.loss_mask(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mmdetection/mmdet/models/losses/dice_loss.py", line 137, in forward
    loss = self.loss_weight * dice_loss(
  File "/mmdetection/mmdet/models/losses/dice_loss.py", line 47, in dice_loss
    a = torch.sum(input * target, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 478.00 MiB (GPU 0; 11.75 GiB total capacity; 8.98 GiB already allocated; 77.44 MiB free; 10.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additional information

Expected Result

Training without an OOM error.

Dataset

Custom instance-segmentation dataset of 100 images with >200 polygons per image. Image resolution: 1280 x 720.

Hardware

NVIDIA RTX 2060. Also tried training on an NVIDIA RTX 3080. Both have 12 GB of GPU memory.

Additional description/information

Based on reading the FAQ and looking through issues #188 and [#1581](https://github.com/open-mmlab/mmdetection/issues/1581), and given the high number of ground truths per image, I assumed that the problem was that gpu_assign_thr needed to be set so that the assignment computation takes place on the CPU instead of the GPU.

However, RTMDet uses DynamicSoftLabelAssigner rather than MaxIoUAssigner, and DynamicSoftLabelAssigner does not expose a configurable gpu_assign_thr parameter. Switching the assigner to MaxIoUAssigner in the config as shown below

model = dict(
    bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            match_low_quality=False,
            ignore_iof_thr=-1,
            gpu_assign_thr=5),
        allowed_border=-1,
        pos_weight=-1,
        debug=False))

resulted in the following output:

01/11 20:00:57 - mmengine - INFO - load backbone. in model from: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
http loads checkpoint from path: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
01/11 20:00:57 - mmengine - INFO - Checkpoints will be saved to /home/mira/RTMDet/Exp19_maxiou_logs.
/usr/local/lib/python3.8/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
01/11 20:01:06 - mmengine - INFO - Epoch(train) [1][20/41]  lr: 4.0000e-03  eta: 0:00:27  time: 0.4417  data_time: 0.0270  memory: 1072  loss: 0.7659  loss_cls: 0.2922  loss_bbox: 0.1749  loss_mask: 0.2988
01/11 20:01:19 - mmengine - INFO - Epoch(train) [1][40/41]  lr: 4.0000e-03  eta: 0:00:23  time: 0.5636  data_time: 0.0140  memory: 1735  loss: 0.9751  loss_cls: 0.3478  loss_bbox: 0.2383  loss_mask: 0.3890
01/11 20:01:20 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-2e_20230111_200052
01/11 20:01:20 - mmengine - INFO - Saving checkpoint at 1 epochs
01/11 20:01:22 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:22 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:22 - mmengine - INFO - Epoch(val) [1][1/1]  
01/11 20:01:22 - mmengine - INFO - Switch pipeline now!
01/11 20:01:27 - mmengine - INFO - Epoch(train) [2][20/41]  lr: 2.3179e-03  eta: 0:00:09  time: 0.4578  data_time: 0.0163  memory: 821  loss: 0.6599  loss_cls: 0.2235  loss_bbox: 0.1650  loss_mask: 0.2714
01/11 20:01:32 - mmengine - INFO - Epoch(train) [2][40/41]  lr: 2.2227e-04  eta: 0:00:00  time: 0.3217  data_time: 0.0164  memory: 800  loss: 0.3189  loss_cls: 0.0869  loss_bbox: 0.0804  loss_mask: 0.1516
01/11 20:01:32 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-1e_tomato_20230111_200052
01/11 20:01:32 - mmengine - INFO - Saving checkpoint at 2 epochs
01/11 20:01:34 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:34 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:34 - mmengine - INFO - Epoch(val) [2][1/1]

However, switching to MaxIoUAssigner did not lead to an OOM error over multiple epochs, which leads me to believe the problem is the high number of polygons. But inference with the trained model outputs no predictions and, as shown in the log above, an error is raised saying that the testing results of the whole dataset are empty. Reading through the issues (#9381), this is sometimes attributed to an incorrect ground-truth label format, but since the data has not changed, that doesn't seem plausible here.

To summarize:

  1. Apart from increasing GPU memory, are there any other solutions to this problem?
  2. Is there a way to pass gpu_assign_thr to DynamicSoftLabelAssigner?
  3. Why does using MaxIoUAssigner with RTMDet result in no inference results? This seems like a bug.
  4. The with_cp argument (suggested in the FAQ for OOM issues) does not exist in CSPNeXt, which is the backbone for RTMDet. Are there plans to add it?
  5. The documentation for RTMDet seems to lack information on how to train with FP16, if that is an option as suggested in the FAQ. Please advise.

I’m not sure if this is entirely a bug or a feature request but it seems to be a bit of both.


Most upvoted comments

For those who got the OOM error during validation, I've found one problem. During validation, the inference input goes through the val_pipeline, which contains 'Resize', so inference itself is fine. But in post-processing, the output mask is interpolated back to the original image size, and then sigmoid and thresholding are run to get the mask output. Refer to the code snippet below: https://github.com/open-mmlab/mmdetection/blob/f78af7785ada87f1ced75a2313746e4ba3149760/mmdet/models/dense_heads/rtmdet_ins_head.py#L498-L510

This can be extremely memory-intensive if your original image has a large resolution. For example, a 4000x3000 image will produce a mask output tensor of 100x4000x3000, which costs over 4 GB of memory (and that is just a single tensor; there can be several temporary tensors of the same size).

I haven't found an effective solution yet. If you set the 'rescale' parameter to False, the output mask won't be scaled to match the original image size, but that leads to wrong metric calculation. I tried putting the sigmoid before the interpolation, which does save some memory but not much. I think one solution would be to set 'rescale' to False and, when calculating validation metrics, resize the original image to match the output mask size.
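
As a rough back-of-the-envelope check of that figure (a sketch, assuming float32 masks and the 100 kept detections from the 100x4000x3000 example above):

# Rough estimate of the memory held by the rescaled mask tensor alone
num_masks, height, width = 100, 3000, 4000  # e.g. a 4000x3000 input image
bytes_per_element = 4                       # float32
size_gib = num_masks * height * width * bytes_per_element / 1024**3
print(f'single mask tensor: {size_gib:.2f} GiB')  # ~4.47 GiB, before any temporaries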

I wrote my own version of gpu_assign_thr in DynamicSoftLabelAssigner; it solves the out-of-memory error during training, as the computations now happen on the CPU and the results are moved back to the GPU at the end.


# NOTE: one possible set of imports, mirroring mmdet 3.x's
# dynamic_soft_label_assigner.py (INF, EPS and center_of_mass are defined
# in that same file); adjust the paths if this class lives elsewhere.
from typing import Optional, Tuple

import torch
import torch.nn.functional as F
from mmengine.structures import InstanceData
from torch import Tensor

from mmdet.registry import TASK_UTILS
from mmdet.structures.bbox import BaseBoxes
from mmdet.utils import ConfigType
from mmdet.models.task_modules.assigners import AssignResult, BaseAssigner


@TASK_UTILS.register_module()
class DynamicSoftLabelAssigner(BaseAssigner):
    """Computes matching between predictions and ground truth with dynamic soft
    label assignment.

    Args:
        soft_center_radius (float): Radius of the soft center prior.
            Defaults to 3.0.
        topk (int): Select top-k predictions to calculate dynamic k
            best matches for each gt. Defaults to 13.
        iou_weight (float): The scale factor of iou cost. Defaults to 3.0.
        iou_calculator (ConfigType): Config of overlaps Calculator.
            Defaults to dict(type='BboxOverlaps2D').
    """

    def __init__(
            self,
            soft_center_radius: float = 3.0,
            topk: int = 13,
            iou_weight: float = 3.0,
            gpu_assign_thr: float = -1,
            iou_calculator: ConfigType = dict(type='BboxOverlaps2D')):

        self.soft_center_radius = soft_center_radius
        self.topk = topk
        self.iou_weight = iou_weight
        # ic(gpu_assign_thr)
        self.gpu_assign_thr = gpu_assign_thr
        self.iou_calculator = TASK_UTILS.build(iou_calculator)

    def assign(self,
               pred_instances: InstanceData,
               gt_instances: InstanceData,
               gt_instances_ignore: Optional[InstanceData] = None,
               **kwargs) -> AssignResult:
        """Assign gt to priors.

        Args:
            pred_instances (:obj:`InstanceData`): Instances of model
                predictions. It includes ``priors``, and the priors can
                be anchors or points, or the bboxes predicted by the
                previous stage, has shape (n, 4). The bboxes predicted by
                the current model or stage will be named ``bboxes``,
                ``labels``, and ``scores``, the same as the ``InstanceData``
                in other places.
            gt_instances (:obj:`InstanceData`): Ground truth of instance
                annotations. It usually includes ``bboxes``, with shape (k, 4),
                and ``labels``, with shape (k, ).
            gt_instances_ignore (:obj:`InstanceData`, optional): Instances
                to be ignored during training. It includes ``bboxes``
                attribute data that is ignored during training and testing.
                Defaults to None.
        Returns:
            obj:`AssignResult`: The assigned result.
        """
        gt_bboxes = gt_instances.bboxes
        gt_labels = gt_instances.labels
        num_gt = gt_bboxes.size(0)

        decoded_bboxes = pred_instances.bboxes
        pred_scores = pred_instances.scores
        priors = pred_instances.priors
        num_bboxes = decoded_bboxes.size(0)

        # ic(gt_bboxes.shape[0])
        # ic(self.gpu_assign_thr)

        assign_on_cpu = True if (self.gpu_assign_thr > 0) and (
            gt_bboxes.shape[0] > self.gpu_assign_thr) else False

        # ic(assign_on_cpu)

        # compute overlap and assign gt on CPU when number of GT is large
        if assign_on_cpu:
            # ic('assigning on cpu')
            device = priors.device
            priors = priors.cpu()
            gt_bboxes = gt_bboxes.cpu()
            gt_labels = gt_labels.cpu()
            decoded_bboxes = decoded_bboxes.cpu()
            pred_scores = pred_scores.cpu()

            # if gt_bboxes_ignore is not None:
            #     gt_bboxes_ignore = gt_bboxes_ignore.cpu()

        # assign 0 by default
        assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
                                                   0,
                                                   dtype=torch.long)
        if num_gt == 0 or num_bboxes == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        prior_center = priors[:, :2]
        if isinstance(gt_bboxes, BaseBoxes):
            is_in_gts = gt_bboxes.find_inside_points(prior_center)
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            lt_ = prior_center[:, None] - gt_bboxes[:, :2]
            rb_ = gt_bboxes[:, 2:] - prior_center[:, None]

            deltas = torch.cat([lt_, rb_], dim=-1)
            is_in_gts = deltas.min(dim=-1).values > 0

        valid_mask = is_in_gts.sum(dim=1) > 0

        valid_decoded_bbox = decoded_bboxes[valid_mask]
        valid_pred_scores = pred_scores[valid_mask]
        num_valid = valid_decoded_bbox.size(0)

        if num_valid == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                      -1,
                                                      dtype=torch.long)
            if assign_on_cpu:
                # num_gt = num_gt.to(device)
                assigned_gt_inds = assigned_gt_inds.to(device)
                max_overlaps = max_overlaps.to(device)
                assigned_labels = assigned_labels.to(device)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)
        if hasattr(gt_instances, 'masks'):
            gt_center = center_of_mass(gt_instances.masks, eps=EPS)
        elif isinstance(gt_bboxes, BaseBoxes):
            gt_center = gt_bboxes.centers
        else:
            # Tensor boxes will be treated as horizontal boxes by defaults
            gt_center = (gt_bboxes[:, :2] + gt_bboxes[:, 2:]) / 2.0
        valid_prior = priors[valid_mask]
        strides = valid_prior[:, 2]
        distance = (valid_prior[:, None, :2] - gt_center[None, :, :]
                    ).pow(2).sum(-1).sqrt() / strides[:, None]
        soft_center_prior = torch.pow(10, distance - self.soft_center_radius)

        pairwise_ious = self.iou_calculator(valid_decoded_bbox, gt_bboxes)
        iou_cost = -torch.log(pairwise_ious + EPS) * self.iou_weight

        gt_onehot_label = (
            F.one_hot(gt_labels.to(torch.int64),
                      pred_scores.shape[-1]).float().unsqueeze(0).repeat(
                          num_valid, 1, 1))
        valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)

        soft_label = gt_onehot_label * pairwise_ious[..., None]
        scale_factor = soft_label - valid_pred_scores.sigmoid()
        soft_cls_cost = F.binary_cross_entropy_with_logits(
            valid_pred_scores, soft_label,
            reduction='none') * scale_factor.abs().pow(2.0)
        soft_cls_cost = soft_cls_cost.sum(dim=-1)

        cost_matrix = soft_cls_cost + iou_cost + soft_center_prior

        matched_pred_ious, matched_gt_inds = self.dynamic_k_matching(
            cost_matrix, pairwise_ious, num_gt, valid_mask)

        # convert to AssignResult format
        assigned_gt_inds[valid_mask] = matched_gt_inds + 1
        assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
        assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
        max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
                                                 -INF,
                                                 dtype=torch.float32)
        max_overlaps[valid_mask] = matched_pred_ious

        if assign_on_cpu:
            # num_gt = num_gt.to(device)
            assigned_gt_inds = assigned_gt_inds.to(device)
            max_overlaps = max_overlaps.to(device)
            assigned_labels = assigned_labels.to(device)
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

    def dynamic_k_matching(self, cost: Tensor, pairwise_ious: Tensor,
                           num_gt: int,
                           valid_mask: Tensor) -> Tuple[Tensor, Tensor]:
        """Use IoU and matching cost to calculate the dynamic top-k positive
        targets. Same as SimOTA.

        Args:
            cost (Tensor): Cost matrix.
            pairwise_ious (Tensor): Pairwise iou matrix.
            num_gt (int): Number of gt.
            valid_mask (Tensor): Mask for valid bboxes.

        Returns:
            tuple: matched ious and gt indexes.
        """
        matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
        # select candidate topk ious for dynamic-k calculation
        candidate_topk = min(self.topk, pairwise_ious.size(0))
        topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
        # calculate dynamic k for each gt
        dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(
                cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
            matching_matrix[:, gt_idx][pos_idx] = 1

        del topk_ious, dynamic_ks, pos_idx

        prior_match_gt_mask = matching_matrix.sum(1) > 1
        if prior_match_gt_mask.sum() > 0:
            cost_min, cost_argmin = torch.min(
                cost[prior_match_gt_mask, :], dim=1)
            matching_matrix[prior_match_gt_mask, :] *= 0
            matching_matrix[prior_match_gt_mask, cost_argmin] = 1
        # get foreground mask inside box and center prior
        fg_mask_inboxes = matching_matrix.sum(1) > 0
        valid_mask[valid_mask.clone()] = fg_mask_inboxes

        matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
        matched_pred_ious = (matching_matrix *
                             pairwise_ious).sum(1)[fg_mask_inboxes]
        return matched_pred_ious, matched_gt_inds
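
For reference, a minimal sketch of how the patched assigner could be selected from a config (assuming the class above replaces the built-in DynamicSoftLabelAssigner inside mmdet, or is registered under a different name / with force=True if it lives outside mmdet; the threshold of 100 is only an illustrative value):

# Hypothetical config override: route assignment to the CPU whenever an
# image has more than gpu_assign_thr ground truths.
model = dict(
    train_cfg=dict(
        assigner=dict(
            type='DynamicSoftLabelAssigner', topk=13, gpu_assign_thr=100)))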

I have the same CUDA out-of-memory error during validation (training finishes fine, with only ~30% of memory occupied) in a single-GPU setting.

Try adding '--amp' to enable FP16 training.
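For example, adapting the training command from above (this assumes the --amp flag of tools/train.py on the 3.x branch):

mira@Dell-Precision:/mmdetection$ python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs --amp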

This bug also exists in mmyolo.

@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn’t seem to make a difference for me.

@qwert31639 I added @torch.no_grad() but it also doesn’t seem to make much of a difference.

I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh, and I notice that one GPU stays around the 14 GB mark while the other maxes out at 23 GB and causes the error. Does this have something to do with how PyTorch handles distributed training, or with how mmdetection handles it?

+1 same problem with fp16 training, always OOM in the middle of training.

I have also noticed this increasing memory usage during the first and second epochs.

Thanks for your bug report! We are working on optimizing the memory footprint of RTMDet.