mmdetection: [Bug] CUDA out of memory in RTMDet-Ins on custom dataset with > 100 ground truths per image
Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version (master) or latest version (3.x).
Task
I have modified the scripts/configs, or I’m working on my own tasks/models/datasets.
Branch
3.x branch https://github.com/open-mmlab/mmdetection/tree/3.x
Environment
mira@Dell-Precision:/mmdetection$ python3 mmdet/utils/collect_env.py
sys.platform: linux
Python: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2060
GPU 1: NVIDIA T400
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.14.1+cu116
OpenCV: 4.6.0
MMEngine: 0.3.0
MMDetection: 3.0.0rc4+7185b5a
Additional installation/environment information
Installed inside a docker container based on the example Dockerfile, except that it pulls dev-3.x because I started working on this before it was merged into 3.x. I have verified that there have not been any changes to the relevant code snippets that would help with the OOM error.
RUN apt-get update \
&& apt-get install --no-install-recommends -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install MMEngine and MMCV
RUN pip install openmim && \
mim install "mmengine==0.3.0" "mmcv>=2.0.0rc1"
# Install MMDetection
RUN git clone https://github.com/open-mmlab/mmdetection.git -b dev-3.x /mmdetection \
&& cd /mmdetection \
&& pip install --no-cache-dir -e .
Reproduces the problem - code sample
Config file run for training. Classes and meta info hidden for privacy:
# rtmdet-ins_tiny_1xb2-200e.py
_base_ = "/mmdetection/configs/rtmdet/rtmdet-ins_tiny_8xb32-300e_coco.py"
checkpoint = (
"https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth" # noqa
)
data_root = "/home/mira/RTMDet/rtmdet_ins_data/"
model = dict(bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96))
train_pipeline_stage2 = [
dict(
type='LoadImageFromFile',
file_client_args={{_base_.file_client_args}}),
dict(
type='LoadAnnotations',
with_bbox=True,
with_mask=True,
poly2mask=False),
dict(
type='RandomResize',
scale=(1280, 720),
ratio_range=(0.5, 2.0),
keep_ratio=True),
dict(
type='RandomCrop',
crop_size=(640, 480),
recompute_bbox=True,
allow_negative_crop=True),
dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1)),
dict(type='YOLOXHSVRandomAug'),
dict(type='RandomFlip', prob=0.5),
dict(type='Pad', size=(640, 480), pad_val=dict(img=(114, 114, 114))),
dict(type='PackDetInputs')
]
log_interval = 20
val_epoch_interval = 10
max_epochs = 200
stage2_num_epochs = 10
base_lr = 0.004
train_cfg = dict(
max_epochs=max_epochs, val_interval=val_epoch_interval, dynamic_intervals=[(max_epochs - stage2_num_epochs, 1)]
)
train_dataloader = dict(
batch_size=2,
dataset=dict(
metainfo=metainfo,
data_root=data_root,
ann_file="coco_labels/train_annotations2023.json",
data_prefix=dict(img="train_images/"),
),
)
val_dataloader = dict(
dataset=dict(
ann_file="coco_labels/val_annotations2023.json",
metainfo=metainfo,
data_root=data_root,
data_prefix=dict(img="val_images/"),
)
)
test_dataloader = dict(
dataset=dict(
ann_file="coco_labels/test_annotations2023.json",
metainfo=metainfo,
data_root=data_root,
data_prefix=dict(img="test_images/"),
)
)
val_evaluator = dict(ann_file=data_root + "coco_labels/val_annotations2023.json")
test_evaluator = dict(ann_file=data_root + "coco_labels/test_annotations2023.json")
param_scheduler = [
dict(
# use cosine lr from max_epochs // 2 (epoch 100) to max_epochs (epoch 200)
type="CosineAnnealingLR",
eta_min=base_lr * 0.05,
begin=max_epochs // 2,
end=max_epochs,
T_max=max_epochs // 2,
by_epoch=True,
convert_to_iter_based=True,
)
]
default_hooks = dict(
logger=dict(type="LoggerHook", interval=log_interval),
checkpoint=dict(interval=val_epoch_interval, max_keep_ckpts=3),
) # only keep latest 3 checkpoints
custom_hooks = [
dict(type="EMAHook", ema_type="ExpMomentumEMA", momentum=0.0002, update_buffers=True, priority=49),
dict(type="PipelineSwitchHook", switch_epoch=max_epochs - stage2_num_epochs, switch_pipeline=train_pipeline_stage2),
]
Reproduces the problem - command or script
mira@Dell-Precision:/mmdetection$ python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs --resume ~/RTMDet/Exp1_logs/epoch_30.pth
Reproduces the problem - error message
01/11 17:31:41 - mmengine - INFO - Epoch(train) [31][20/41] lr: 4.0000e-03 eta: 2:26:33 time: 1.3264 data_time: 0.0113 memory: 4257 loss: 1.4234 loss_cls: 0.3436 loss_bbox: 0.5839 loss_mask: 0.4959
01/11 17:32:28 - mmengine - INFO - Epoch(train) [31][40/41] lr: 4.0000e-03 eta: 3:29:35 time: 1.7126 data_time: 0.0127 memory: 7018 loss: 1.4461 loss_cls: 0.3816 loss_bbox: 0.5821 loss_mask: 0.4823
01/11 17:32:30 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-200e_20230111_173110
01/11 17:33:31 - mmengine - INFO - Epoch(train) [32][20/41] lr: 4.0000e-03 eta: 4:15:49 time: 2.4690 data_time: 0.0107 memory: 7136 loss: 1.4545 loss_cls: 0.3468 loss_bbox: 0.6041 loss_mask: 0.5036
Traceback (most recent call last):
File "tools/train.py", line 130, in <module>
main()
File "tools/train.py", line 126, in main
runner.train()
File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/runner.py", line 1661, in train
model = self.train_loop.run() # type: ignore
File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 90, in run
self.run_epoch()
File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 106, in run_epoch
self.run_iter(idx, data_batch)
File "/usr/local/lib/python3.8/dist-packages/mmengine/runner/loops.py", line 122, in run_iter
outputs = self.runner.model.train_step(
File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
losses = self._run_forward(data, mode='loss') # type: ignore
File "/usr/local/lib/python3.8/dist-packages/mmengine/model/base_model/base_model.py", line 320, in _run_forward
results = self(**data, mode=mode)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mmdetection/mmdet/models/detectors/base.py", line 92, in forward
return self.loss(inputs, data_samples)
File "/mmdetection/mmdet/models/detectors/single_stage.py", line 78, in loss
losses = self.bbox_head.loss(x, batch_data_samples)
File "/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 123, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 748, in loss_by_feat
loss_mask = self.loss_mask_by_feat(mask_feat, flatten_kernels,
File "/mmdetection/mmdet/models/dense_heads/rtmdet_ins_head.py", line 653, in loss_mask_by_feat
loss_mask = self.loss_mask(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/mmdetection/mmdet/models/losses/dice_loss.py", line 137, in forward
loss = self.loss_weight * dice_loss(
File "/mmdetection/mmdet/models/losses/dice_loss.py", line 47, in dice_loss
a = torch.sum(input * target, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 478.00 MiB (GPU 0; 11.75 GiB total capacity; 8.98 GiB already allocated; 77.44 MiB free; 10.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
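For reference, the max_split_size_mb hint at the end of the message is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching training; the value below is only an illustration, not a setting verified in this report:
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs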
Additional information
Expected Result
Training without an OOM error.
Dataset
Custom instance segmentation dataset of 100 images with >200 polygons per image. Image resolution: 1280 x 720.
Hardware
NVIDIA RTX 2060. I also tried training on an NVIDIA RTX 3080. Both have 12 GB of GPU memory.
Additional description/information
Based on reading the FAQ and looking through issues #188 and [#1581](https://github.com/open-mmlab/mmdetection/issues/1581), and given the high number of ground truths per image, I assumed the problem was that gpu_assign_thr needed to be set so that the assignment computation takes place on the CPU instead of the GPU.
However, RTMDet uses DynamicSoftLabelAssigner rather than MaxIoUAssigner, and DynamicSoftLabelAssigner does not expose a configurable gpu_assign_thr parameter.
Switching the assigner to MaxIoUAssigner in the config as shown below
model = dict(
    bbox_head=dict(num_classes=5, in_channels=96, feat_channels=96),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            match_low_quality=False,
            ignore_iof_thr=-1,
            gpu_assign_thr=5),
        allowed_border=-1,
        pos_weight=-1,
        debug=False))
resulted in the following output:
01/11 20:00:57 - mmengine - INFO - load backbone. in model from: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
http loads checkpoint from path: https://download.openmmlab.com/mmdetection/v3.0/rtmdet/cspnext_rsb_pretrain/cspnext-tiny_imagenet_600e.pth
01/11 20:00:57 - mmengine - INFO - Checkpoints will be saved to /home/mira/RTMDet/Exp19_maxiou_logs.
/usr/local/lib/python3.8/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
01/11 20:01:06 - mmengine - INFO - Epoch(train) [1][20/41] lr: 4.0000e-03 eta: 0:00:27 time: 0.4417 data_time: 0.0270 memory: 1072 loss: 0.7659 loss_cls: 0.2922 loss_bbox: 0.1749 loss_mask: 0.2988
01/11 20:01:19 - mmengine - INFO - Epoch(train) [1][40/41] lr: 4.0000e-03 eta: 0:00:23 time: 0.5636 data_time: 0.0140 memory: 1735 loss: 0.9751 loss_cls: 0.3478 loss_bbox: 0.2383 loss_mask: 0.3890
01/11 20:01:20 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-2e_20230111_200052
01/11 20:01:20 - mmengine - INFO - Saving checkpoint at 1 epochs
01/11 20:01:22 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:22 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:22 - mmengine - INFO - Epoch(val) [1][1/1]
01/11 20:01:22 - mmengine - INFO - Switch pipeline now!
01/11 20:01:27 - mmengine - INFO - Epoch(train) [2][20/41] lr: 2.3179e-03 eta: 0:00:09 time: 0.4578 data_time: 0.0163 memory: 821 loss: 0.6599 loss_cls: 0.2235 loss_bbox: 0.1650 loss_mask: 0.2714
01/11 20:01:32 - mmengine - INFO - Epoch(train) [2][40/41] lr: 2.2227e-04 eta: 0:00:00 time: 0.3217 data_time: 0.0164 memory: 800 loss: 0.3189 loss_cls: 0.0869 loss_bbox: 0.0804 loss_mask: 0.1516
01/11 20:01:32 - mmengine - INFO - Exp name: rtmdet-ins_tiny_1xb2-1e_tomato_20230111_200052
01/11 20:01:32 - mmengine - INFO - Saving checkpoint at 2 epochs
01/11 20:01:34 - mmengine - INFO - Evaluating bbox...
Loading and preparing results...
01/11 20:01:34 - mmengine - ERROR - /mmdetection/mmdet/evaluation/metrics/coco_metric.py - compute_metrics - 437 - The testing results of the whole dataset is empty.
01/11 20:01:34 - mmengine - INFO - Epoch(val) [2][1/1]
However, switching to MaxIoUAssigner did not lead to an OOM error over multiple epochs, which leads me to believe the problem is indeed the high number of polygons. But inference with the trained model outputs no predictions and, as shown in the log above, throws an error saying that the testing results of the whole dataset are empty. Reading through the issues (#9381), this is sometimes attributed to an incorrect ground-truth label format, but since the data has not changed, that doesn't seem plausible.
To summarize:
- Apart from increasing GPU memory, are there any other solutions to this problem?
- Is there a way to pass gpu_assign_thr to DynamicSoftLabelAssigner?
- Why does using MaxIoUAssigner with RTMDet result in no inference results? This seems like a bug.
- The argument with_cp (suggested in the FAQ for OOM issues) does not exist in CSPNeXt, which is the backbone for RTMDet. Are there plans to add it?
- Documentation for RTMDet seems to lack information on how to train with FP16, if that is an option as suggested in the FAQ. Please advise.
I’m not sure if this is entirely a bug or a feature request but it seems to be a bit of both.
About this issue: open, created a year ago, 4 reactions, 27 comments (6 by maintainers).
For those who got the OOM error during validation, I've found one cause. During validation, the inference input goes through the val_pipeline, which contains 'Resize', so inference itself is fine. But in post-processing the output masks are interpolated back to the original image size and then passed through sigmoid and thresholding to get the final masks. Refer to the code snippet below. https://github.com/open-mmlab/mmdetection/blob/f78af7785ada87f1ced75a2313746e4ba3149760/mmdet/models/dense_heads/rtmdet_ins_head.py#L498-L510
This can be extremely memory-costly if your original image has a large resolution: e.g. a 4000x3000 image yields a mask output tensor of 100x4000x3000, which takes over 4 GB of memory (and that is just a single tensor; there can be several temporary tensors of the same size).
I haven't found an effective solution yet. If you set the 'rescale' parameter to False, the output masks won't be scaled to the original image size, but this leads to wrong metric calculation. I tried putting the sigmoid before the interpolation, which does save some memory but not much. I think one solution would be to set 'rescale' to False and, when calculating validation metrics, resize the original image to match the output mask size.
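A minimal sketch of a related mitigation (not the actual mmdetection code): upsample the mask logits a few instances at a time and threshold each chunk immediately, so that only one small upsampled chunk plus the final boolean masks are resident at once. All names here (chunked_mask_postprocess, mask_logits, chunk_size) are hypothetical.

```python
# Hedged sketch, assuming mask_logits is a (num_instances, h, w) float tensor
# of mask logits produced before rescaling to the original image size.
import torch
import torch.nn.functional as F


def chunked_mask_postprocess(mask_logits, ori_h, ori_w, thr=0.5, chunk_size=20):
    out = []
    for chunk in torch.split(mask_logits, chunk_size, dim=0):
        up = F.interpolate(
            chunk.unsqueeze(0),  # (1, n, h, w): bilinear resize needs 4D input
            size=(ori_h, ori_w),
            mode='bilinear',
            align_corners=False).squeeze(0)
        # Sigmoid + threshold right away so only a bool tensor per chunk is kept.
        out.append(up.sigmoid() > thr)
        del up
    return torch.cat(out, dim=0)  # (num_instances, ori_h, ori_w), dtype bool
```

With float32 logits this bounds the temporary upsampled tensor to chunk_size x ori_h x ori_w instead of num_instances x ori_h x ori_w, at the cost of a few extra interpolate calls.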
I wrote my own version of gpu_assign_thr for DynamicSoftLabelAssigner; it solves the out-of-memory error during training because the assignment computations now happen on the CPU, and the result is moved back to the GPU at the end.
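A rough sketch of what such a wrapper could look like (my approximation, not the commenter's actual code; the assign() signature is assumed to follow the InstanceData-based API of mmdet 3.x and should be verified against your version):

```python
# Hedged sketch of a gpu_assign_thr-style CPU fallback for DynamicSoftLabelAssigner.
import torch
from mmdet.registry import TASK_UTILS
from mmdet.models.task_modules.assigners import DynamicSoftLabelAssigner


@TASK_UTILS.register_module()
class CPUFallbackDynamicSoftLabelAssigner(DynamicSoftLabelAssigner):
    """Run the assignment on CPU when an image has more GTs than gpu_assign_thr."""

    def __init__(self, gpu_assign_thr=-1, **kwargs):
        super().__init__(**kwargs)
        self.gpu_assign_thr = gpu_assign_thr

    def assign(self, pred_instances, gt_instances, gt_instances_ignore=None, **kwargs):
        num_gts = len(gt_instances)
        use_cpu = self.gpu_assign_thr > 0 and num_gts > self.gpu_assign_thr
        if use_cpu:
            device = pred_instances.priors.device
            # Work on CPU copies; the caller's original instances stay on the GPU.
            pred_instances = pred_instances.to('cpu')
            gt_instances = gt_instances.to('cpu')
        result = super().assign(pred_instances, gt_instances, gt_instances_ignore, **kwargs)
        if use_cpu:
            # Move the assignment result back to the original device.
            result.gt_inds = result.gt_inds.to(device)
            if result.max_overlaps is not None:
                result.max_overlaps = result.max_overlaps.to(device)
            if result.labels is not None:
                result.labels = result.labels.to(device)
        return result
```

It could then be selected by swapping the assigner type in model.train_cfg (keeping the original topk etc.) and adding gpu_assign_thr.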
I have the same CUDA out-of-memory error during validation (training finishes fine, with only ~30% of GPU memory occupied) in a single-GPU setting.
Try adding '--amp' to enable fp16 training.
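With the config from this issue, that would be something along the lines of the following (--amp is the train.py flag for automatic mixed precision in mmdet 3.x):
python3 tools/train.py ~/RTMDet/configs/rtmdet-ins_tiny_1xb2-200e.py --work-dir ~/RTMDet/Exp2_logs --amp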
This bug also exists in mmyolo.
@SimonGuoNjust could you elaborate on how you included @AvoidCUDAOOM.retry_if_cuda_oom as well as the max_mask_to_train constraint? The former doesn’t seem to make a difference for me.
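For context, AvoidCUDAOOM.retry_if_cuda_oom from mmdet.utils wraps a callable and, when it hits a CUDA OOM, retries it after emptying the cache, then with FP16 inputs, then on the CPU. Below is a minimal sketch of how it might be applied to the mask loss; the variable names are placeholders, and wrapping self.loss_mask inside loss_mask_by_feat is my assumption, not something confirmed in this thread:

```python
# Hedged sketch (inside RTMDetInsHead.loss_mask_by_feat); names are placeholders.
from mmdet.utils import AvoidCUDAOOM

# Instead of calling self.loss_mask(...) directly, wrap it so a CUDA OOM
# triggers the empty-cache -> FP16 -> CPU retry chain.
loss_mask = AvoidCUDAOOM.retry_if_cuda_oom(self.loss_mask)(
    pos_mask_logits,   # placeholder: predicted mask logits for positives
    pos_gt_masks,      # placeholder: matched ground-truth masks
    avg_factor=num_pos)
```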
@qwert31639 I added @torch.no_grad() but it also doesn’t seem to make much of a difference.
I am trying to run on multiple (24 GB) GPUs using /mmdetection/tools/dist_train.sh and I notice that one GPU stays around the 14 GB memory mark while the other maxes out at 23 GB and causes the error. Does this have something to do with how PyTorch handles distributed training, or with how mmdetection handles it?
+1, same problem with fp16 training: always OOM in the middle of training.
I have also noticed this pattern of increasing memory usage over the first and second epochs.
Thanks for your bug report! We are working on optimizing the memory footprint of RTMDet.