GRiT: Bug in a corner case with no proposals
Hi, thanks for your amazing work. I tried to retrain the model on VG, but there seems to be a corner case that raises an error:
```
[01/16 12:04:41 d2.utils.events]: eta: 1 day, 11:49:23 iter: 1360 total_loss: 2.975 loss_box_reg_stage0: 0.2477 loss_box_reg_stage1: 0.3255 loss_box_reg_stage2: 0.2068 loss_centernet_agn_neg: 0.0414 loss_centernet_agn_pos: 0.1851 loss_centernet_loc: 0.3947 loss_cls_stage0: 0.2062 loss_cls_stage1: 0.1867 loss_cls_stage2: 0.1439 loss_mask: 0.3913 text_decoder_loss: 0.6096 time: 0.7084 data_time: 0.0160 lr: 7.7501e-07 max_mem: 21398M
[01/16 12:04:42] grit.modeling.roi_heads.grit_roi_heads INFO: all proposals are background at stage 2
Traceback (most recent call last):
  File "train_deepspeed.py", line 263, in <module>
    launch_deepspeed(
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 67, in launch_deepspeed
    mp.spawn(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 251, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 175, in do_train
    loss_dict = model(data)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1656, in forward
    loss = self.module(*inputs, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/meta_arch/grit.py", line 59, in forward
    proposals, roihead_textdecoder_losses = self.roi_heads(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 302, in forward
    losses = self._forward_box(features, proposals, targets, task=targets_task)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 173, in _forward_box
    proposals = self.check_if_all_background(proposals, targets, k)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 142, in check_if_all_background
    proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
IndexError: index 0 is out of bounds for dimension 0 with size 0
```
The error seems to indicate that there are no proposals at all for this batch, so `check_if_all_background` indexes into an empty tensor. It can be easily reproduced with single-node training at around iter 1360.
Would you mind checking it? I'm not familiar enough with this repo.
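In case it helps, a guard along these lines might avoid the crash. This is just a sketch assuming detectron2-style `Instances`/`Boxes` and the field names visible in the traceback; `patch_empty_proposals` and the `objectness_logits` handling are my own naming, not code from this repo:

```python
import torch
from detectron2.structures import Boxes, Instances


def patch_empty_proposals(proposals, targets):
    """Give every image at least one proposal by injecting its first GT box."""
    patched = []
    for prop, tgt in zip(proposals, targets):
        if len(prop) == 0 and len(tgt) > 0:
            # Build a one-box Instances from the first ground-truth box so the
            # downstream write to proposal_boxes.tensor[0, :] never indexes an
            # empty tensor.
            new_prop = Instances(prop.image_size)
            new_prop.proposal_boxes = Boxes(tgt.gt_boxes.tensor[:1].clone())
            # objectness_logits is an assumption here; copy whatever extra
            # fields the real proposals carry in this repo.
            new_prop.objectness_logits = torch.ones(
                1, device=tgt.gt_boxes.tensor.device
            )
            patched.append(new_prop)
        else:
            patched.append(prop)
    return patched
```

Something equivalent could also be folded directly into `check_if_all_background` before the box assignment.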
Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.
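For reference, assuming GRiT uses detectron2's standard `SOLVER.IMS_PER_BATCH` (the global batch size across all GPUs), both settings come out to 64 images per iteration; the sketch below is illustrative, and the actual keys/values in GRiT's config files may differ:

```python
from detectron2.config import get_cfg

cfg = get_cfg()

# ViT-B: 32 GPUs x 2 images/GPU -> 64 images per iteration.
num_gpus, ims_per_gpu = 32, 2
cfg.SOLVER.IMS_PER_BATCH = num_gpus * ims_per_gpu

# ViT-L/H: 64 GPUs x 1 image/GPU gives the same global batch of 64.
```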
Thanks for the update. I didn't add the suggested code above when I trained the model, so I'm not sure why this became an issue for you. I'd appreciate an update once you complete the training.
The model has now been trained for 10k iterations and is proceeding smoothly, so I believe this bug has been fixed.