GRiT: Bug in a corner case of proposals

Hi, thanks for your amazing work. I tried to retrain the model on VG; however, there seems to be a corner case that raises an error:

[01/16 12:04:41 d2.utils.events]:  eta: 1 day, 11:49:23  iter: 1360  total_loss: 2.975  loss_box_reg_stage0: 0.2477  loss_box_reg_stage1: 0.3255  loss_box_reg_stage2: 0.2068  loss_centernet_agn_neg: 0.0414  loss_centernet_agn_pos: 0.1851  loss_centernet_loc: 0.3947  loss_cls_stage0: 0.2062  loss_cls_stage1: 0.1867  loss_cls_stage2: 0.1439  loss_mask: 0.3913  text_decoder_loss: 0.6096  time: 0.7084  data_time: 0.0160  lr: 7.7501e-07  max_mem: 21398M
[01/16 12:04:42] grit.modeling.roi_heads.grit_roi_heads INFO: all proposals are background at stage 2
Traceback (most recent call last):
  File "train_deepspeed.py", line 263, in <module>
    launch_deepspeed(
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 67, in launch_deepspeed
    mp.spawn(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/nvme/xxxxx/GRiT/lauch_deepspeed.py", line 133, in _distributed_worker
    main_func(*args)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 251, in main
    do_train(cfg, model, resume=args.resume, train_batch_size=train_batch_size)
  File "/nvme/xxxxx/GRiT/train_deepspeed.py", line 175, in do_train
    loss_dict = model(data)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1656, in forward
    loss = self.module(*inputs, **kwargs)
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/meta_arch/grit.py", line 59, in forward
    proposals, roihead_textdecoder_losses = self.roi_heads(
  File "/nvme/xxxxx/anaconda3/envs/grit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 302, in forward
    losses = self._forward_box(features, proposals, targets, task=targets_task)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 173, in _forward_box
    proposals = self.check_if_all_background(proposals, targets, k)
  File "/nvme/xxxxx/GRiT/grit/modeling/roi_heads/grit_roi_heads.py", line 142, in check_if_all_background
    proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
IndexError: index 0 is out of bounds for dimension 0 with size 0

The error seems to indicate that there are no proposals at all for this batch. It can be easily reproduced with single-node training at around iter 1360.
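In case it helps, here is a minimal sketch of how the crashing line in check_if_all_background (grit_roi_heads.py:142) could be guarded. The helper name and plain-tensor interface are mine, not the repo's; it just illustrates falling back to a GT box when an image ends up with zero proposals, instead of indexing row 0 of an empty tensor:

import torch

# Hypothetical helper (illustrative only, not GRiT's actual API): ensure an image has
# at least one proposal row before the "all background" fallback overwrites row 0.
def pad_empty_proposals(proposal_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    # proposal_boxes: (N, 4) proposals for one image; N may be 0 (the corner case above)
    # gt_boxes:       (M, 4) ground-truth boxes for the same image
    if proposal_boxes.numel() == 0:
        if gt_boxes.numel() == 0:
            # No proposals and no GT boxes: nothing to copy; the caller should skip this image.
            return proposal_boxes
        # Fall back to the first GT box rather than writing into an empty tensor.
        return gt_boxes[:1].clone()
    # Original behaviour: replace the first proposal with the first GT box.
    out = proposal_boxes.clone()
    out[0, :] = gt_boxes[0, :]
    return out

In the ROI head itself, the equivalent guard would presumably just be a check that proposals[0] is non-empty (e.g. via len(proposals[0])) before the assignment.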

Would you mind checking it? I'm not familiar enough with this repo.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 2
  • Comments: 24 (9 by maintainers)

Most upvoted comments

Following ViTDet, for the ViT-B backbone we train on 32 GPUs with 2 images/GPU, and for the ViT-L/H backbones we train on 64 GPUs with 1 image/GPU.
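Just to make the arithmetic explicit, both setups end up with the same effective batch size (assuming detectron2's convention that the configured batch size counts total images per iteration across all GPUs):

# Effective global batch size for the two training setups above.
vit_b_batch = 32 * 2    # ViT-B: 32 GPUs x 2 images/GPU
vit_lh_batch = 64 * 1   # ViT-L/H: 64 GPUs x 1 image/GPU
assert vit_b_batch == vit_lh_batch == 64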

The model has been trained for 10k iterations and training proceeds smoothly, so I believe this bug has been fixed.

Thanks for the update. I didn't add the code suggested above when I trained the model, so I'm not sure why this became an issue for you. I'd appreciate an update when you complete the training.
