vision: MaskRCNN crashes when reshaping an empty tensor rel_codes

torchvision ‘0.4.0+cu92’

Traceback:

creating index...
index created!
Traceback (most recent call last):
  File "./scratch_19.py", line 1068, in <module>
    main()
  File "./scratch_19.py", line 1054, in main
    evaluate(model, data_loader_test, device=device)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "./scratch_19.py", line 889, in evaluate
    outputs = model(image)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 550, in forward
    boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 474, in postprocess_detections
    pred_boxes = self.box_coder.decode(box_regression, proposals)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/_utils.py", line 168, in decode
    rel_codes.reshape(sum(boxes_per_image), -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

It appears that, during a forward pass, rel_codes is empty, which makes the reshape call crash.
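To illustrate why a shape of [0, -1] cannot be resolved for an empty tensor, here is a minimal pure-Python sketch of the inference a reshape has to perform for a single -1 dimension. The function name `infer_unspecified_dim` and its error messages are hypothetical, written for illustration only; this is not PyTorch's actual implementation.

```python
def infer_unspecified_dim(numel, known_dims):
    """Infer the size of a single -1 dimension, or raise if it cannot be determined.

    numel      -- total number of elements in the tensor
    known_dims -- the explicitly given dimension sizes (everything except the -1)
    """
    known = 1
    for d in known_dims:
        known *= d

    if known == 0:
        if numel == 0:
            # 0 * size == 0 holds for every size, so no unique answer exists.
            raise ValueError("ambiguous: any size satisfies 0 * size == 0")
        raise ValueError("impossible: the requested shape holds 0 elements")

    if numel % known != 0:
        raise ValueError("shape is not compatible with the element count")
    return numel // known


# 6 boxes with 4 values each: the -1 is uniquely determined.
normal = infer_unspecified_dim(24, [6])

# 0 boxes: mirrors rel_codes.reshape(0, -1) on an empty tensor.
try:
    infer_unspecified_dim(0, [0])
    empty_case = "ok"
except ValueError as err:
    empty_case = str(err)
```

With zero proposals the known dimension is 0, so every candidate size for the -1 dimension is consistent with the element count, which is the ambiguity the error message describes.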

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Sometimes this is an exploding-gradient problem: the model outputs very large values (> 10**20) that end up being treated as NaN. In that case you must retrain the model from the beginning with a lower learning rate or with gradient clipping, e.g.:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
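For readers unsure what norm-based clipping does to the gradients, here is a conceptual pure-Python sketch. It is only an illustration of the idea under stated assumptions, not torch's implementation: the helper `clip_by_total_norm`, the `eps` term, and the example gradient values are all made up for this example.

```python
import math


def clip_by_total_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their combined L2 norm is at most max_norm.

    grads -- list of per-parameter gradients, each a flat list of floats
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Every gradient is shrunk by the same factor, so the direction is kept.
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm


# Two hypothetical parameters whose gradients have a combined norm of 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_total_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

In the snippet above, clip_value plays the role of max_norm here; a suitable value depends on the model and has to be chosen by experiment.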

Is there a definitive solution for this problem? It’s happening for me when using FasterRCNN too.

Hi @fmassa, here’s a minimum working example. Obviously this initialization is purposely poor but it would be nice if the inference code didn’t crash. Note that none of the weights or inputs are NaN, so this could in principle happen by chance.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
torch.manual_seed(21)

model = fasterrcnn_resnet50_fpn(num_classes=2).cuda().eval()
state_dict = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(state_dict)
with torch.no_grad():
    model([torch.rand((3, 512, 512)).cuda()])

>>> RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...

The problem is with BoxCoder.decode. Here’s my attempt at a fix which seems to work for me (unchanged code omitted):

def decode(self, rel_codes, boxes):
    ...
    assert rel_codes.size(0) == box_sum
    pred_boxes = self.decode_single(rel_codes, concat_boxes)
    deltas_per_box = rel_codes.size(-1) // 4
    return pred_boxes.reshape(box_sum, deltas_per_box, 4)

I can submit a PR if you think it's appropriate.

I have encountered this issue too. https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72 Specifically, during evaluation of the model, for some training examples the call to self.backbone on line 67 (an FPN in my case) returns a feature pyramid that is all NaNs. This seems to be what is causing the problem.
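One way to narrow down which pyramid levels contain bad values is a small check like the sketch below. It is a hypothetical helper, not part of torchvision, and it assumes each feature map has first been converted to a flat list of floats (for example with `tensor.flatten().tolist()`); the level names and values are made up.

```python
import math


def non_finite_levels(features):
    """Return the names of pyramid levels that contain any NaN or infinite value.

    features -- dict mapping a level name to a flat list of floats
    """
    return [
        name
        for name, values in features.items()
        if not all(math.isfinite(v) for v in values)
    ]


# Made-up example: level "0" is healthy, the other two are not.
features = {
    "0": [0.1, -2.5],
    "1": [float("nan"), 1.0],
    "pool": [float("inf")],
}
bad = non_finite_levels(features)
```

Running such a check right after the backbone call can help tell whether the non-finite values originate in the backbone or further downstream.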