vision: MaskRCNN crashes when reshaping an empty tensor rel_codes

torchvision ‘0.4.0+cu92’

Traceback:

creating index...
index created!
Traceback (most recent call last):
  File "./scratch_19.py", line 1068, in <module>
    main()
  File "./scratch_19.py", line 1054, in main
    evaluate(model, data_loader_test, device=device)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "./scratch_19.py", line 889, in evaluate
    outputs = model(image)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 550, in forward
    boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 474, in postprocess_detections
    pred_boxes = self.box_coder.decode(box_regression, proposals)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/_utils.py", line 168, in decode
    rel_codes.reshape(sum(boxes_per_image), -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

It appears that rel_codes is empty during a forward pass, which makes the reshape call fail.
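
The failing call can be reproduced in isolation. This is a minimal sketch, not the torchvision code path itself: the shape (0, 8) is an assumption standing in for zero proposals with 2 classes times 4 box deltas, and boxes_per_image is set by hand.

```python
import torch

# Assumption: no proposals survive, so rel_codes has zero rows.
# 8 columns stand in for 2 classes * 4 box deltas.
rel_codes = torch.empty((0, 8))
boxes_per_image = [0]

try:
    # Same call shape as in the traceback: with 0 elements, -1 cannot be inferred.
    rel_codes.reshape(sum(boxes_per_image), -1)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```

With zero elements, any size for the -1 dimension would satisfy the reshape, so PyTorch refuses to pick one.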

About this issue

State: open
Created 5 years ago
Reactions: 1
Comments: 16 (5 by maintainers)

Commits related to this issue

fix rel_codes decode bug causing crash #1568 — committed to jhultman/vision by jhultman 4 years ago
training works, prediction/eval stuck on pytorch vision issue, refer to https://github.com/pytorch/vision/issues/1568 — committed to bwolfson2/dl2020 by deleted user 4 years ago

Most upvoted comments

Sometimes this is an exploding gradient problem, where the model outputs very large values (> 10**20) that are treated as NaN. In that case you must retrain your model from the beginning and try a lower learning rate or gradient clipping, e.g.:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
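
For context, here is a self-contained sketch of one training step with the clipping call in place. The toy linear model, the SGD optimizer, the scaled inputs, and the max_norm value of 1.0 are all assumptions for illustration; they are not taken from the thread.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # toy stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
max_norm = 1.0  # hypothetical threshold; tune it for the real model

inputs = torch.randn(8, 4) * 1e3  # large inputs to provoke large gradients
targets = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
# clip_grad_norm_ rescales the gradients in place and returns the total norm
# measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# Recompute the total gradient norm to confirm the clip took effect.
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
optimizer.step()
```

The clipping call has to sit between loss.backward() and optimizer.step(); clipping before the backward pass would act on stale or missing gradients.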

Alieladi on Jul 20, 2020

Is there a definitive solution for this problem? It’s happening for me when using FasterRCNN too.

augustoolucas on Dec 5, 2020

Hi @fmassa, here’s a minimal working example. Obviously this initialization is purposely poor, but it would be nice if the inference code didn’t crash. Note that none of the weights or inputs are NaN, so this could in principle happen by chance.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
torch.manual_seed(21)

model = fasterrcnn_resnet50_fpn(num_classes=2).cuda().eval()
state_dict = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
model.load_state_dict(state_dict)
with torch.no_grad():
    model([torch.rand((3, 512, 512)).cuda()])

>>> RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...

The problem is with BoxCoder.decode. Here’s my attempt at a fix which seems to work for me (unchanged code omitted):

def decode(self, rel_codes, boxes):
    ...
    assert rel_codes.size(0) == box_sum
    pred_boxes = self.decode_single(rel_codes, concat_boxes)
    deltas_per_box = rel_codes.size(-1) // 4
    return pred_boxes.reshape(box_sum, deltas_per_box, 4)

I can submit a PR if you think it is appropriate.
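
The key idea in the snippet above is to spell out every output dimension instead of using -1. The sketch below isolates just that reshape; reshape_decoded is a hypothetical helper, and the (0, 8) and (5, 8) tensors are assumed stand-ins for the output of decode_single.

```python
import torch


def reshape_decoded(pred_boxes, rel_codes, box_sum):
    """Hypothetical helper: reshape decoded boxes without relying on -1."""
    deltas_per_box = rel_codes.size(-1) // 4
    return pred_boxes.reshape(box_sum, deltas_per_box, 4)


# Empty case: zero proposals, 8 columns (assumed 2 classes * 4 deltas).
rel_codes = torch.empty((0, 8))
pred_boxes = torch.empty((0, 8))  # stands in for the decode_single output
empty_out = reshape_decoded(pred_boxes, rel_codes, box_sum=0)

# Non-empty case behaves as before.
full_out = reshape_decoded(torch.zeros((5, 8)), torch.zeros((5, 8)), box_sum=5)
```

Because all three target dimensions are given explicitly, the empty case yields a well-defined (0, 2, 4) tensor rather than an error.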

jhultman on Apr 1, 2020

I have encountered this issue too. See https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72. Specifically, during evaluation of the model, for some training examples the call to self.backbone on line 67 (which in my case is an FPN) returns a feature pyramid with all NaNs. This seems to be what is causing the problem.
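
One way to confirm this kind of failure is to inspect the backbone output before it reaches the RPN. The sketch below assumes the backbone returns a dict mapping level names to feature tensors; non_finite_levels is a hypothetical helper, and the tensors are made up for illustration.

```python
import torch


def non_finite_levels(features):
    """Return the names of feature maps that contain NaN or inf values."""
    return [
        name
        for name, fmap in features.items()
        if not torch.isfinite(fmap).all()
    ]


# Made-up pyramid: level "0" is healthy, level "1" is all NaN.
features = {
    "0": torch.rand(1, 4, 8, 8),
    "1": torch.full((1, 4, 4, 4), float("nan")),
}
bad = non_finite_levels(features)
print(bad)
```

Running such a check on the real output of self.backbone would show whether the empty proposals trace back to non-finite features.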

armbuster on Dec 31, 2019