vision: MaskRCNN crashes when reshaping an empty tensor rel_codes
torchvision ‘0.4.0+cu92’
Traceback:
creating index...
index created!
Traceback (most recent call last):
  File "./scratch_19.py", line 1068, in <module>
    main()
  File "./scratch_19.py", line 1054, in main
    evaluate(model, data_loader_test, device=device)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "./scratch_19.py", line 889, in evaluate
    outputs = model(image)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 550, in forward
    boxes, scores, labels = self.postprocess_detections(class_logits, box_regression, proposals, image_shapes)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py", line 474, in postprocess_detections
    pred_boxes = self.box_coder.decode(box_regression, proposals)
  File "/disk1/mattan/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/_utils.py", line 168, in decode
    rel_codes.reshape(sum(boxes_per_image), -1), concat_boxes
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
It appears that during a forward pass rel_codes is empty, which makes the reshape call fail.
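The failure can be reproduced in isolation. The snippet below is only a sketch: the shape (0, 4) is an assumption standing in for whatever empty rel_codes the model produced, and it shows that the error comes from combining a zero-sized dimension with -1, not from the tensor being empty as such.

```python
import torch

# Assumed stand-in for an empty rel_codes (no proposals survived).
rel_codes = torch.empty(0, 4)

try:
    # 0 elements cannot determine the size of the -1 dimension.
    rel_codes.reshape(0, -1)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# Spelling out every dimension removes the ambiguity.
explicit = rel_codes.reshape(0, 4)
print(explicit.shape)
```

Reshaping an empty tensor is fine as long as no dimension has to be inferred.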
About this issue
- State: open
- Created 5 years ago
- Reactions: 1
- Comments: 16 (5 by maintainers)
Commits related to this issue
- fix rel_codes decode bug causing crash #1568 — committed to jhultman/vision by jhultman 4 years ago
- training works, prediction/eval stuck on pytorch vision issue, refer to https://github.com/pytorch/vision/issues/1568 — committed to bwolfson2/dl2020 by deleted user 4 years ago
Sometimes it is an exploding gradient problem: the model's outputs grow very large (> 10**20) and end up as inf or NaN. In that case you must retrain your model from the beginning and try a lower learning rate or gradient clipping.
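Gradient clipping as suggested here could look roughly like the following. This is a minimal sketch, not the commenter's code: the tiny nn.Linear model, the inflated inputs and max_norm = 1.0 are all placeholder assumptions; only torch.nn.utils.clip_grad_norm_ is the real PyTorch utility.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # placeholder for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(8, 4) * 1e3  # deliberately large, to provoke big gradients
targets = torch.randn(8, 2)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

max_norm = 1.0  # assumed threshold; tune for the actual model
# Rescales all gradients in place so their global L2 norm is at most max_norm;
# returns the norm measured before clipping.
total_norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()

# Recompute the global gradient norm to confirm the clipping took effect.
total_norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(float(total_norm_before), float(total_norm_after))
```

The clipping call goes between loss.backward() and optimizer.step(), so the update uses the rescaled gradients.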
Is there a definitive solution for this problem? It’s happening for me when using FasterRCNN too.
Hi @fmassa, here’s a minimal working example. The initialization is deliberately poor, but it would be nice if the inference code didn’t crash. Note that none of the weights or inputs are NaN, so this could in principle happen by chance.
The problem is with BoxCoder.decode. Here’s my attempt at a fix which seems to work for me (unchanged code omitted):
I can submit a PR if you think it’s appropriate.
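One way to guard the reshape is sketched below. This is not the patch from the comment above, which is not reproduced here; reshape_rel_codes is a hypothetical helper and the tensor shapes are assumptions. The idea is simply to skip the reshape with -1 when there are no boxes at all.

```python
import torch

def reshape_rel_codes(rel_codes, boxes_per_image):
    """Hypothetical helper: flatten rel_codes to (num_boxes, -1) only if num_boxes > 0."""
    box_sum = sum(boxes_per_image)
    if box_sum > 0:
        # With at least one box, the -1 dimension can be inferred safely.
        rel_codes = rel_codes.reshape(box_sum, -1)
    # With zero boxes the tensor is returned unchanged instead of raising.
    return rel_codes

empty = reshape_rel_codes(torch.empty(0, 8), [0, 0])
full = reshape_rel_codes(torch.zeros(3, 2, 4), [1, 2])
print(empty.shape, full.shape)
```

Whatever consumes the result still has to cope with a tensor that has zero rows, so a guard like this only moves the empty case downstream; it does not explain why no boxes were produced.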
I have encountered this issue too. https://github.com/pytorch/vision/blob/d2c763e14efe57e4bf3ebf916ec243ce8ce3315c/torchvision/models/detection/generalized_rcnn.py#L66-L72 Specifically, during evaluation of the model, for some training examples the call to self.backbone on line 67 (in my case an FPN) returns a feature pyramid with all NaNs, which seems to be what is causing the problem.
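To confirm that the backbone output is the culprit, the feature pyramid can be checked for non-finite values before it reaches the RoI heads. The sketch below assumes the features arrive as a dict mapping level names to tensors; find_nonfinite_levels is a hypothetical helper and the example tensors are made up.

```python
from collections import OrderedDict

import torch

def find_nonfinite_levels(features):
    """Hypothetical helper: names of feature maps that contain NaN or inf."""
    return [
        name
        for name, fmap in features.items()
        if not torch.isfinite(fmap).all()
    ]

# Made-up pyramid: level "0" is healthy, level "1" is all NaN.
features = OrderedDict(
    [
        ("0", torch.zeros(1, 8, 4, 4)),
        ("1", torch.full((1, 8, 2, 2), float("nan"))),
    ]
)
bad = find_nonfinite_levels(features)
print(bad)
```

Running such a check on the backbone output for a failing image would show whether the NaNs appear already in the features or only later in the heads.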