vision: Inconsistent representation of box in torchvision.ops
Feature suggestion
When checking the how the new operations in the torchvision 0.11 work, I found an inconsistency between different function.
import torchvision.ops as tops
import torch
import torch.nn as nn
a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1
bbox = tops.masks_to_boxes((a==1)[0].float())
area = tops.box_area(bbox) # the returned area is 9 which should be 16
new_box = tops.box_convert(bbox, 'xyxy', 'xywh') # which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]
Suggest a potential alternative/fix
This is not actually an error but could cause confusion.
I suggest to make the box representations between different function more consistent.
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 23 (20 by maintainers)
We should return area / sizes as
x2 - x1
, and notx2 - x1 + 1
as it was previously done in early versions of other libraries (like Detectron).This was a deliberate decision, and as @rbgirshick pointed out in the past
Thanks.
Then I’m not sure this 0 vs 1 indexing discussion is relevant to what we’re concerned with here. Python is 0-based and we’re following that too.
The problem we have here is the relation between x1 - x0 and y1 - y0, and how it plays out in the area calculation, and other conversions. Those concerns are valid whether one uses 0 or 1-based indexing, but I don’t think we should worry about supporting 1-based indexing.
Thanks for the ping.
I believe there is a typo on this comment. This should have been:
The caveat to doing this is that the coordinates of the right element might be outside of the image.
Thanks for the report @npmhung , I agree this is a bit confusing.
The mask a is equal to
and
masks_to_boxes
returnstensor([[4., 0., 7., 3.]])
, which is correct (or is it??).box_area
returns 9 instead of 16 because the area computation is simply computed asWhen the the box coordinates are indices, the correct way to compute the area would be to add
+ 1
to both dimensions, but the problem is that this would make the computation incorrect in when the box coordinates are real numbers (which they often are).The issue is similar with
box_convert
.I’m honestly not sure how to best solve this. Should we decide that when a box is represented as indices, the upper values (x2 and y2) should be offset by 1? I.e.
mask_to_boxes
should return4, 0, 8, 4
instead? This would be consistent with Python indexing, i.e. we could do something likethe_box_img = orig_img[x1:x2, y1:y2]
, which is incorrect right now.Thoughts @fmassa @datumbox?