vision: Inconsistent representation of box in torchvision.ops
Feature suggestion
When checking how the new operations in torchvision 0.11 work, I found an inconsistency between different functions.
import torchvision.ops as tops
import torch
import torch.nn as nn
a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1
bbox = tops.masks_to_boxes((a==1)[0].float())
area = tops.box_area(bbox)  # returns 9, but the mask covers 16 pixels
new_box = tops.box_convert(bbox, 'xyxy', 'xywh')  # returns [4, 0, 3, 3] => expected output [4, 0, 4, 4]
Suggest a potential alternative/fix
This is not strictly an error, but it could cause confusion.
I suggest making the box representation more consistent across the different functions.
About this issue
- State: open
- Created 3 years ago
- Comments: 23 (20 by maintainers)
We should return area / sizes as x2 - x1, and not x2 - x1 + 1 as it was previously done in early versions of other libraries (like Detectron). This was a deliberate decision, and as @rbgirshick pointed out in the past
Thanks.
Then I’m not sure this 0 vs 1 indexing discussion is relevant to what we’re concerned with here. Python is 0-based and we’re following that too.
The problem we have here is the relation between x1 - x0 and y1 - y0, and how it plays out in the area calculation, and other conversions. Those concerns are valid whether one uses 0 or 1-based indexing, but I don’t think we should worry about supporting 1-based indexing.
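To make the point above concrete, here is a minimal sketch in plain Python (no torchvision involved). The helper names `extent` and `shift` are hypothetical and only illustrate the plain x2 - x1 convention discussed in this thread: moving the same box from 0-based to 1-based coordinates leaves its computed width and height unchanged, so the 3-vs-4 discrepancy does not come from the indexing base.

```python
def extent(box):
    """Width and height of an xyxy box under the plain `x2 - x1` convention."""
    x1, y1, x2, y2 = box
    return x2 - x1, y2 - y1


def shift(box, offset):
    """Move every coordinate by `offset` (e.g. 0-based -> 1-based indexing)."""
    return tuple(c + offset for c in box)


# The box reported in this issue, expressed with 0-based pixel indices.
box_zero_based = (4, 0, 7, 3)
# The same pixels addressed with 1-based indices.
box_one_based = shift(box_zero_based, 1)

print(extent(box_zero_based))  # (3, 3)
print(extent(box_one_based))   # (3, 3)
```

Both calls give a 3 x 3 extent for a region that covers 4 x 4 pixels, which suggests the open question is whether x2 and y2 are inclusive or exclusive bounds, not which index the image starts at.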
Thanks for the ping.
I believe there is a typo in this comment. This should have been:
The caveat to doing this is that the coordinates of the right element might be outside of the image.
Thanks for the report @npmhung, I agree this is a bit confusing.
The mask a is 8x8, with ones in the top-right 4x4 block (rows 0-3, columns 4-7) and zeros elsewhere, and masks_to_boxes returns tensor([[4., 0., 7., 3.]]), which is correct (or is it??).
box_area returns 9 instead of 16 because the area is simply computed as (x2 - x1) * (y2 - y1). When the box coordinates are indices, the correct way to compute the area would be to add + 1 to both dimensions, but the problem is that this would make the computation incorrect when the box coordinates are real numbers (which they often are).
The issue is similar with box_convert.
I’m honestly not sure how to best solve this. Should we decide that when a box is represented as indices, the upper values (x2 and y2) should be offset by 1? I.e. masks_to_boxes should return 4, 0, 8, 4 instead? This would be consistent with Python indexing, i.e. we could do something like the_box_img = orig_img[x1:x2, y1:y2], which is incorrect right now.
Thoughts @fmassa @datumbox?
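The proposal above can be sketched in plain Python on a list-of-rows mask. This is an illustration only, not torchvision's implementation: `mask_to_box`, its `exclusive_upper` flag, and `area` are hypothetical names. With exclusive upper bounds the x2 - x1 area matches the pixel count, slicing recovers the masked block, and real-valued boxes need no + 1 correction.

```python
def mask_to_box(mask, exclusive_upper=False):
    """Tight xyxy box around the nonzero cells of a 2D list-of-rows mask.

    `exclusive_upper` is a hypothetical flag: when True, x2 and y2 are
    offset by 1 so the box behaves like a Python slice (half-open).
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("mask has no nonzero cells")
    offset = 1 if exclusive_upper else 0
    return min(xs), min(ys), max(xs) + offset, max(ys) + offset


def area(box):
    """Area under the plain (x2 - x1) * (y2 - y1) convention."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


# Same mask as in the report: rows 0-3, columns 4-7 are set.
mask = [[1 if (y < 4 and x >= 4) else 0 for x in range(8)] for y in range(8)]

inclusive = mask_to_box(mask)                        # (4, 0, 7, 3)
half_open = mask_to_box(mask, exclusive_upper=True)  # (4, 0, 8, 4)

print(area(inclusive))  # 9
print(area(half_open))  # 16

# With half-open bounds, slicing selects exactly the masked block.
x1, y1, x2, y2 = half_open
crop = [row[x1:x2] for row in mask[y1:y2]]

# Real-valued boxes use the same formula, with no "+ 1" adjustment.
print(area((0.5, 0.5, 2.0, 2.5)))  # 3.0
```

Note that in this list-of-rows layout the crop slices rows (y) first and columns (x) second. The half-open box also has x2 = 8, which is one past the last valid column index of the 8-wide mask; that is harmless for slicing but would fail as a direct index.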