vision: Inconsistent representation of box in torchvision.ops

Feature suggestion

While checking how the new operations in torchvision 0.11 work, I found an inconsistency between different functions.

import torchvision.ops as tops
import torch
import torch.nn as nn

a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1  # mask covering the top-right 4x4 block (16 pixels)
bbox = tops.masks_to_boxes((a==1)[0].float()) 
area = tops.box_area(bbox) # the returned area is 9 which should be 16

new_box = tops.box_convert(bbox, 'xyxy', 'xywh') # which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]
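
For reference, printing the intermediate values on torchvision 0.11 (as reported later in this thread) gives:

print(bbox)     # tensor([[4., 0., 7., 3.]])
print(area)     # tensor([9.])
print(new_box)  # tensor([[4., 0., 3., 3.]])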

Suggest a potential alternative/fix

This is not actually an error but could cause confusion.

I suggest making the box representation consistent across the different functions.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 23 (20 by maintainers)

Most upvoted comments

We should return areas / sizes as x2 - x1, and not x2 - x1 + 1 as was previously done in early versions of other libraries (like Detectron).
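
For illustration, a minimal sketch of the two conventions applied to the box from the example above ([4, 0, 7, 3]):

import torch

box = torch.tensor([[4., 0., 7., 3.]])
w = box[:, 2] - box[:, 0]
h = box[:, 3] - box[:, 1]

# torchvision convention: coordinates are continuous, so width/height are x2 - x1 / y2 - y1
print(w * h)              # tensor([9.])

# legacy "+ 1" convention (early Detectron): coordinates are inclusive pixel indices
print((w + 1) * (h + 1))  # tensor([16.])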

This was a deliberate decision, and as @rbgirshick pointed out in the past

Anyone interested in this issue should read http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf. I believe that this provides the correct framework for thinking about continuous geometric entities (e.g., boxes, polygons, etc) and discrete rasterized entities (e.g., bit masks, feature maps, etc.).

Thanks.

Then I’m not sure this 0 vs 1 indexing discussion is relevant to what we’re concerned with here. Python is 0-based and we’re following that too.

The problem we have here is how x1 - x0 and y1 - y0 are interpreted, and how that plays out in the area calculation and other conversions. Those concerns are valid whether one uses 0- or 1-based indexing, but I don’t think we should worry about supporting 1-based indexing.

Thanks for the ping.

which returns [4, 4, 3, 3] => expected output [4, 4, 4, 4]

I believe there is a typo in this comment. It should have been:

which returns [4, 0, 3, 3] => expected output [4, 0, 4, 4]

I agree that masks_to_boxes should return 4, 0, 8, 4 instead.

The caveat to doing this is that the coordinates of the right/bottom edge might fall outside of the image.
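
To make the caveat concrete, a small sketch using the mask from the original example and a hand-written exclusive box: x2 == 8 equals the image width, so it is out of range as a pixel index, but it is a valid slice end.

import torch

a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1

x1, y1, x2, y2 = 4, 0, 8, 4          # hypothetical exclusive box for this mask
# a[0, 0, y1, x2]                    # would raise IndexError (index 8 is out of bounds)
print(a[0, 0, y1:y2, x1:x2].shape)   # torch.Size([4, 4])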

Thanks for the report @npmhung, I agree this is a bit confusing.

The mask a is equal to

tensor([[[[0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 1., 1., 1., 1.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0.]]]])

and masks_to_boxes returns tensor([[4., 0., 7., 3.]]), which is correct (or is it??).

box_area returns 9 instead of 16 because the area is simply computed as

 (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

When the box coordinates are pixel indices, the correct way to compute the area would be to add + 1 to both dimensions, but the problem is that this would make the computation incorrect when the box coordinates are real numbers (which they often are).
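
For example (a sketch; the fractional box below is made up purely for illustration):

import torch

# For an index-valued box like [4, 0, 7, 3], adding 1 recovers the 16 covered pixels,
# but for a continuous box the "+ 1" over-counts:
real_box = torch.tensor([[0.5, 0.5, 2.5, 2.5]])   # true area is 2.0 * 2.0 = 4.0
w1 = real_box[:, 2] - real_box[:, 0] + 1
h1 = real_box[:, 3] - real_box[:, 1] + 1
print(w1 * h1)   # tensor([9.]) instead of 4.0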

The issue is similar with box_convert.

I’m honestly not sure how to best solve this. Should we decide that when a box is represented as indices, the upper values (x2 and y2) should be offset by 1? I.e. masks_to_boxes should return 4, 0, 8, 4 instead? This would be consistent with Python indexing, i.e. we could do something like the_box_img = orig_img[y1:y2, x1:x2], which is incorrect right now.
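
A sketch of what that convention would give on the original example (the exclusive box is written by hand here; it is not what masks_to_boxes currently returns):

import torch
import torchvision.ops as tops

a = torch.zeros(1, 1, 8, 8)
a[..., :4, 4:] = 1

box = torch.tensor([[4., 0., 8., 4.]])   # hypothetical exclusive-upper-bound box for this mask

print(tops.box_area(box))                     # tensor([16.]) -- matches the 16 mask pixels
print(tops.box_convert(box, 'xyxy', 'xywh'))  # tensor([[4., 0., 4., 4.]])

# Python slicing now recovers exactly the masked region
x1, y1, x2, y2 = box[0].long().tolist()
print(torch.equal(a[0, 0, y1:y2, x1:x2], torch.ones(4, 4)))  # True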

Thoughts @fmassa @datumbox?