dino: Error using visualize_attention.py. The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1
Hi all, I am trying to execute visualize_attention.py with default pretrained weights on my own image as below
!python visualize_attention.py --image_path 'test/finalImg_249.png'
I get a size mismatch error. Could you please let me know what changes need to be made here?
Error stack trace:

```
Please use the --pretrained_weights argument to indicate the path of the checkpoint to evaluate.
Since no pretrained weights have been provided, we load the reference pretrained DINO weights.
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3458: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3503: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
Traceback (most recent call last):
  File "visualize_attention.py", line 162, in <module>
    attentions = model.forward_selfattention(img.to(device))
  File "~/dino/vision_transformer.py", line 246, in forward_selfattention
    x = x + pos_embed
RuntimeError: The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1
```
Image details:

```python
import cv2
img = cv2.imread('finalImg_249.png')
print(img.shape)  # output: (427, 488, 3)
```
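For reference, both sizes in the error can be reproduced from the image dimensions alone. This is a sketch assuming the default patch size of 8 used by visualize_attention.py (which crops each side down to a multiple of the patch size); the "stale" 53x60 positional-embedding grid is the one described in the discussion of this issue:

```python
# cv2 reports (height, width, channels) = (427, 488, 3)
h, w, patch = 427, 488, 8

# Token grid after cropping to a multiple of the patch size
grid_h = h // patch          # 53
grid_w = w // patch          # 61

tokens = grid_h * grid_w + 1              # +1 for the [CLS] token
stale_tokens = grid_h * (grid_w - 1) + 1  # stale 53x60 pos-embed grid

print(tokens, stale_tokens)  # 3234 3181
```

These are exactly the two sizes reported by the RuntimeError, which points at the positional embedding rather than the image itself.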
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (9 by maintainers)
This could be the issue: https://github.com/facebookresearch/dino/blob/8aa93fdc90eae4b183c4e3c005174a9f634ecfbf/vision_transformer.py#L238-L240

Here it is `pos_embed`, while in the previous case (https://github.com/facebookresearch/dino/blob/8aa93fdc90eae4b183c4e3c005174a9f634ecfbf/vision_transformer.py#L235-L237) it is `patch_pos_embed`.
Hi all, yes that's definitely a typo... https://github.com/facebookresearch/dino/commit/91fd052deff3106feef93c4ac6791e89effc84a2
I have explained it here: it's probably a typo in the variable name, but that needs clarification from the authors. I am training on greyscale images as well; will let you know how it goes!
Yes, it happens on one side because they have introduced a new variable `pos_embed`, I think: https://github.com/facebookresearch/dino/blob/8aa93fdc90eae4b183c4e3c005174a9f634ecfbf/vision_transformer.py#L235-L246

Here the `patch_pos_embed` needs to get padded to move from height 60 to 61, but instead of padding `patch_pos_embed`, the result is assigned to `pos_embed`, and therein lies the problem, I believe. Since the `patch_pos_embed` shape remains 1x384x53x60 instead of 1x384x53x61 (batch, dim, patches_in_width, patches_in_height), after flatten and transpose we get a `patch_pos_embed` shape of 1x3180x384 (60 * 53 = 3180). After concatenating the `class_pos_embed`, we get the shape 1x3181x384. However, if it were padded properly, we would go from 1x384x53x61 to 1x3233x384 to 1x3234x384, which is the required shape for adding to `x`.
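The shape bookkeeping above can be sketched with plain NumPy (a minimal illustration of the bug, not the actual DINO code; the names and zero-filled arrays are stand-ins for the real tensors):

```python
import numpy as np

dim = 384
# Stale positional-embedding grid: (batch, dim, patches_in_width, patches_in_height)
patch_pos_embed = np.zeros((1, dim, 53, 60))
class_pos_embed = np.zeros((1, 1, dim))

# Buggy path: the padded/resized grid is assigned to a NEW name `pos_embed`,
# so `patch_pos_embed` keeps its old 53x60 shape and the resize is lost.
pos_embed = np.zeros((1, dim, 53, 61))  # computed, but never used

flat = patch_pos_embed.reshape(1, dim, -1).transpose(0, 2, 1)  # (1, 3180, 384)
buggy = np.concatenate([class_pos_embed, flat], axis=1)
print(buggy.shape)  # (1, 3181, 384) -- mismatches x, which has 3234 tokens

# Fixed path: keep the resized grid under the same name before flattening.
patch_pos_embed = pos_embed
flat = patch_pos_embed.reshape(1, dim, -1).transpose(0, 2, 1)  # (1, 3233, 384)
fixed = np.concatenate([class_pos_embed, flat], axis=1)
print(fixed.shape)  # (1, 3234, 384) -- matches x
```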
Don't think it is because of grey; I have tried with grey. The `visualize_attention` file converts to RGB even if the input is grey, I think.
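A quick way to confirm that greyscale input is not the cause (a sketch assuming Pillow, which visualize_attention.py uses to load images): converting an `"L"` image with `convert("RGB")` always yields three channels, so the channel count seen by the model is the same for grey and colour inputs.

```python
from PIL import Image

# A synthetic 8x8 greyscale image as a stand-in for a real file.
grey = Image.new("L", (8, 8), color=128)

# Same conversion the script applies after loading.
rgb = grey.convert("RGB")

print(rgb.mode)             # RGB
print(rgb.getpixel((0, 0)))  # (128, 128, 128)
```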