dino: Error using visualize_attention.py. The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Hi all, I am trying to execute visualize_attention.py with the default pretrained weights on my own image, as below:

!python visualize_attention.py --image_path 'test/finalImg_249.png'

I get a size mismatch error. Could you please let me know what changes need to be made here?

Error stack trace:

Please use the --pretrained_weights argument to indicate the path of the checkpoint to evaluate. Since no pretrained weights have been provided, we load the reference pretrained DINO weights.

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3458: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3503: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.

Traceback (most recent call last):
  File "visualize_attention.py", line 162, in <module>
    attentions = model.forward_selfattention(img.to(device))
  File "~/dino/vision_transformer.py", line 246, in forward_selfattention
    x = x + pos_embed
RuntimeError: The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Image details:

    import cv2
    img = cv2.imread('finalImg_249.png')
    print(img.shape)  # output: (427, 488, 3)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

Yes, it happens on one side; I think it is because they introduced a new variable, pos_embed.

https://github.com/facebookresearch/dino/blob/8aa93fdc90eae4b183c4e3c005174a9f634ecfbf/vision_transformer.py#L235-L246

Here patch_pos_embed needs to be padded to go from height 60 to 61, but instead of assigning the padded result back to patch_pos_embed, it is assigned to pos_embed, and therein lies the problem, I believe. Since patch_pos_embed remains 1x384x53x60 instead of 1x384x53x61 (batch, dim, patches_in_width, patches_in_height), after flatten and transpose we get patch_pos_embed with shape 1x3180x384 (60 × 53 = 3180). After concatenating class_pos_embed, we get shape 1x3181x384. However, if it were padded properly, we would go from 1x384x53x61 to 1x3233x384 to 1x3234x384, which is the required shape for adding to x.
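To sanity-check the arithmetic, here is a quick sketch in plain Python. It assumes a patch size of 8 (the default ViT-S/8 checkpoint) and reproduces the two token counts from the error message; the variable names are just for illustration:

```python
# Sketch: why "tensor a" is 3234 but "tensor b" is 3181 (assumed patch size 8).
patch_size = 8
h, w = 427, 488                     # input image shape of finalImg_249.png

# Tokens actually produced from the image: the patch grid is floor(dim / patch).
grid_h = h // patch_size            # 53
grid_w = w // patch_size            # 61
tokens_from_image = grid_h * grid_w + 1   # +1 for the [CLS] token
print(tokens_from_image)            # 3234 -> "tensor a"

# Positional embeddings after the buggy interpolation: one side stays at 60
# instead of 61, because the padded result is never assigned back.
tokens_from_pos_embed = 53 * 60 + 1
print(tokens_from_pos_embed)        # 3181 -> "tensor b"
```

The off-by-one on a single side of the grid is exactly the 3234 vs. 3181 mismatch in the traceback.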

I have explained it here. It is probably a typo in the variable name; we need clarification from the authors. I am training on greyscale images as well, and will let you know how it goes!

I don't think it is because of greyscale; I have tried with greyscale images.

elif os.path.isfile(args.image_path):
    with open(args.image_path, 'rb') as f:
        img = Image.open(f)
        img = img.convert('RGB')

In visualize_attention.py, I think this converts the image to RGB even if it is greyscale.
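A quick check of that behaviour, as a minimal sketch assuming Pillow is installed: converting a single-channel ('L') image to 'RGB' replicates the grey value across three channels, so greyscale inputs should not cause the shape mismatch.

```python
from PIL import Image

# Build a single-channel (greyscale, mode 'L') image in memory, then convert
# it the same way visualize_attention.py does after opening the file.
grey = Image.new('L', (4, 4), color=128)
rgb = grey.convert('RGB')

print(grey.mode, rgb.mode)       # L RGB
print(rgb.getpixel((0, 0)))      # (128, 128, 128): one channel replicated to three
```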