dino: Error using visualize_attention.py. The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Hi all, I am trying to execute visualize_attention.py with the default pretrained weights on my own image, as below:

!python visualize_attention.py --image_path 'test/finalImg_249.png'

I get a size mismatch error. Could you please let me know what changes need to be made here?

Error stack trace:

Please use the --pretrained_weights argument to indicate the path of the checkpoint to evaluate. Since no pretrained weights have been provided, we load the reference pretrained DINO weights.

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3458: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:3503: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.

Traceback (most recent call last):
  File "visualize_attention.py", line 162, in <module>
    attentions = model.forward_selfattention(img.to(device))
  File "~/dino/vision_transformer.py", line 246, in forward_selfattention
    x = x + pos_embed
RuntimeError: The size of tensor a (3234) must match the size of tensor b (3181) at non-singleton dimension 1

Image details:

    import cv2
    img = cv2.imread('finalImg_249.png')
    print(img.shape)  # output: (427, 488, 3)

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

Yes, it happens on one side; I think it is because they introduced a new variable, pos_embed.

https://github.com/facebookresearch/dino/blob/8aa93fdc90eae4b183c4e3c005174a9f634ecfbf/vision_transformer.py#L235-L246

Here patch_pos_embed needs to be padded to go from height 60 to 61, but instead of assigning the padded result back to patch_pos_embed, it is assigned to pos_embed, and therein lies the problem, I believe. Since patch_pos_embed remains 1x384x53x60 instead of 1x384x53x61 (batch, dim, patches_in_width, patches_in_height), after flatten and transpose we get patch_pos_embed with shape 1x3180x384 (60 × 53 = 3180). After concatenating class_pos_embed, we get shape 1x3181x384. However, if it were padded properly, we would go from 1x384x53x61 to 1x3233x384 to 1x3234x384, which is the required shape for adding to x.
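To sanity-check the arithmetic, here is a quick sketch in plain Python. It assumes a patch size of 8 (the default ViT-S/8 checkpoint) and reproduces the two token counts from the error message; the variable names are just for illustration:

```python
# Sketch: why "tensor a" is 3234 but "tensor b" is 3181 (assumed patch size 8).
patch_size = 8
h, w = 427, 488                     # input image shape of finalImg_249.png

# Tokens actually produced from the image: the patch grid is floor(dim / patch).
grid_h = h // patch_size            # 53
grid_w = w // patch_size            # 61
tokens_from_image = grid_h * grid_w + 1   # +1 for the [CLS] token
print(tokens_from_image)            # 3234 -> "tensor a"

# Positional embeddings after the buggy interpolation: one side stays at 60
# instead of 61, because the padded result is never assigned back.
tokens_from_pos_embed = 53 * 60 + 1
print(tokens_from_pos_embed)        # 3181 -> "tensor b"
```

The off-by-one on a single side of the grid is exactly the 3234 vs. 3181 mismatch in the traceback.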

I have explained it here. It is probably a typo in the variable name; we need clarification from the authors. I am training on greyscale images as well, and will let you know how it goes!

I don't think it is because of greyscale; I have tried with greyscale images.

elif os.path.isfile(args.image_path):
    with open(args.image_path, 'rb') as f:
        img = Image.open(f)
        img = img.convert('RGB')

In visualize_attention.py, I think this converts the image to RGB even if it is greyscale.
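A quick check of that behaviour, as a minimal sketch assuming Pillow is installed: converting a single-channel ('L') image to 'RGB' replicates the grey value across three channels, so greyscale inputs should not cause the shape mismatch.

```python
from PIL import Image

# Build a single-channel (greyscale, mode 'L') image in memory, then convert
# it the same way visualize_attention.py does after opening the file.
grey = Image.new('L', (4, 4), color=128)
rgb = grey.convert('RGB')

print(grey.mode, rgb.mode)       # L RGB
print(rgb.getpixel((0, 0)))      # (128, 128, 128): one channel replicated to three
```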