transformers: Unable to use BLIP2 with caption_coco_opt6.7b at HEAD via salesforce-lavis (also HEAD)

System Info

working:

  • transformers version: 4.26.1
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

broken:

  • transformers version: 4.27.0.dev0
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@gante @NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Start with a clean env set up via https://github.com/salesforce/LAVIS/blob/main/requirements.txt (transformers-4.26.1)
  2. Run python test_simple.py (listed below); the model loads correctly and prints a caption
  3. pip install --upgrade git+https://github.com/huggingface/transformers (I wanted the shiny new blip2 conversion script so I can convert my finetuned model into HF format)
  4. Resolved https://github.com/huggingface/transformers to commit 8b3db33a763ccef828fca89bac7e6cbff314f131
  5. Run python test_simple.py
  6. RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.
test_simple.py:

import torch
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the LAVIS BLIP-2 OPT-6.7B COCO captioning model and its image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
)

url = "..."
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the preprocessed image.
data = model.generate({"image": image})
print(data)

Expected behavior

Being able to use BLIP2 (loaded via salesforce-lavis) with the latest HF transformers.

Most upvoted comments

I have a PR here which aims to further verify equivalence: https://github.com/huggingface/transformers/pull/24854.

The conversion script can be found here and can be run as follows:

pip install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32
git clone -b improve_blip2 https://github.com/nielsrogge/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"

The reason I forked LAVIS is to make sure I can compare both implementations using float32.
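
For the OPT-based COCO captioning checkpoint this issue is about, the invocation would presumably look like the sketch below; the model name string and dump folder are assumptions on my part, so check the choices accepted by the script's --model_name argument:

python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-opt-6.7b-coco" --pytorch_dump_folder_path ./blip2-opt-6.7b-coco-hf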

@gante thank you for debugging!

I can confirm that syncing to edc1e734bfc01109b8c66881d950ebbda032a6d2, i.e. before https://github.com/huggingface/transformers/pull/21405, works; I'll open an issue on the Salesforce side to warn them about the breakage. Unfortunately, this brings me back to the original problem of trying to use convert_blip_2_original_to_pytorch.py. Perhaps you can help me figure out how the BLIP2 models were converted? (I understand this is irrelevant to most users, only to the few brave souls who are finetuning BLIP2 via LAVIS but then want to load it in HF.)
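
Concretely, pinning transformers to that pre-#21405 commit can be done with pip's VCS syntax (adding --force-reinstall if a build is already installed):

pip install --force-reinstall git+https://github.com/huggingface/transformers.git@edc1e734bfc01109b8c66881d950ebbda032a6d2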

I’ve tried both pip install git+https://github.com/nielsrogge/LAVIS.git@fix_lavis (mentioned in the script) and lavis from HEAD, but I am getting this traceback:

$ python ./convert_blip_2_original_to_pytorch.py
Loading original model...
Position interpolate from 16x16 to 26x26
tokenizer facebook/opt-6.7b
Loading checkpoint shards: Done!
Traceback (most recent call last):
  File "./convert_blip_2_original_to_pytorch.py", line 304, in <module>
    convert_blip2_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/.../envs/lavis/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "./convert_blip_2_original_to_pytorch.py", line 216, in convert_blip2_checkpoint
    original_logits = original_logits.logits
AttributeError: 'dict' object has no attribute 'logits'

(Indeed, original_logits here is a dictionary containing only 'loss'.)

What combination of transformers and lavis versions was used during the conversion?

After some digging, we can see that the exception is raised as follows:

│ /home/joao/hf/lib/python3.10/site-packages/lavis/models/blip2_models/modeling_opt.py:703 in      │
│ forward                                                                                          │
│                                                                                                  │
│    700 │   │   │   inputs_embeds = self.embed_tokens(input_ids)                                  │
│    701 │   │                                                                                     │
│    702 │   │   if query_embeds is not None:                                                      │
│ ❱  703 │   │   │   inputs_embeds = torch.cat([query_embeds, inputs_embeds], dim=1)               │
│    704 │   │   │   input_shape = inputs_embeds.size()[:-1]                                       │
│    705 │   │                                                                                     │
│    706 │   │   # embed positions                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.

From the full stack trace, we can conclude that the error arises from an issue in lavis, and not in transformers 😃 Actually, the root cause is something that we have addressed in this PR: lavis has a different implementation, with a modified OPT model that handles the image embeddings as query_embeds, whereas we decided to update .generate() to handle soft-prompting instead.

@AstraliteHeart This means you have two options:

  1. Update your code to rely on transformers, as opposed to lavis. See here for examples, and the sketch right after this list.
  2. Open an issue in lavis, so they can help you with this issue 😃
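
For option 1, here is a minimal sketch of what a transformers-only equivalent of test_simple.py could look like, assuming the Salesforce/blip2-opt-6.7b-coco checkpoint on the Hub (swap in your own converted checkpoint path once the conversion works; the float16 parts assume a CUDA GPU):

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# The processor bundles the image preprocessor and the OPT tokenizer.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b-coco", torch_dtype=torch.float16
).to(device)

url = "..."  # same image URL as in test_simple.py
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=raw_image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())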

Thank you for confirming @AstraliteHeart 🤗 I will dig deeper and let you know what I find!

Hey @AstraliteHeart 👋 This issue seems to be a duplicate of https://github.com/huggingface/transformers/issues/21599, which is fixed.

Can I ask you to try running your script with the transformers main branch, i.e. after installing it with pip install --upgrade git+https://github.com/huggingface/transformers.git?