transformers: Unable to use BLIP2 with caption_coco_opt6.7b at HEAD via salesforce-lavis (also HEAD)

System Info

working:

  • transformers version: 4.26.1
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

broken:

  • transformers version: 4.27.0.dev0
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@gante @NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Start with a clean env set up via https://github.com/salesforce/LAVIS/blob/main/requirements.txt (transformers-4.26.1)
  2. Run python test_simple.py (listed below); the model loads correctly and prints a caption
  3. pip install --upgrade git+https://github.com/huggingface/transformers (I wanted the shiny new blip2 conversion script so I can convert my finetuned model into HF format)
  4. Resolved https://github.com/huggingface/transformers to commit 8b3db33a763ccef828fca89bac7e6cbff314f131
  5. Run python test_simple.py
  6. RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.
test_simple.py:

import torch
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the LAVIS BLIP-2 OPT-6.7B COCO captioning model and its image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
)

url = "..."
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the preprocessed image.
data = model.generate({"image": image})
print(data)

Expected behavior

Being able to use BLIP2 (loaded via salesforce-lavis) with the latest HF transformers.

Most upvoted comments

I have a PR here which aims to further verify equivalence: https://github.com/huggingface/transformers/pull/24854.

The conversion script can be found here and can be run as follows:

pip install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32
git clone -b improve_blip2 https://github.com/nielsrogge/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"

The reason I forked LAVIS is to make sure I can compare both implementations using float32.
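
For the OPT-based COCO captioning checkpoint this issue is about, the invocation would presumably look like the sketch below; the model name string and dump folder are assumptions on my part, so check the choices accepted by the script's --model_name argument:

python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-opt-6.7b-coco" --pytorch_dump_folder_path ./blip2-opt-6.7b-coco-hf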

@gante thank you for debugging!

I can confirm that syncing to edc1e734bfc01109b8c66881d950ebbda032a6d2, i.e. before https://github.com/huggingface/transformers/pull/21405, works; I'll open an issue on the Salesforce side to warn them about the breakage. Unfortunately, this brings me back to the original problem of trying to use convert_blip_2_original_to_pytorch.py. Perhaps you can help me figure out how the BLIP2 models were converted? (I understand this is irrelevant to most users, only to the few brave souls who are finetuning BLIP2 via LAVIS but then want to load it in HF.)
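
Concretely, pinning transformers to that pre-#21405 commit can be done with pip's VCS syntax (adding --force-reinstall if a build is already installed):

pip install --force-reinstall git+https://github.com/huggingface/transformers.git@edc1e734bfc01109b8c66881d950ebbda032a6d2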

I’ve tried both pip install git+https://github.com/nielsrogge/LAVIS.git@fix_lavis (mentioned in the script) and lavis from HEAD, but I am getting this traceback:

$ python ./convert_blip_2_original_to_pytorch.py
Loading original model...
Position interpolate from 16x16 to 26x26
tokenizer facebook/opt-6.7b
Loading checkpoint shards: Done!
Traceback (most recent call last):
  File "./convert_blip_2_original_to_pytorch.py", line 304, in <module>
    convert_blip2_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/.../envs/lavis/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "./convert_blip_2_original_to_pytorch.py", line 216, in convert_blip2_checkpoint
    original_logits = original_logits.logits
AttributeError: 'dict' object has no attribute 'logits'

(Indeed, original_logits here is a dictionary containing only 'loss'.)

What combination of transformers and lavis versions was used during the conversion?

After some digging, we can see that the exception is raised as follows:

│ /home/joao/hf/lib/python3.10/site-packages/lavis/models/blip2_models/modeling_opt.py:703 in      │
│ forward                                                                                          │
│                                                                                                  │
│    700 │   │   │   inputs_embeds = self.embed_tokens(input_ids)                                  │
│    701 │   │                                                                                     │
│    702 │   │   if query_embeds is not None:                                                      │
│ ❱  703 │   │   │   inputs_embeds = torch.cat([query_embeds, inputs_embeds], dim=1)               │
│    704 │   │   │   input_shape = inputs_embeds.size()[:-1]                                       │
│    705 │   │                                                                                     │
│    706 │   │   # embed positions                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.

From the full stack trace, we can conclude that the error arises from an issue in lavis, and not in transformers 😃 Actually, the root cause is something that we have addressed in this PR: lavis has a different implementation, with a modified OPT model that handles the image embeddings as query_embeds, whereas we decided to update .generate() to handle soft-prompting instead.

@AstraliteHeart This means you have two options:

  1. Update your code to rely on transformers, as opposed to lavis. See here for examples, and the sketch right after this list.
  2. Open an issue in lavis, so they can help you with this issue 😃
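
For option 1, here is a minimal sketch of what a transformers-only equivalent of test_simple.py could look like, assuming the Salesforce/blip2-opt-6.7b-coco checkpoint on the Hub (swap in your own converted checkpoint path once the conversion works; the float16 parts assume a CUDA GPU):

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# The processor bundles the image preprocessor and the OPT tokenizer.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b-coco", torch_dtype=torch.float16
).to(device)

url = "..."  # same image URL as in test_simple.py
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=raw_image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())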

Thank you for confirming @AstraliteHeart 🤗 I will dig deeper and let you know what I find!

Hey @AstraliteHeart 👋 This issue seems to be a duplicate of https://github.com/huggingface/transformers/issues/21599, which is fixed.

Can I ask you to try running your script with the transformers main branch, i.e. after installing it with pip install --upgrade git+https://github.com/huggingface/transformers.git?