LLaVA: Error while tuning LLaVA-Lightning

When did you clone our code?

I cloned the codebase after 5/1/23.

Describe the issue

Issue: Visual instruction tuning crashes at the first training step with `RuntimeError: GET was unable to find an engine to execute this computation`, raised from the CLIP vision tower's patch-embedding convolution (full log below).

Command:

#!/bin/bash

WEIGHT_VERSION=1


# Visual instruction tuning (1 hour)
srun -p rdbp1_a100_80g -n1 -N 1 --gres=gpu:1 \
torchrun --nnodes=1 --nproc_per_node=1 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path /mnt/lustre/share_data/zhangzhao2/VG/ckpt/llava/llava_v1/7B \
    --version $WEIGHT_VERSION \
    --data_path /mnt/lustre/share_data/zhangzhao2/VG/instruction_data/LLaVA-Instruct-150K/llava_instruct_80k.json \
    --image_folder /mnt/lustre/share_data/dongzhiwei1/coco2014/train2014 \
    --vision_tower /mnt/lustre/share_data/zhangzhao2/VG/ckpt/openai/clip-vit-large-patch14 \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none
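The launch script enables both --bf16 and --tf32. Both paths require an Ampere-class GPU (compute capability 8.0 or newer); on older hardware or mismatched cuDNN builds they can surface later as the "unable to find an engine" failure seen in the log. A minimal pre-flight check (my own sketch, not part of the LLaVA repo; assumes the training node's GPU 0 is the one srun will allocate):

```python
import torch

def bf16_tf32_report():
    """Return (compute_capability, bf16_ok, tf32_ok) for GPU 0, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    # bf16 and TF32 both require Ampere (compute capability 8.0) or newer.
    return ((major, minor), torch.cuda.is_bf16_supported(), major >= 8)

print(bf16_tf32_report())
```

If this prints False for either flag, dropping --bf16/--tf32 (or switching to --fp16) is worth trying before digging deeper.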

Log:

Loading checkpoint shards: 100%|          | 2/2 [01:31<00:00, 45.96s/it]
WARNING:root:Loading data...
WARNING:root:Formatting inputs...Skip in lazy mode
/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
  warnings.warn(
  0%|          | 0/5000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/train/train.py", line 569, in train
    trainer.train()
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 2731, in compute_loss
    outputs = model(**inputs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/model/llava.py", line 218, in forward
    outputs = self.model(
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/model/llava.py", line 126, in forward
    image_forward_outs = vision_tower(images, output_hidden_states=True)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 934, in forward
    return self.vision_model(
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 859, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 195, in forward
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation
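The traceback shows the error is raised by F.conv2d inside CLIP's patch embedding, which typically means cuDNN could not select a convolution engine for the input dtype (e.g. bf16 images on a cuDNN build without a bf16 conv engine). One common workaround is to make sure the image tensor matches the conv weight's dtype before the call. The snippet below is an illustration with a hypothetical stand-in conv, not the actual LLaVA code path:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for CLIP ViT-L/14's patch embedding:
# a 14x14-stride conv over 3x224x224 images producing a 16x16 grid.
patch_embedding = nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=False)
images = torch.randn(2, 3, 224, 224)

# Casting the input to the conv weight's dtype avoids feeding cuDNN a
# dtype combination it has no engine for (the usual trigger of this error).
patch_embeds = patch_embedding(images.to(patch_embedding.weight.dtype))
print(tuple(patch_embeds.shape))  # (2, 1024, 16, 16)
```

In the real run, the equivalent check is that the `images` tensor handed to the vision tower is in the same dtype the CLIP weights were loaded in.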

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@zzhanghub Great to hear that it works out! And thank you for your kind words. Btw, please keep the transformers version pinned to the one we list in the README, as other versions may lead to strange issues 😃
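To verify the pin the maintainer mentions, comparing the installed version against the one in the README is a one-liner. A stdlib-only sketch (the pinned version itself lives in the repo README and is not reproduced here):

```python
from importlib import metadata

# Print the installed transformers version to compare against the README pin.
try:
    print(metadata.version("transformers"))
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```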