LLaVA: Error while tuning LLaVA-Lightning
When did you clone our code?
I cloned the code base after 5/1/23
Describe the issue
Issue: Visual instruction tuning fails at the very first training step with `RuntimeError: GET was unable to find an engine to execute this computation`, raised from the conv2d patch embedding of the CLIP vision tower (full traceback below).
Command:
#!/bin/bash
WEIGHT_VERSION=1
# Visual instruction tuning (1 hour)
srun -p rdbp1_a100_80g -n1 -N 1 --gres=gpu:1 \
torchrun --nnodes=1 --nproc_per_node=1 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /mnt/lustre/share_data/zhangzhao2/VG/ckpt/llava/llava_v1/7B \
--version $WEIGHT_VERSION \
--data_path /mnt/lustre/share_data/zhangzhao2/VG/instruction_data/LLaVA-Instruct-150K/llava_instruct_80k.json \
--image_folder /mnt/lustre/share_data/dongzhiwei1/coco2014/train2014 \
--vision_tower /mnt/lustre/share_data/zhangzhao2/VG/ckpt/openai/clip-vit-large-patch14 \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--bf16 True \
--output_dir ./checkpoints \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to none
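The script enables both `--bf16 True` and `--tf32 True`, so the vision tower's convolutions run under bf16/TF32 kernels. As a hedged sanity check (this snippet is not part of the original report), one can verify that the GPU, driver, and cuDNN build can actually execute a bf16 conv2d with the same shape as CLIP ViT-L/14's patch embedding before launching the full job:

```python
# Sanity check (illustrative, not from the original report): can this
# environment run a bf16 conv2d like CLIP ViT-L/14's patch embedding?
import torch

print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

# Mirror the training flags: --tf32 True enables TF32 kernels.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Same kind of op that fails in the traceback below: 3 -> 1024 channels, 14x14 patches.
conv = torch.nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=False).cuda().to(torch.bfloat16)
x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
print("conv2d output shape:", tuple(conv(x).shape))
```

If this already raises the same `RuntimeError`, the problem is in the torch/cuDNN/driver stack rather than in the training script itself.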
Log:
Loading checkpoint shards: 100%| | 2/2 [01:31<00:00, 45.96s/it]
WARNING:root:Loading data...
WARNING:root:Formatting inputs...Skip in lazy mode
/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
warnings.warn(
0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/train/train_mem.py", line 13, in <module>
train()
File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/train/train.py", line 569, in train
trainer.train()
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 2699, in training_step
loss = self.compute_loss(model, inputs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/trainer.py", line 2731, in compute_loss
outputs = model(**inputs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/model/llava.py", line 218, in forward
outputs = self.model(
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/codes/LLaVA/llava/model/llava.py", line 126, in forward
image_forward_outs = vision_tower(images, output_hidden_states=True)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 934, in forward
return self.vision_model(
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 859, in forward
hidden_states = self.embeddings(pixel_values)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 195, in forward
patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/mnt/cache/zhangzhao2/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation
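The traceback shows the failure inside `F.conv2d` of the CLIP patch embedding, i.e. the very first op of the vision tower, before any LLaVA-specific code runs. A minimal sketch to reproduce it outside the Trainer (the checkpoint path is copied from the command above; everything else is an assumption) could look like this:

```python
# Standalone repro sketch (assumptions: path taken from the command above, 224x224 inputs).
import torch
from transformers import CLIPVisionModel

vision_tower_path = "/mnt/lustre/share_data/zhangzhao2/VG/ckpt/openai/clip-vit-large-patch14"

# Load only the vision tower in bf16 on CUDA, as the training run does.
model = CLIPVisionModel.from_pretrained(vision_tower_path, torch_dtype=torch.bfloat16).cuda()
model.eval()

# Dummy batch with the shape the dataloader would produce for ViT-L/14.
pixel_values = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    out = model(pixel_values, output_hidden_states=True)

# Training selects the penultimate hidden layer (--mm_vision_select_layer -2).
print("hidden_states[-2] shape:", tuple(out.hidden_states[-2].shape))
```

If the same error reproduces here, the fix is on the environment side (torch/cuDNN/driver versions, or bf16 support of the GPU), which is consistent with the maintainers' advice below to stick to the versions pinned in the README.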
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (9 by maintainers)
@zzhanghub Great to hear that it works out! And thank you for your kind words. Btw, please still keep the `transformers` version pinned to the one we include in the README, as other versions may lead to strange issues 😃
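For anyone hitting the same error, a quick way to check whether the installed versions match the pins in the README (the expected version numbers are not reproduced here) is:

```python
# Print the versions that matter for this issue and compare them to the README pins.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```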