ColossalAI: [BUG]: Incompatibility between colossalai_zero2 and LoRA tuning
🐛 Describe the bug
When I run this script:
torchrun --standalone --nproc_per_node=1 train_sft.py \
--pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
--model 'llama' \
--strategy colossalai_zero2 \
--log_interval 10 \
--save_path trained_models/Coati-7B \
--dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
--batch_size 4 \
--accimulation_steps 1 \
--lr 2e-5 \
--max_epochs 3 \
--lora_rank 8 \
--max_datasets_size 2
I got an error during the backward pass:
Traceback (most recent call last):
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
train(args)
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 155, in train
trainer.fit(logger=logger, log_interval=args.log_interval)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 110, in fit
self.strategy.optimizer_step(self.optimizer)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
optimizer.step()
File "/opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([20277248]) vs torch.Size([288768])
However, when I change --strategy to ddp, it trains normally.
So is there a bug in the colossalai_zero2 implementation, or is it incompatible with LoRA?
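The assertion suggests the ZeRO-2 optimizer flattened the full set of fp32 parameters (20,277,248 elements) while gradients only exist for a much smaller set (288,768 elements), which is what you would expect when lora_rank > 0 freezes most of the base model. A minimal sketch (plain PyTorch, not the coati API; model is a placeholder for the LoRA-wrapped model) to inspect how many parameters are actually trainable before training starts:

import torch.nn as nn

def summarize_trainable(model: nn.Module) -> None:
    # Count parameters that will actually receive gradients vs. frozen ones.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    print(f"trainable: {trainable:,}  frozen: {frozen:,}")

# Hypothetical usage, right after the strategy wraps the model:
# summarize_trainable(model)

If the trainable count is tiny compared to the frozen count, a mismatch between the flat gradient buffer and the flat fp32 parameters under colossalai_zero2 is a plausible cause of the assertion above.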
Environment
I use the Docker image hpcaitech/colossalai:0.2.7.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (1 by maintainers)
When I use ddp, the loss is always NaN.
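A minimal sketch (plain PyTorch, not the coati trainer) for catching the step at which the loss first becomes non-finite, which can help narrow down whether the NaN comes from the data, the learning rate, or fp16 overflow:

import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    # Fail fast with the offending step instead of silently training on NaN.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

# Hypothetical usage inside the training loop, after the forward pass:
# check_finite(loss, step)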
Regarding "AssertionError: fp32 param and grad have different shape": I have solved this error. I use GLM-10B to train the reward model. The output of 'mems' is used as last_hidden_states, but 'mems' is processed with detach(), which removes it from the computation graph, so gradients cannot propagate back to the model. Therefore, you should verify your last_hidden_states and make sure it is still part of the computation graph.
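A minimal sketch (generic PyTorch modules, not GLM-10B itself) showing why a detached hidden state cuts the base model out of the backward pass, and how to verify that last_hidden_states is still in the computation graph:

import torch
import torch.nn as nn

encoder = nn.Linear(16, 8)   # stand-in for the base model producing hidden states
value_head = nn.Linear(8, 1) # stand-in for the reward head

x = torch.randn(4, 16)
hidden = encoder(x)          # last_hidden_states, still attached to the graph

bad = hidden.detach()        # mimics feeding the detached 'mems' into the head
print(bad.requires_grad, bad.grad_fn)   # False None -> no gradients reach the encoder

good = hidden
print(good.requires_grad, good.grad_fn is not None)  # True True -> still in the graph

value_head(good).sum().backward()
print(encoder.weight.grad is not None)  # True: gradients flow back to the base model

If requires_grad is False and grad_fn is None, the backward pass stops at the reward head and the sharded optimizer sees no gradients for the base model, which is consistent with the shape assertion above.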
I am having the same issue here