ColossalAI: [BUG]: Incompatibility between colossalai_zero2 and LoRA tuning
🐛 Describe the bug
When I run this script:
torchrun --standalone --nproc_per_node=1 train_sft.py \
--pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
--model 'llama' \
--strategy colossalai_zero2 \
--log_interval 10 \
--save_path trained_models/Coati-7B \
--dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
--batch_size 4 \
--accimulation_steps 1 \
--lr 2e-5 \
--max_epochs 3 \
--lora_rank 8 \
--max_datasets_size 2
I got an error during the backward pass:
Traceback (most recent call last):
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
train(args)
File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 155, in train
trainer.fit(logger=logger, log_interval=args.log_interval)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 110, in fit
self.strategy.optimizer_step(self.optimizer)
File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
optimizer.step()
File "/opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([20277248]) vs torch.Size([288768])
However, when I change --strategy to ddp, it trains normally.
So is there a bug in the colossalai_zero2 implementation, or is it incompatible with LoRA?
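The assertion suggests the ZeRO-2 optimizer flattened the full set of fp32 parameters (20,277,248 elements) while gradients only exist for a much smaller set (288,768 elements), which is what you would expect when lora_rank > 0 freezes most of the base model. A minimal sketch (plain PyTorch, not the coati API; model is a placeholder for the LoRA-wrapped model) to inspect how many parameters are actually trainable before training starts:

import torch.nn as nn

def summarize_trainable(model: nn.Module) -> None:
    # Count parameters that will actually receive gradients vs. frozen ones.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    print(f"trainable: {trainable:,}  frozen: {frozen:,}")

# Hypothetical usage, right after the strategy wraps the model:
# summarize_trainable(model)

If the trainable count is tiny compared to the frozen count, a mismatch between the flat gradient buffer and the flat fp32 parameters under colossalai_zero2 is a plausible cause of the assertion above.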
Environment
I use the Docker image hpcaitech/colossalai:0.2.7.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (1 by maintainers)
When I use ddp, the loss is always NaN.
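A minimal sketch (plain PyTorch, not the coati trainer) for catching the step at which the loss first becomes non-finite, which can help narrow down whether the NaN comes from the data, the learning rate, or fp16 overflow:

import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    # Fail fast with the offending step instead of silently training on NaN.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")

# Hypothetical usage inside the training loop, after the forward pass:
# check_finite(loss, step)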
Regarding "AssertionError: fp32 param and grad have different shape": I have solved this error. I use GLM-10B to train the reward model. The output of 'mems' is used as last_hidden_states, but 'mems' is processed with detach(), which removes it from the computation graph, so gradients cannot propagate back to the model. Therefore, you should verify your last_hidden_states and make sure it is still part of the computation graph.
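A minimal sketch (generic PyTorch modules, not GLM-10B itself) showing why a detached hidden state cuts the base model out of the backward pass, and how to verify that last_hidden_states is still in the computation graph:

import torch
import torch.nn as nn

encoder = nn.Linear(16, 8)   # stand-in for the base model producing hidden states
value_head = nn.Linear(8, 1) # stand-in for the reward head

x = torch.randn(4, 16)
hidden = encoder(x)          # last_hidden_states, still attached to the graph

bad = hidden.detach()        # mimics feeding the detached 'mems' into the head
print(bad.requires_grad, bad.grad_fn)   # False None -> no gradients reach the encoder

good = hidden
print(good.requires_grad, good.grad_fn is not None)  # True True -> still in the graph

value_head(good).sum().backward()
print(encoder.weight.grad is not None)  # True: gradients flow back to the base model

If requires_grad is False and grad_fn is None, the backward pass stops at the reward head and the sharded optimizer sees no gradients for the base model, which is consistent with the shape assertion above.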
I am having the same issue here