ColossalAI: [BUG]: Incompatible between colossalai_zero2 and LoRA tuning

πŸ› Describe the bug

When I run this script:

torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path trained_models/Coati-7B \
    --dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
    --batch_size 4 \
    --accimulation_steps 1 \
    --lr 2e-5 \
    --max_epochs 3 \
    --lora_rank 8 \
    --max_datasets_size 2

I got an error during backward:

Traceback (most recent call last):
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
    train(args)
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 155, in train
    trainer.fit(logger=logger, log_interval=args.log_interval)
  File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 110, in fit
    self.strategy.optimizer_step(self.optimizer)
  File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/strategies/colossalai.py", line 154, in optimizer_step
    optimizer.step()
  File "/opt/conda/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/colossalai/zero/sharded_optim/low_level_optim.py", line 467, in step
    assert param_shape == flat_fp32_avg_grads.shape, \
AssertionError: fp32 param and grad have different shape torch.Size([20277248]) vs torch.Size([288768])

However, when I change --strategy to ddp, it trains normally.
So is there a bug in the colossalai_zero2 implementation, or is it incompatible with LoRA?

Environment

I use the Docker image hpcaitech/colossalai:0.2.7.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

Additionally, when I use ddp and LoRA, there is also a bug when saving the checkpoint. Script:

torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain "/root/zl/download/pretrained/llama_7b_hf/" \
    --model 'llama' \
    --strategy ddp \
    --log_interval 10 \
    --save_path trained_models/Coati-7B \
    --dataset /root/zl/code/InstructionWild/data/instinwild_en.json \
    --batch_size 4 \
    --accimulation_steps 1 \
    --lr 2e-5 \
    --max_epochs 3 \
    --lora_rank 8 \
    --max_datasets_size 10

Traceback:

Traceback (most recent call last):
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 184, in <module>
    train(args)
  File "/data2/zl/code/ColossalAI/applications/Chat/train_sft.py", line 158, in train
    trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
  File "/data2/zl/code/ColossalAI/applications/Chat/coati/trainer/sft.py", line 158, in save_model
    self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
TypeError: save_model() got an unexpected keyword argument 'tokenizer'

I guess you can solve this one with this #3357 (comment)

According to the traceback message, I handled it simply by editing ColossalAI/applications/Chat/train_sft.py around line 158:

    if args.strategy == "ddp":
        trainer.save_model(path=args.save_path, only_rank0=True)
    else:
        trainer.save_model(path=args.save_path, only_rank0=True, tokenizer=tokenizer)
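
A slightly more defensive variant of the same workaround (just a sketch, assuming trainer.strategy is reachable from train_sft.py as the traceback suggests, and reusing the existing args, trainer, and tokenizer variables at that call site) checks whether the active strategy's save_model accepts a tokenizer instead of hard-coding the strategy name:

    import inspect

    # Pass `tokenizer` through only if the wrapped strategy's save_model supports it.
    save_kwargs = dict(path=args.save_path, only_rank0=True)
    if "tokenizer" in inspect.signature(trainer.strategy.save_model).parameters:
        save_kwargs["tokenizer"] = tokenizer
    trainer.save_model(**save_kwargs)

That way the call keeps working if other strategies later add or drop the tokenizer argument.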

When I use ddp, the loss is always NaN.

Regarding "AssertionError: fp32 param and grad have different shape": I have solved this error. I was using GLM-10B to train a reward model. The 'mems' output was used as last_hidden_states, but 'mems' is detached, which removes it from the computation graph, so gradients cannot flow back into the model. Therefore, you should check your last_hidden_states and make sure it is still part of the computation graph.
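
For anyone hitting the same assertion, a minimal check (a sketch with illustrative names, not the actual GLM or Coati code) is to verify that the tensor fed into the value/reward head still carries autograd history:

    import torch

    def assert_in_graph(t: torch.Tensor, name: str = "last_hidden_states"):
        # A tensor produced via .detach() has grad_fn=None and requires_grad=False,
        # so no gradients can flow back through it into the backbone.
        if t.grad_fn is None and not t.requires_grad:
            raise RuntimeError(f"{name} is detached from the computation graph")

    # Example usage inside the reward model's forward():
    #   hidden = outputs.mems[-1]   # detached in GLM, so this check would fail
    #   assert_in_graph(hidden)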

I am having the same issue here