ColossalAI: Cannot train LLaMA-7B due to OOM on 40G A100

GPU: 40G A100 × 8

I want to train the LLaMA-7B model on 40G A100 GPUs, but training fails with an out-of-memory (OOM) error. The training command is:

torchrun --standalone --nproc_per_node=4 examples/train_sft.py \
    --pretrain "**********7B/llama-7b" --model 'llama' --strategy colossalai_zero2 \
    --log_interval 10 --save_path output/Coati-7B --dataset *********/data/merged_file.json \
    --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4

40 GB should be enough for the LLaMA-7B model. When I limit the dataset size with --max_datasets_size 4096, training completes, but the GPU memory usage differs between the beginning of training and near the end.

[Screenshots: GPU memory usage at the beginning stage and near the end of training]
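To quantify the difference between the two stages, one option is to log the per-step peak memory with the standard PyTorch counters. This is only a sketch; the commented loop is illustrative, not the actual train_sft.py loop:

    import torch

    def log_step_memory(step: int) -> None:
        """Log current and peak CUDA memory for this step, then reset the peak counter."""
        cur = torch.cuda.memory_allocated() / 1024**3
        peak = torch.cuda.max_memory_allocated() / 1024**3
        print(f"step {step}: allocated={cur:.2f} GiB, peak={peak:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()

    # Hypothetical use inside the training loop:
    # for step, batch in enumerate(dataloader):
    #     ...  # forward / backward / optimizer step
    #     log_step_memory(step)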

Another question: the W&B (Weights & Biases) prompt has to be confirmed multiple times. Can this process be simplified?


About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 48 (13 by maintainers)

Most upvoted comments

This might be because zero2 does not save enough memory on a 40 GB card; we used 80 GB A100s. To lower memory consumption further, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing them and expect it to work next week; we will update the bug-fix status here.

2023/04/17: We have provided a low-resource example, see the FAQ at https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq. Thanks.

Add WANDB_MODE=disabled before the torchrun command to skip the W&B prompts.
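If changing the launch command is inconvenient, the same environment variable can also be set at the very top of the training script (an alternative not mentioned in the thread; WANDB_MODE=disabled itself is standard wandb behaviour):

    import os

    # Must run before wandb.init() is called anywhere in the script.
    # A shell-level WANDB_MODE still takes precedence because of setdefault.
    os.environ.setdefault("WANDB_MODE", "disabled")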

Hello, while training the BLOOM-7B model I found that an OOM appeared after training for a period of time. May I ask why? @binmakeswell @easonfzw @ver217

This is my training script: [screenshot]

@binmakeswell Is the 'colossalai_zero2_cpu' strategy a complete solution for low-resource machines? Will there be other solutions?

For reference, under the same configuration, colossalai_zero2 takes 5.11 s/it for the first few steps (then OOMs on a 32G V100), while colossalai_zero2_cpu takes 13.08 s/it.

Please help us out here!

Hello, please check my question, thank you @easonfzw

I just verified that it will still OOM with zero2 (cuda placement) even in pure fp16.

For LLaMA-7B, model data will take (2+14/4) * 7 = 38.5 GB memory for each GPU if using 4 GPUs

Thanks! If I understand correctly, "2" is the fp16 parameter memory consumption in GB per billion parameters, and "14" is the gradient plus optimizer-state memory consumption in GB per billion parameters? "14" is much larger than I thought.
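Neither number is broken down further in the thread, but a common mixed-precision Adam accounting (an assumption, not confirmed by the maintainers) reproduces the 38.5 GB figure: the fp16 parameters cost 2 bytes/param on every GPU, while fp16 gradients (2) plus fp32 master weights (4), Adam momentum (4), and Adam variance (4) give the 14 bytes/param that ZeRO-2 shards across the 4 GPUs:

    # Hypothetical breakdown of the (2 + 14/4) * 7 = 38.5 GB estimate above.
    # Standard mixed-precision Adam accounting; not confirmed by the thread.
    PARAMS_B = 7        # LLaMA-7B parameter count, in billions
    N_GPUS = 4          # ZeRO-2 shards gradients + optimizer states across these

    fp16_params   = 2   # bytes/param, replicated on every GPU under ZeRO-2
    fp16_grads    = 2   # bytes/param, sharded
    fp32_master   = 4   # bytes/param, sharded
    adam_momentum = 4   # bytes/param, sharded
    adam_variance = 4   # bytes/param, sharded

    sharded = fp16_grads + fp32_master + adam_momentum + adam_variance  # = 14
    per_gpu_gb = (fp16_params + sharded / N_GPUS) * PARAMS_B  # bytes/param * 1e9 params ~ GB

    print(f"Model data per GPU: {per_gpu_gb:.1f} GB")  # 38.5 GB; activations/buffers excluded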

@Fazziekey @FrankLeeeee Same OOM issue. Also on an A100 40GB, 1 GPU running the LLaMA-7B model, batch=1, max_seq_len=512, colossalai_zero2 with placement_policy='cuda'. I used torch.cuda.memory_allocated() to analyze memory usage: in SFTTrainer, after the step self.optimizer = strategy.setup_optimizer(optim, self.model) runs, 38590.52 MB of CUDA memory is already occupied, and the remaining CUDA memory is clearly not enough to run the data. Is this normal? In addition, with the colossalai_gemini strategy, CUDA memory blows up directly at that same self.optimizer = strategy.setup_optimizer(optim, self.model) step, which feels very strange. Can you give me a solution?

Train script:

CUDA_VISIBLE_DEVICES=2 torchrun --standalone --nproc_per_node=1 train_sft.py \
    --pretrain xxx/llama-7b --model 'llama' --strategy colossalai_zero2 \
    --log_interval 10 --save_path exp/Coati-7B --dataset data/instinwild_en.json \
    --batch_size 1 --accimulation_steps 8 --lr 2e-5 --max_datasets_size 512 \
    --max_epochs 1

It creates the FP32 master weights when the trainer is initialized (i.e., at the strategy.setup_optimizer step). However, the FP32 master weights are sharded when using multiple GPUs, which reduces the per-GPU cost.
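To reproduce the measurement described above, a minimal probe around the optimizer setup could look like the sketch below. The strategy, optim, and model objects are stand-ins for whatever train_sft.py has already built; only the torch.cuda counters are standard API:

    import torch

    def report_cuda_memory(tag: str) -> None:
        """Print current and peak allocated CUDA memory (MB) for the local device."""
        allocated = torch.cuda.memory_allocated() / 1024**2
        peak = torch.cuda.max_memory_allocated() / 1024**2
        print(f"[{tag}] allocated={allocated:.2f} MB, peak={peak:.2f} MB")

    # Hypothetical placement inside the SFT training script:
    report_cuda_memory("before setup_optimizer")
    optimizer = strategy.setup_optimizer(optim, model)  # the step reported as ~38590 MB above
    report_cuda_memory("after setup_optimizer")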

We are trying to train the 30B LLaMA model on 8×8 80G A100 cards and find that colossalai_gemini doesn't work. Looking forward to your fix, as well as training hyper-parameters for larger models (30B/65B), thanks~

For larger tasks like this, you can contact our commercial team or me directly, which may help you solve this problem faster.

8 × V100 32G hits the same OOM.

Same OOM issue here.

Setting placement_policy='cpu' can alleviate this issue.
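For reference, a sketch of what the CPU placement might look like in code, assuming a coati-style strategy class as used by train_sft.py. The class name and import path are assumptions; only the strategy names and placement_policy values come from this thread, so check your version of the script:

    # Hypothetical: constructing the strategy with CPU offloading.
    # 'cuda' keeps the sharded optimizer states on GPU (faster, more GPU memory);
    # 'cpu' offloads them to host RAM (roughly 2-3x slower per step, per the
    # 5.11 s/it vs 13.08 s/it numbers above, but avoids the OOM on 32-40 GB cards).
    from coati.trainer.strategies import ColossalAIStrategy  # assumed import path

    strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')  # ~ colossalai_zero2_cpu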