ColossalAI: Cannot train llama-7b due to OOM on 40G A100
GPU: 40G A100 ×8
I want to train the LLaMA 7B model on a 40G A100, but it runs out of GPU memory (OOM). The training command is:
torchrun --standalone --nproc_per_node=4 examples/train_sft.py --pretrain "**********7B/llama-7b" --model 'llama' --strategy colossalai_zero2 --log_interval 10 --save_path output/Coati-7B --dataset *********/data/merged_file.json --batch_size 1 --accimulation_steps 4 --lr 2e-5 --max_epochs 1 --lora_rank 4

40 GB should be enough for the LLaMA 7B model. When I limit the dataset size like this:
--max_datasets_size 4096
the training process completes, but GPU memory usage differs between the beginning stage and near the end.
(Screenshots of GPU memory usage at the beginning stage and at the ending stage.)

Another question: the W&B (Weights & Biases) prompt has to be confirmed multiple times. Can this process be simplified?

About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 48 (13 by maintainers)
This might be because zero2 does not save enough memory on a 40 GB card; we used 80 GB A100s. To lower memory consumption further, you should use the colossalai_gemini strategy, but it currently has some bugs. We are fixing them and expect it to work next week; we will update the bug-fix status here.
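For reference, here is a rough sketch of how the --strategy flag is typically mapped inside examples/train_sft.py (reconstructed from memory of the Coati code at that time; the exact import path and constructor arguments are assumptions and may differ in your version):

```python
from coati.trainer.strategies import ColossalAIStrategy  # assumed import path

def build_strategy(name: str):
    """Sketch of the --strategy mapping; argument names are assumptions."""
    if name == 'colossalai_zero2':
        # ZeRO stage 2: optimizer states and gradients are sharded,
        # fp16 parameters stay replicated on every GPU.
        return ColossalAIStrategy(stage=2, placement_policy='cuda')
    if name == 'colossalai_gemini':
        # Gemini: ZeRO stage 3 with chunk-based, heterogeneous memory
        # management, which is what reduces per-GPU memory further.
        return ColossalAIStrategy(stage=3, placement_policy='cuda')
    raise ValueError(f'unsupported strategy: {name}')
```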
2023/04/17: we have provided a low-resource example, see https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#faq. Thanks.
Add WANDB_MODE=disabled before torchrun.

Hello, while training a Bloom-7B model, I found that OOM appeared after training for a while. May I ask why? @binmakeswell @easonfzw @ver217
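Regarding the W&B tip above: if editing the launch command is inconvenient, the same effect can be achieved from Python, since WANDB_MODE=disabled is a documented W&B environment variable. A minimal sketch (it must run before wandb is initialized, e.g. at the top of train_sft.py):

```python
import os

# Disable Weights & Biases logging for this process.
# Must be set before wandb.init() is called anywhere in the training script.
os.environ["WANDB_MODE"] = "disabled"
```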

This is my training script:
For reference, under the same configuration, colossalai_zero2 runs at 5.11 s/it for the first few steps (then OOMs on a 32 GB V100), while colossalai_zero2_cpu runs at 13.08 s/it.
Heal our children!
Thanks! If I understand correctly, the "2" is the parameter memory consumption per billion parameters in fp16, and the "14" is the optimizer-state and gradient memory consumption per billion parameters? The "14" is much larger than I expected.
For LLaMA-7B, model data will take (2 + 14/4) × 7 = 38.5 GB of memory per GPU when using 4 GPUs. The FP32 master weights are created after the trainer is initialized; however, they are sharded across devices when using multiple GPUs.
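To make the arithmetic explicit, here is a small back-of-the-envelope calculator for the "model data" portion under ZeRO-2 (fp16 weights replicated at 2 bytes/param; fp32 master weights, Adam states, and gradients at 14 bytes/param, sharded across GPUs). Activations and buffers come on top of this:

```python
def zero2_model_data_gb(params_in_billions: float, num_gpus: int) -> float:
    """Approximate per-GPU 'model data' memory (GB) for mixed-precision Adam under ZeRO-2."""
    fp16_weights = 2.0 * params_in_billions                 # replicated on every GPU
    sharded_states = 14.0 * params_in_billions / num_gpus   # fp32 master weights + Adam states + grads, sharded
    return fp16_weights + sharded_states

print(zero2_model_data_gb(7, 4))  # (2 + 14/4) * 7 = 38.5 GB per GPU, matching the estimate above
print(zero2_model_data_gb(7, 8))  # ~26.25 GB per GPU, still tight on 32 GB V100s once activations are added
```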
For larger tasks like this, you can contact our commercial team or me directly, which may help you solve this problem faster.
I got the same OOM with 8× V100 32 GB.
Setting placement_policy='cpu' can alleviate this issue.
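As a sketch of what that looks like (again assuming the ColossalAIStrategy constructor shown earlier; verify against your version), CPU placement keeps optimizer states in host memory at the cost of slower steps, which is consistent with the zero2 vs. zero2_cpu timings reported above:

```python
from coati.trainer.strategies import ColossalAIStrategy  # assumed import path

# ZeRO-2 with optimizer states offloaded to CPU: roughly what the
# 'colossalai_zero2_cpu' strategy name selects. Slower per step, but it
# trades GPU memory for host memory.
strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')
```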