DeepSpeed: [BUG] Trying to fine-tune LLaMA 33B on 8×A100 40G with 600GB RAM, but always OOM on RAM
I am fine-tuning the LLaMA 33B model on a server with 8×A100 40G GPUs and 600GB of RAM, but I keep running into OOM on RAM. I am mainly using the default ZeRO-3 config template:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
I have tried modifying this config: keeping parameters on the GPU and offloading only the optimizer to the CPU, and keeping parameters on the GPU and offloading only the optimizer to NVMe. None of these attempts succeeded; they all still run out of RAM. Do you have any suggestions for my situation?
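For concreteness, here is a minimal sketch (not my exact script) of the optimizer-to-NVMe variant I mean, passing the ZeRO-3 config to the HF Trainer as a Python dict. The NVMe path, output directory, and batch sizes are placeholders.

# Minimal sketch, not my exact script: ZeRO-3 with only the optimizer offloaded
# (here to NVMe). "/local_nvme", "out", and the batch sizes are placeholders.
from transformers import TrainingArguments

ds_config = {
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto",
                   "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 3,
        # Only the optimizer states leave the GPU; there is no "offload_param" block.
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",   # placeholder path
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed=ds_config,               # the HF Trainer also accepts a path to a JSON file here
)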
About this issue
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 38 (4 by maintainers)
Hi @s1ghhh and @memray, you can check my general scripts here if you still need them: https://github.com/LuJunru/LLM_SFT/tree/main
@memray
Hi Rui,
You can refer to https://github.com/tatsu-lab/stanford_alpaca. I used the Trainer class from HF to load the model and just passed --deepspeed to enable the DeepSpeed plugin. Hope this helps!
Junru
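In case it helps, a minimal sketch of what that HF Trainer + --deepspeed wiring looks like. This is not the Alpaca training script itself; the script name, model checkpoint, and toy dataset are placeholders.

# Minimal sketch of the HF Trainer + --deepspeed wiring (not the Alpaca training
# script itself; the checkpoint name and toy dataset are placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer, HfArgumentParser,
                          Trainer, TrainingArguments)

# Launch with e.g.:
#   deepspeed train.py --deepspeed zero3.json --output_dir out --bf16 True \
#       --per_device_train_batch_size 1 --gradient_accumulation_steps 16
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")     # placeholder
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-30b")  # placeholder

# Toy dataset so the sketch is self-contained; identical strings, so no padding is needed.
ids = tokenizer("Hello world.", return_tensors="pt")["input_ids"][0]
dataset = [{"input_ids": ids, "labels": ids} for _ in range(8)]

Trainer(model=model, args=training_args, train_dataset=dataset).train()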
@memray
Full-model tuning
@memray I ran into similar issues before. In my case, it was caused by the environment variable CUDA_LAUNCH_BLOCKING=1; not sure about your case. I fine-tuned Vicuna 33B.
@memray Exactly. I used DeepSpeed ZeRO-3 offloads + FlashAttention.
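For readers wondering how to enable FlashAttention, a minimal sketch of one way to do it in recent transformers versions. This assumes transformers >= 4.36 with the flash-attn package installed and may differ from the patch used above; the checkpoint name is a placeholder.

# Minimal sketch: load the model with FlashAttention-2 enabled.
# Assumes transformers >= 4.36 and flash-attn installed; the checkpoint is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",                   # placeholder checkpoint
    torch_dtype=torch.bfloat16,               # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",
)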
@Dominic789654 You may try my latest PR: https://github.com/microsoft/DeepSpeed/pull/3629. This patch loads the checkpoint serially, so resuming training from a checkpoint no longer causes a memory peak.