DeepSpeed: [BUG] Trying to fine-tune LLaMA 33B on 8×A100 40G with 600GB RAM, but always OOM on RAM
I am fine-tuning the LLaMA 33B model on a server with 8×A100 40G GPUs and 600GB of RAM, but I keep running into OOM on RAM. I am mainly using the default ZeRO-3 config template:
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
I have tried modifying this config: keeping parameters on the GPU and offloading only the optimizer to the CPU, and keeping parameters on the GPU and offloading only the optimizer to NVMe. None of these attempts succeeded; they all still run out of RAM. Do you have any suggestions for my situation?
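For concreteness, here is a minimal sketch (not my exact script) of the optimizer-to-NVMe variant I mean, passing the ZeRO-3 config to the HF Trainer as a Python dict. The NVMe path, output directory, and batch sizes are placeholders.

# Minimal sketch, not my exact script: ZeRO-3 with only the optimizer offloaded
# (here to NVMe). "/local_nvme", "out", and the batch sizes are placeholders.
from transformers import TrainingArguments

ds_config = {
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto",
                   "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 3,
        # Only the optimizer states leave the GPU; there is no "offload_param" block.
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",   # placeholder path
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed=ds_config,               # the HF Trainer also accepts a path to a JSON file here
)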
About this issue
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 38 (4 by maintainers)
Hi @s1ghhh and @memray, you can check my general scripts here if you still need them: https://github.com/LuJunru/LLM_SFT/tree/main
@memray
Hi Rui,
You can refer to https://github.com/tatsu-lab/stanford_alpaca. I used the Trainer class from HF to load the model and just passed --deepspeed to enable the DeepSpeed plugin. Hope this helps!
Junru
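In case it helps, a minimal sketch of what that HF Trainer + --deepspeed wiring looks like. This is not the Alpaca training script itself; the script name, model checkpoint, and toy dataset are placeholders.

# Minimal sketch of the HF Trainer + --deepspeed wiring (not the Alpaca training
# script itself; the checkpoint name and toy dataset are placeholders).
from transformers import (AutoModelForCausalLM, AutoTokenizer, HfArgumentParser,
                          Trainer, TrainingArguments)

# Launch with e.g.:
#   deepspeed train.py --deepspeed zero3.json --output_dir out --bf16 True \
#       --per_device_train_batch_size 1 --gradient_accumulation_steps 16
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-30b")     # placeholder
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-30b")  # placeholder

# Toy dataset so the sketch is self-contained; identical strings, so no padding is needed.
ids = tokenizer("Hello world.", return_tensors="pt")["input_ids"][0]
dataset = [{"input_ids": ids, "labels": ids} for _ in range(8)]

Trainer(model=model, args=training_args, train_dataset=dataset).train()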
@memray
Full-model tuning
@memray I ran into similar issues before. In my case, it was caused by the environment variable CUDA_LAUNCH_BLOCKING=1; not sure about your case. I fine-tuned Vicuna 33B.
@memray Exactly. I used DeepSpeed ZeRO-3 offloads + FlashAttention.
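For readers wondering how to enable FlashAttention, a minimal sketch of one way to do it in recent transformers versions. This assumes transformers >= 4.36 with the flash-attn package installed and may differ from the patch used above; the checkpoint name is a placeholder.

# Minimal sketch: load the model with FlashAttention-2 enabled.
# Assumes transformers >= 4.36 and flash-attn installed; the checkpoint is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",                   # placeholder checkpoint
    torch_dtype=torch.bfloat16,               # FlashAttention requires fp16/bf16
    attn_implementation="flash_attention_2",
)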
@Dominic789654 You may try my latest PR: https://github.com/microsoft/DeepSpeed/pull/3629. This patch loads the checkpoint serially, so resuming training from a checkpoint no longer causes a memory peak.