DeepSpeed: [BUG] Loss scale already at minimum - Training Llama-2 7B via HF+deepspeed consistently fails

Describe the bug When training the Llama-2 7B HF model with DeepSpeed on a single-node multi-GPU setup, the loss scale is consistently decreased down to 1 (the minimum) and the run exits with the following error.

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

This occurs with both ZeRO Stage 2 + CPU offload and ZeRO Stage 3 + CPU offload.

To Reproduce

  • DeepSpeed with ZeRO Stage 2 + CPU offload.
  • HF Trainer (v4.32.0.dev0)

Expected behavior Training completes.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1
deepspeed install path ........... ['/home/<BLANKED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7

System info (please complete the following information):

  • OS: Rocky Linux 8
  • GPU count and types: 1 Machine, 4x Nvidia Tesla V100
  • Python version: 3.10.8
  • HuggingFace Transformers: 4.32.0.dev0

Launcher context Launching with Deepspeed as follows:

srun --jobid 3530272 bash -c "NCCL_DEBUG=INFO deepspeed \
--num_gpus=4 \
03_train_llama2.py \
--model_name meta-llama/Llama-2-7b-hf \
--cache_dir ./cache \
--use_fast_tokenizer false \
--model_revision main \
--use_auth_token true \
--hugging_token <BLANKED> \
--torch_dtype auto \
--low_cpu_mem_usage false \
--train_file ./input/health_information_systems_epub.md \
--max_train_samples 1000 \
--overwrite_cache false \
--validation_split_percentage 5 \
--preprocessing_num_workers 1 \
--keep_linebreaks true \
--output_dir ./trained/7B \
--overwrite_output_dir false \
--do_train true \
--do_eval false \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--evaluation_strategy steps \
--eval_steps 100 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type cosine \
--warmup_steps 0 \
--log_level passive \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--no_cuda false \
--seed 42 \
--fp16 true \
--bf16 false \
--half_precision_backend auto \
--local_rank 0 \
--ddp_backend nccl \
--deepspeed ./ds_configs/stage2_offload.json \
--optim adamw_torch"

Docker context No Docker

Additional context DS Config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": { 
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "allgather_bucket_size": 2e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 500,
    "wall_clock_breakdown": false,
    "train_micro_batch_size_per_gpu": 1
}

Slurm Setup

#SBATCH --job-name=deepspeed-llama2-7b-hf   # name
#SBATCH --nodes=1                           # nodes
#SBATCH --ntasks-per-node=1                 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=4
#SBATCH --partition=clara
#SBATCH --mem=256G                          # 128G was not enough
#SBATCH --gres=gpu:v100:4                   # number of gpus
#SBATCH --output=logs/%x-%j.out             # output file name

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 27 (1 by maintainers)

Most upvoted comments

Try to use bf16 as LLaMA-2 was pretrained using bf16. Continuing the training with fp16 will be problematic.
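
For illustration, a minimal sketch of what that suggestion could look like in the DeepSpeed config above: replace the fp16 block with a bf16 block and launch the Trainer with --bf16 true --fp16 false so the "auto" value resolves to enabled. Note that bf16 needs hardware support (e.g. Ampere or newer GPUs); V100s do not support it natively, which is the dilemma discussed below.

    "bf16": {
        "enabled": "auto"
    }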

Any updates here? I also face this problem with my 8*V100 machines.

None so far. V100s lead to overflow and a huge loss in performance as per the latest evaluation. For now I think there are three options:

  • don't use V100s if a bf16 model and DeepSpeed are required
  • don't use DeepSpeed if a bf16 model and V100s are required
  • don't use a bf16 model if V100s and DeepSpeed are required

The pull request is also still open and has had no recent changes. @YuFan-Microsoft

@scorixear

Other options could be:

Both work according to my friends' practice.

I encountered the same problem when training a ViT. I set loss_scale_window to a relatively small value (e.g., 100) so that the loss scale has a chance to rise again after being decreased on some batches. That solved the problem; you may need to choose an appropriate scale_window. I am also using V100s.
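
For reference, a sketch of that workaround applied to the fp16 section of the config above; all values except loss_scale_window are kept from the original, and 100 is only the commenter's example, so it may need tuning for your run.

    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }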