DeepSpeed: [BUG] Loss scale already at minimum - Training LLaMA-2 7B via HF+DeepSpeed consistently fails
Describe the bug When training the LLaMA-2 7B HF model with DeepSpeed on a single-node multi-GPU setup, the loss scale is consistently decreased down to 1 (the minimum) and training exits with an error.
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
This appears both with ZeRO Stage 2 + CPU offload and ZeRO Stage 3 + CPU offload.
To Reproduce
- DeepSpeed with ZeRO Stage 2 + CPU offload.
- HF Trainer (v4.32.0.dev0)
Expected behavior Training completes.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/software/all/staging/PyTorch/1.12.1-foss-2022a-CUDA-11.7.0/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1
deepspeed install path ........... ['/home/<BLANKED>/LLaMA_Training/.env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7
System info (please complete the following information):
- OS: Rocky Linux 8
- GPU count and types: 1 Machine, 4x Nvidia Tesla V100
- Python version: 3.10.8
- HuggingFace Transformers: 4.32.0.dev0
Launcher context Launching with DeepSpeed as follows:
srun --jobid 3530272 bash -c "NCCL_DEBUG=INFO deepspeed \
  --num_gpus=4 \
  03_train_llama2.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --cache_dir ./cache \
  --use_fast_tokenizer false \
  --model_revision main \
  --use_auth_token true \
  --hugging_token <BLANKED> \
  --torch_dtype auto \
  --low_cpu_mem_usage false \
  --train_file ./input/health_information_systems_epub.md \
  --max_train_samples 1000 \
  --overwrite_cache false \
  --validation_split_percentage 5 \
  --preprocessing_num_workers 1 \
  --keep_linebreaks true \
  --output_dir ./trained/7B \
  --overwrite_output_dir false \
  --do_train true \
  --do_eval false \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --evaluation_strategy steps \
  --eval_steps 100 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --adam_epsilon 1e-8 \
  --max_grad_norm 1.0 \
  --num_train_epochs 3 \
  --lr_scheduler_type cosine \
  --warmup_steps 0 \
  --log_level passive \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --no_cuda false \
  --seed 42 \
  --fp16 true \
  --bf16 false \
  --half_precision_backend auto \
  --local_rank 0 \
  --ddp_backend nccl \
  --deepspeed ./ds_configs/stage2_offload.json \
  --optim adamw_torch"
Docker context No Docker
Additional context DS Config:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"allgather_bucket_size": 2e8,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"gradient_clipping": 1.0,
"steps_per_print": 500,
"wall_clock_breakdown": false,
"train_micro_batch_size_per_gpu": 1
}
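The Stage 3 + CPU offload config used in the other failing runs was not posted. As a rough sketch only (assumed on my part, not the reporter's actual file), its zero_optimization section would typically look something like this with the HF integration:
"zero_optimization": {
  "stage": 3,
  "overlap_comm": true,
  "contiguous_gradients": true,
  "reduce_bucket_size": "auto",
  "stage3_prefetch_bucket_size": "auto",
  "stage3_param_persistence_threshold": "auto",
  "stage3_gather_16bit_weights_on_model_save": true,
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  },
  "offload_param": {
    "device": "cpu",
    "pin_memory": true
  }
}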
Slurm Setup
#SBATCH --job-name=deepspeed-llama2-7b-hf # name
#SBATCH --nodes=1 # nodes
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=4
#SBATCH --partition=clara
#SBATCH --mem=256G # 128G was not enough
#SBATCH --gres=gpu:v100:4 # number of gpus
#SBATCH --output=logs/%x-%j.out # output file name
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 27 (1 by maintainers)
Commits related to this issue
- Fixes for training models with bf16 + freshly initialized optimizer via `load_module_only` (#4141) This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoi... — committed to microsoft/DeepSpeed by haileyschoelkopf 5 months ago
Try to use `bf16`, as LLaMA-2 was pretrained using `bf16`. Continuing the training with `fp16` will be problematic.
None so far; V100 led to overflow and a huge loss in performance as per the latest evaluation. For now I think there are three options:
The pull request is also open and has had no recent changes. @YuFan-Microsoft
@scorixear
Other options could be:
Both work according to my friends' practice.
I encountered the same problem when training ViT. I set the scale window (`loss_scale_window`) to a relatively small value (e.g., 100), so the loss scale has the opportunity to rise again after it has been decreased on some batches. That solved the problem; you may need to choose an appropriate `loss_scale_window`. I am also using V100s.
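As a sketch of that suggestion applied to the `fp16` block of the config above (only the window value of 100 comes from the comment; the rest is unchanged and untested here):
"fp16": {
  "enabled": "auto",
  "loss_scale": 0,
  "loss_scale_window": 100,
  "initial_scale_power": 16,
  "hysteresis": 2,
  "min_loss_scale": 1
}
A smaller `loss_scale_window` means the dynamic scaler needs fewer consecutive overflow-free steps before it attempts to raise the scale again.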
You may be interested in this HF doc: https://huggingface.co/docs/transformers/v4.15.0/performance