DeepSpeed: [BUG] ZeRO-2 and ZeRO-3 behave differently with the same hyperparameters when training a large model
Describe the bug ZeRO-3 and ZeRO-2 report grad_norm values on completely different scales, which eventually leads to the divergence (crash) of ZeRO-2's loss.
Expected behavior ZeRO-2 should produce the same smoothed loss curve as ZeRO-3.
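For context, this is roughly how the two runs are being compared: a minimal sketch that logs loss and global grad norm per step. It assumes the engine exposes `get_global_grad_norm()` (available in recent DeepSpeed versions), and `model`, `params`, `ds_config`, and `data_loader` are placeholders for your own setup.

```python
import deepspeed

# Sketch of per-step logging used to compare a ZeRO-2 run against a ZeRO-3 run.
# `model`, `params`, `ds_config`, and `data_loader` are placeholders, not real names.
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=params,
                                       config=ds_config)

for step, batch in enumerate(data_loader):
    loss = engine(batch)              # forward; assumes the model returns the loss
    engine.backward(loss)
    engine.step()
    # get_global_grad_norm() may return None until the first optimizer step completes
    grad_norm = engine.get_global_grad_norm()
    if engine.global_rank == 0:
        print(f"step={step} loss={loss.item():.4f} grad_norm={grad_norm}")
```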
ds_report output
[2023-09-10 20:53:54,125] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.9/dist-packages/torch']
torch version .................... 2.1.0.dev20230424+cu117
deepspeed install path ........... ['/usr/local/lib/python3.9/dist-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.7
Additional context The training crashes after a day or two, which makes the issue hard to reproduce.
Hi team, any update?
Update: We have investigated further. The grad_norm difference is caused by differences in the gradients themselves. Our current workaround is to set overlap_comm to False when using ZeRO-2. This fixes the issue, but it makes training slower. @tjruwase hope this helps.
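For anyone hitting the same problem, a minimal sketch of the workaround in the DeepSpeed config; only `overlap_comm` is the relevant change, the other values are placeholders and should match your existing setup.

```python
# Illustrative ZeRO-2 config with the workaround applied.
# Only "overlap_comm": False is the relevant change; other values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,          # workaround: stop overlapping gradient reduction with backward
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}
```

The config is then passed to `deepspeed.initialize(..., config=ds_config)` as usual.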
Has there been any recent progress? 🤔️
Here I mean prefetching the next layers of the model for all_gather; see stage3_prefetch_bucket_size for reference. Dataloader prefetching also overlaps data-processing time, but for an LLM with ZeRO-3 the main communication cost lies in the all_gather and reduce ops.
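For reference, a sketch of where that knob sits in a ZeRO-3 config (values are placeholders, not tuned recommendations).

```python
# Illustrative "zero_optimization" section for ZeRO-3, showing the prefetch-related
# knobs mentioned above. Values are placeholders, not recommendations.
zero_optimization = {
    "stage": 3,
    "overlap_comm": True,
    "stage3_prefetch_bucket_size": 5e7,         # how many parameters to all_gather ahead of use
    "stage3_max_live_parameters": 1e9,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 5e8,
}
```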
Double-checked: there is a pretty serious problem in stage 2.
Stage 1 and stage 3 look good and are almost equal in loss & grad_norm. cc @tjruwase @loadams
As I’m using Ethernet between nodes, ZeRO-3 is not an option… When training with Megatron, I may try it.
Hello, I’ve recently been using ZeRO-2 to train a Llama-2 13B model on 64 GPUs and I’m seeing a similar issue. The loss goes down for the first few dozen steps, and then it diverges. Is this a new bug introduced in a recent version? This is a pretty serious problem and should have been noticed long ago.