DeepSpeed: [BUG] ZeRO-2 and ZeRO-3 behave differently with the same hyperparameters when training a large model

Describe the bug The grad_norm under ZeRO-3 and ZeRO-2 is on completely different scales, which eventually causes ZeRO-2's loss to diverge.

Expected behavior ZeRO-2 should produce the same smoothed loss as ZeRO-3.

ds_report output


[2023-09-10 20:53:54,125] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.9/dist-packages/torch']
torch version .................... 2.1.0.dev20230424+cu117
deepspeed install path ........... ['/usr/local/lib/python3.9/dist-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.7

Additional context The training crashes after a day or two, so the issue is hard to reproduce.

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

Hi team, any update?

Update: We have investigated further. The norm difference is caused by a difference in the gradients themselves. Our current workaround is to set overlap_comm to False when using ZeRO-2. This fixes the issue, but it makes training slower. @tjruwase hope this may help.
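For reference, a minimal sketch of the workaround above. Only the stage and overlap_comm keys are the ones being discussed; every other value here is an illustrative placeholder, not a recommendation:

```python
# Sketch of a ZeRO-2 config with overlapped gradient-reduction
# communication disabled, as described in the comment above.
# Only "stage" and "overlap_comm" are the keys under discussion;
# the other values are illustrative placeholders.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "gradient_accumulation_steps": 1,      # placeholder
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,  # workaround: avoids the grad_norm mismatch, at some speed cost
    },
}

# model and optimizer come from your own training script, e.g.:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config
# )
```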

Has there been any recent progress? 🤔️

@kisseternity could u try stage-3 to double-check this issue?

I’ve tried ZeRO-3 and it’s fine. Besides, the training speed is acceptable compared to ZeRO-2 with communication overlap, better than expected. It seems the prefetch in ZeRO-3 can hide much of its roughly 1.5x communication cost, keeping the speed pretty fast even between Ethernet-connected nodes.

Does ‘prefetch’ here refer to a parameter of the torch dataloader or a parameter of deepspeed?

Here I mean prefetching the next layer of the model so its all_gather can start early; see the stage3_prefetch_bucket_size setting for reference. The dataloader prefetch also overlaps data-processing time, but for an LLM with ZeRO-3 the main communication cost lies in the all_gather and reduce ops.
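For context, a sketch of where that knob sits in a ZeRO-3 config. stage3_prefetch_bucket_size and stage3_param_persistence_threshold are standard ZeRO-3 config keys; the numeric values below are illustrative, not tuned recommendations:

```python
# Sketch of a ZeRO-3 config showing the prefetch knob discussed above.
# stage3_prefetch_bucket_size controls how many parameter elements are
# all_gathered ahead of time for upcoming submodules, which is what lets
# the communication overlap with compute. Values are illustrative.
ds_config_zero3 = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_prefetch_bucket_size": 5e8,         # elements prefetched for upcoming layers
        "stage3_param_persistence_threshold": 1e5,  # small params kept gathered (illustrative)
    },
}
```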

Double-checked: there is a pretty serious problem in stage-2.

stage-1 & stage-3 look good and almost equal in loss & grad_norm. cc @tjruwase @loadams

[screenshot: loss and grad_norm curves for stage-1 vs stage-3]

@kisseternity could u try stage-3 to double-check this issue?

As I’m using Ethernet between nodes, ZeRO-3 is not an option… When the training with Megatron is done, I may try it.

By comparing DeepSpeed stage-2 with native torch DDP (both fp32 training), I also encountered a similar problem. I think there must be some gap in the smoothing strategy between DS stage-2 and DDP, which leads to a different scale of loss (or grad_norm).

You can see that with the same training config, the grad_norm of DeepSpeed (stage-2) is about 5 times higher than DDP's, which will eventually cause divergence.

[screenshot: grad_norm curves, DeepSpeed stage-2 vs torch DDP]
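In case it helps anyone reproduce the comparison, a hedged sketch of how one might log the global gradient norm on both sides. The DDP side recomputes the total L2 norm by hand; the DeepSpeed side reads the engine's own value, assuming your DeepSpeed version exposes get_global_grad_norm(). Names like `model`, `engine`, and `loss` come from your own training loop:

```python
# Sketch for comparing gradient norms between DDP and DeepSpeed stage-2.
import torch

def ddp_grad_norm(model: torch.nn.Module) -> float:
    # Total L2 norm over all parameter gradients, as DDP sees them
    # after its gradient all_reduce.
    grads = [p.grad.detach() for p in model.parameters() if p.grad is not None]
    if not grads:
        return 0.0
    return torch.norm(torch.stack([torch.norm(g, 2) for g in grads]), 2).item()

# DDP loop (sketch):
#   loss.backward()
#   norm = ddp_grad_norm(ddp_model.module)
#
# DeepSpeed loop (sketch):
#   engine.backward(loss)
#   engine.step()
#   norm = engine.get_global_grad_norm()  # the value logged as grad_norm
```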

Also, if we scale up to more GPUs (1 node / 8 GPUs -> 2 nodes / 16 GPUs), we immediately see a huge mismatch and a loss crash:

[screenshot: loss curves after scaling from 8 GPUs to 16 GPUs]

Hello, I’ve recently been using ZeRO-2 to train a LLaMA-2 13B model on 64 GPUs and I see a similar issue. The loss goes down for the first few dozen steps, then diverges. Is this a new bug introduced in a recent version? This is a pretty serious problem and should have been noticed long ago.