DeepSpeed: Performance Degradation with ZeRO Stage 3

Hi,

I am trying to benchmark a 10B-parameter Hugging Face RobertaForMaskedLM model with both ZeRO Stage 2 and ZeRO Stage 3 to compare the latency impact of parameter partitioning.

However, I am seeing much worse performance with Stage 3 than expected, so I want to check whether something looks wrong.

|--------------------+-------+--------+-------+---------+--------|
| Description        | Model | # p3dn | batch | Samples | TFLOPS |
|                    | Size  |  hosts |  size | Per Sec |  / GPU |
|--------------------+-------+--------+-------+---------+--------|
| Baseline (stage2)  | 10B   |     16 |     8 |     178 |   44.7 |
| Stage3, no offload | 10B   |     16 |     8 |      77 |   19.5 |
| Stage3, no offload | 10B   |      8 |     8 |      41 |   21.9 |
| Stage3, no offload | 10B   |      4 |     8 |      23 |   23.5 |
| Stage3, no offload | 10B   |      2 |     8 |    11.6 |   23.5 |
| Stage3, no offload | 10B   |      1 |     8 |     OOM |    OOM |
|--------------------+-------+--------+-------+---------+--------|

The problem does not seem to be related to network bandwidth, because when I move to p4d machines, which have 4x the bandwidth of p3dn machines (400 Gbps vs. 100 Gbps), I see similar degradation:

|--------------------+-------+--------+-------+---------+--------|
| Description        | Model |  # p4d | batch | Samples | TFLOPS |
|                    | Size  |  hosts |  size | Per Sec |  / GPU |
|--------------------+-------+--------+-------+---------+--------|
| Baseline (stage2)  | 10B   |     16 |     8 |     432 |    109 |
| Stage3, no offload | 10B   |      4 |     8 |      44 |   44.5 |
|--------------------+-------+--------+-------+---------+--------|

I tried increasing stage3_max_live_parameters from 1e9 → 2e9, and stage3_prefetch_bucket_size from 5e8 → 10e8, but neither change impacted performance.
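Concretely, the Stage 3 section of the config looked roughly like this after those changes (same format as the configs in the P.S. below):

  zero_optimization:
    stage: 3
    overlap_comm: true
    stage3_max_live_parameters: 2e9
    stage3_prefetch_bucket_size: 10e8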

In addition, I ended up adding some time.time() statements before and after the following calls (rough sketch of the instrumentation below):

a. The blocking fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1520
b. The non-blocking pre-fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1525-L1528
c. The release() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1540
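The instrumentation was nothing fancy, roughly this pattern (a sketch; fetch_call, prefetch_call, and release_call are placeholders for the three linked calls above, not real DeepSpeed functions):

    import time

    totals = {"fetch": 0.0, "prefetch": 0.0, "release": 0.0}

    def timed(name, fn, *args, **kwargs):
        # accumulate wall-clock milliseconds spent in fn under the given key
        start = time.time()
        result = fn(*args, **kwargs)
        totals[name] += (time.time() - start) * 1000.0
        return result

    # inside the pre/post sub-module hooks:
    #   timed("fetch",    fetch_call,    sub_module)   # (a) blocking fetch
    #   timed("prefetch", prefetch_call, sub_module)   # (b) non-blocking pre-fetch
    #   timed("release",  release_call,  sub_module)   # (c) release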

Counter-intuitively, I noticed that the majority of the time was spent in what is supposed to be the non-blocking pre-fetch call:

Total fetch time = 599.5581150054932 ms;
Total pre-fetch time = 4473.618030548096 ms;
Total release time = 1130.7482719421387 ms

Total time = 6203.9244174957275 ms

In fact, after a bit of digging and some additional timing statements added to the code, I isolated the source of the slow pre-fetch to this line: https://github.com/microsoft/DeepSpeed/blob/18a26e8604c4cb8562ed8d57241ca64dbeb4318a/deepspeed/runtime/zero/partition_parameters.py#L798

Any ideas why I am seeing a 2x or larger drop in performance when moving to Stage 3 (compared to Stage 2)? And why does pre-fetching take so much time when it is supposed to be an asynchronous background operation?

Thanks, Stephen

P.S. Full details below.

Model config:

RobertaForMaskedLM:
    max_position_embeddings: 512
    type_vocab_size: 1
    num_attention_heads: 40
    num_hidden_layers: 30
    hidden_size: 5120
    intermediate_size: 20480
    gradient_checkpointing: true

Zero stage 2 config:

  zero_optimization:
    stage: 2
    overlap_comm: true

Zero stage 3 config:

  zero_optimization:
    stage: 3
    overlap_comm: true

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 78 (70 by maintainers)

Most upvoted comments

hey @tjruwase we have published a PR with optimizations and some other improvements here https://github.com/microsoft/DeepSpeed/pull/1453

quick note: c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::allgather_coalesced is stubbed out in PyTorch. @zarzen and I both have DeepSpeed/Python implementations of this, but it could benefit everyone to move the implementation to PyTorch/C++ at some point.

https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L1312-L1318

thanks @zarzen for the insight and the idea to batch the all-gather calls, it's really promising. I think it would still be good to have an async batched all-gather. I'm working on some ideas for optimizing prefetching; still working on testing/analysis and fixing some issues, but so far performance seems pretty good (~37 TFLOPS/GPU compared to the ~16 TFLOPS/GPU I started with)

https://github.com/jfc4050/DeepSpeed/commit/319e2af4d4516c38ed7513d3f5c9456a0f1fb046 EDIT: broke this link by rebasing, use this instead https://github.com/jfc4050/DeepSpeed/tree/stage3

  • remove torch.cuda.synchronize() calls, make it so each module fetch can be blocked on individually
  • added all_gather_coalesced, which is similar to the batched all-gather you implemented but can be async. I'm missing some of the optimizations you proposed related to avoiding memcpys. The parameters in each module are fetched in a single all_gather_coalesced call
  • using a queue for prefetching, which can be computed once instead of determining which modules to prefetch at each step
  • change fetching order from (fetch current, block, prefetch) -> (fetch current, prefetch, block) so prefetches can progress while we are waiting for the current module (see the sketch after this list)
  • minor changes to module re-use logic
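a rough sketch of that reordering (illustrative names, not the actual DeepSpeed API):

    # before: (fetch current, block, prefetch) -- the prefetch cannot overlap the wait
    handle = all_gather_params_async(current_module)   # hypothetical async fetch
    handle.wait()                                      # block until the current module's params arrive
    launch_prefetch(upcoming_modules)                  # hypothetical; only starts after the wait

    # after: (fetch current, prefetch, block) -- the prefetch overlaps the wait
    handle = all_gather_params_async(current_module)
    launch_prefetch(upcoming_modules)                  # kicked off before blocking
    handle.wait()                                      # prefetches progress while we wait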

@jfc4050 and @zarzen, thanks for sharing your insights, these are very valuable. I will resume this line next week, as I am off this week.

from what i can tell

  • there's a pre-module hook which resets the step id (the step id is used to track where we are in the forward pass)
  • part of the pre-sub-module hook increments the step id (along with fetching params and other stage 3 stuff)
  • a pre-sub-module hook is executing before the pre-module hook, so the step id gets incremented to 1 and then reset to zero. This leaves the rest of the step ids in the forward pass off by 1, which was causing the prefetches to be skipped (see the sketch below)
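to illustrate with a simplified step id (not the actual hook code):

    step_id = 0

    def pre_module_hook():          # runs once at the start of the forward pass
        global step_id
        step_id = 0                 # reset

    def pre_sub_module_hook():      # runs before every sub-module
        global step_id
        step_id += 1                # (plus param fetching and other stage 3 work)

    # expected order: pre_module_hook, pre_sub_module_hook, pre_sub_module_hook, ...
    #                 -> step ids 1, 2, 3, ...
    # observed order: pre_sub_module_hook, pre_module_hook, pre_sub_module_hook, ...
    #                 -> the first increment is wiped by the reset, so every later
    #                    step id is off by one and the prefetch schedule never matches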

@yukw777, since the original issue is so old, could you please open a new ticket and share steps to repro? Thanks!

This is awesome to hear. Thanks for sharing these exciting updates.

few updates:

  • the bug mentioned earlier turned out to be related to gradient accumulation and has been fixed in https://github.com/jfc4050/DeepSpeed/commit/3136c12e30c1a2f72653bf9ff3d013b95e8abcf2
  • we were seeing good scaling up until 32 p4d nodes, when the performance fell off hard. this seems to be more of a hardware issue; there are a few nodes in our cluster with significantly lower bandwidth utilization than the others (measured with NCCL perf tests). Once this is cleared up we will see how it scales to larger clusters.
  • @zarzen is seeing some performance issues on his single p4d setup, trying to repro those now

in the multi-node (p3dn) runs, I was seeing that the cudaMemcpyAsync API calls are taking a lot of time, which results in the Python process becoming the bottleneck. the cudaMemcpyAsyncs are called during the torch allgather call, which launches a memcpy for each rank (16 in this case), but typically only one of them takes forever while the rest have more reasonable latencies. [screenshot: memcpyasync-slow]

the memcpy was DtoD and actually asynchronous. the actual memcpy has reasonable latency; it's just the CUDA API call that takes super long. [screenshot]

on a suggestion from @zarzen (thank you!!), I tried using the new torch.distributed._all_gather_base API, which has gotten rid of the memcpy overhead; now we are clearly communication bound (sketch of the change below). I need to take a closer look at the prefetching order and make sure it's optimal. Also, we can try to fiddle with various configuration options to amortize communication cost over more computation or make the communication more efficient (mostly NCCL configuration). [screenshot: after-allgather-base]
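for reference, the change is roughly from the list-based all_gather to the flat-buffer variant (a sketch; assumes the process group is already initialized and shard is this rank's partition of a parameter):

    import torch
    import torch.distributed as dist

    world_size = dist.get_world_size()

    # old path: all_gather into a list of per-rank tensors.
    # each output tensor gets its own cudaMemcpyAsync out of the communication buffer.
    outputs = [torch.empty_like(shard) for _ in range(world_size)]
    handle = dist.all_gather(outputs, shard, async_op=True)

    # new path: _all_gather_base writes directly into one flat output buffer,
    # so the per-rank memcpys go away (renamed all_gather_into_tensor in newer PyTorch).
    flat = torch.empty(world_size * shard.numel(), dtype=shard.dtype, device=shard.device)
    handle = dist._all_gather_base(flat, shard, async_op=True)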

also there are still some bugs with different model architectures; working with @zarzen to iron those out

@zarzen and @jfc4050, yes it would be great to coordinate your efforts. Additionally, @zarzen, do you need me to commit #1170 before your PR goes in?

hey @zarzen, thanks for the analysis, seems like you uncovered an edge case related to modules re-appearing in the forward pass (idk why it only seems to happen when partitioning activations). I didn’t look too far into it because i was changing it back to not batch by module in order to try and better saturate communication bandwidth. (just finished, the newest changes are pushed to https://github.com/jfc4050/DeepSpeed/tree/stage3)

the code is still in a pretty rough state and not very thoroughly tested so you may still find bugs

@tjruwase based on @zarzen's analysis it's a bug I introduced in my stage3 branch, so we should be good in master

BTW, I have finished a rough version of inplace_allgather to replace torch.distributed.all_gather. The inplace_allgather avoids the memcpy step inside torch.distributed.all_gather. Besides, it allows you to specify the CUDA stream for launching the communication, and it returns a wrapper of two CUDA events for you to query/sync the communication task (rough interface sketch below). #1188
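roughly, the interface idea looks something like this (an illustrative sketch, not the actual #1188 code):

    import torch

    class CommHandle:
        # wraps the two CUDA events recorded around a communication launch

        def __init__(self, start_event, done_event):
            self.start_event = start_event
            self.done_event = done_event

        def is_complete(self):
            # non-blocking query of the completion event
            return self.done_event.query()

        def wait(self, stream=None):
            # make the given (or current) stream wait for the communication to finish
            (stream or torch.cuda.current_stream()).wait_event(self.done_event)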

I guess the ipc-handle is different from the handle returned from the torch.distributed APIs, because the returned handle is a wrapper for the ProcessGroup::Work class.

In general, handle.wait() only blocks for completion of the job when we set the environment variable NCCL_BLOCKING_WAIT=1.

Under the hood, handle.wait() calls this function: https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L316_L320

It checks the variable blockingWait_, which is false by default, so it basically skips all of the following condition checks/synchronizations. To enable this branch, you have to set NCCL_BLOCKING_WAIT=1.
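a minimal usage sketch (assumes the usual launcher environment variables for init_process_group are set):

    import os
    import torch
    import torch.distributed as dist

    # must be set before the NCCL process group is created
    os.environ["NCCL_BLOCKING_WAIT"] = "1"
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    shard = torch.ones(1024, device="cuda")
    outputs = [torch.empty_like(shard) for _ in range(dist.get_world_size())]

    handle = dist.all_gather(outputs, shard, async_op=True)
    handle.wait()  # actually blocks for completion only because NCCL_BLOCKING_WAIT=1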

@jfc4050, thanks for your response and help.

Yes, you have read the PR correctly. I have not removed the synchronization points as they require a bit more analysis/testing. Please share what you find out about reproducing the original perf that you saw. Anyways, I consider the PR to be a work-in-progress at this point. I hope that with help from you and @stephenrawls, we can get a better understanding of the perf blockers in zero-3 relative to zero-2.

By the way, I added some timers for the zero-3 fetching and prefetching, which you can enable by setting wall_clock_breakdown to true (config snippet below). Unfortunately, the timers themselves introduce sync points and so affect timing somewhat.
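i.e., in the DeepSpeed config (same format as the snippets earlier in this thread; combine with your existing zero_optimization block):

  wall_clock_breakdown: true
  zero_optimization:
    stage: 3
    overlap_comm: true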

Never mind, I see these snippets above. I noticed the profiler was not counting the number of parameters correctly for zero3 (394 instead of 2020M).

[profiler screenshot]

looking at this random subsection of a train run, it seems like there's a lot of overhead from the pre and post sub-module hooks, similar to what @stephenrawls saw. I'm not seeing the same overhead from the non-blocking prefetch call though. the fetch operation taking some time seems reasonable, but we also spend quite a lot of time doing what appears to be releasing parameters.

the GPU isn't all that busy during this time; hopefully there's room to push the utilization up a bit

wrapping up for the day, but tomorrow I will spend some time familiarizing myself with the code and looking a little more closely at the profile.