DeepSpeed: Performance Degradation with ZeRO Stage 3
Hi,
I am trying to benchmark a 10B-parameter Hugging Face RobertaForMaskedLM
model with both ZeRO Stage 2 and ZeRO Stage 3 to compare the latency impact of parameter partitioning.
However, I am seeing much worse performance with Stage 3 than expected, so I want to check whether something looks wrong.
| Description         | Model size | # p3dn hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|--------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16           | 8          | 178         | 44.7       |
| Stage 3, no offload | 10B        | 16           | 8          | 77          | 19.5       |
| Stage 3, no offload | 10B        | 8            | 8          | 41          | 21.9       |
| Stage 3, no offload | 10B        | 4            | 8          | 23          | 23.5       |
| Stage 3, no offload | 10B        | 2            | 8          | 11.6        | 23.5       |
| Stage 3, no offload | 10B        | 1            | 8          | OOM         | OOM        |
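(As a rough sanity check on these numbers, the following back-of-envelope sketch, which assumes 8 V100s per p3dn host and the usual 6 * params * tokens FLOPs estimate per sample for forward + backward, lands close to the reported Stage 2 figure.)

```python
# Back-of-envelope check of the Stage 2 row above.
# Assumptions: 8 GPUs per p3dn.24xlarge host, and roughly
# 6 * params * seq_len FLOPs per sample for forward + backward
# (ignoring activation recomputation from gradient checkpointing).
params = 10e9           # 10B-parameter model
seq_len = 512           # max_position_embeddings
num_gpus = 16 * 8       # 16 p3dn hosts x 8 GPUs
samples_per_sec = 178   # measured Stage 2 throughput

flops_per_sample = 6 * params * seq_len
tflops_per_gpu = samples_per_sec * flops_per_sample / num_gpus / 1e12
print(f"{tflops_per_gpu:.1f} TFLOPS/GPU")  # ~42.7, close to the reported 44.7
```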
The problem does not seem to be related to network bandwidth: when I move to p4d machines, which have 4x the bandwidth of p3dn machines (400 Gbps vs 100 Gbps), I see a similar degradation:
| Description         | Model size | # p4d hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|-------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16          | 8          | 432         | 109        |
| Stage 3, no offload | 10B        | 4           | 8          | 44          | 44.5       |
I tried increasing stage3_max_live_parameters from 1e9 → 2e9 and stage3_prefetch_bucket_size from 5e8 → 10e8, but neither change impacted performance.
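For reference, those two knobs live in the zero_optimization section of the DeepSpeed config; the values I tried correspond to something like the following sketch (the rest of my config is omitted here):

```python
# Sketch of the relevant zero_optimization settings (increased values shown).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_max_live_parameters": 2e9,    # increased from 1e9
        "stage3_prefetch_bucket_size": 10e8,  # increased from 5e8
    },
}
```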
In addition, I ended up adding some time.time() statements before and after:
a. the blocking fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1520
b. the non-blocking pre-fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1525-L1528
c. the release() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1540
Counter-intuitively, I noticed that the majority of the time was spent in what is supposed to be the non-blocking pre-fetch call:
Total fetch time = 599.5581150054932 ms;
Total pre-fetch time = 4473.618030548096 ms;
Total release time = 1130.7482719421387 ms
Total time = 6203.9244174957275 ms
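For reference, the instrumentation was roughly of the following shape (a simplified sketch, not the exact code; the call names in the comments are placeholders for the three call sites linked above, and note that time.time() around asynchronous CUDA work only measures how long the Python/API call blocks the host):

```python
import time

# Accumulate host-side wall-clock time around each call site of interest.
totals = {"fetch": 0.0, "prefetch": 0.0, "release": 0.0}

def timed(name, fn, *args, **kwargs):
    start = time.time()
    result = fn(*args, **kwargs)
    totals[name] += (time.time() - start) * 1000.0  # milliseconds
    return result

# Used around the three calls linked above, e.g. (illustrative names only):
#   timed("fetch", coordinator.fetch_sub_module, sub_module)
#   timed("prefetch", coordinator.prefetch_next_sub_modules, sub_module)
#   timed("release", coordinator.release_sub_module, sub_module)
```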
In fact, after a bit of digging and some additional timing statements in the code, I isolated the cause of the slow pre-fetch to this line: https://github.com/microsoft/DeepSpeed/blob/18a26e8604c4cb8562ed8d57241ca64dbeb4318a/deepspeed/runtime/zero/partition_parameters.py#L798
Any ideas why I am seeing a 2x or larger drop in performance when moving to Stage 3 (compared to Stage 2)? And why is pre-fetching taking so much time when it is supposed to be an asynchronous background operation?
Thanks, Stephen
P.S. Details below.
Model config:
RobertaForMaskedLM:
  max_position_embeddings: 512
  type_vocab_size: 1
  num_attention_heads: 40
  num_hidden_layers: 30
  hidden_size: 5120
  intermediate_size: 20480
  gradient_checkpointing: true
ZeRO Stage 2 config:
zero_optimization:
  stage: 2
  overlap_comm: true
ZeRO Stage 3 config:
zero_optimization:
  stage: 3
  overlap_comm: true
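For anyone trying to reproduce this, here is a sketch of building the model from the config above and sanity-checking the parameter count. The vocab size is not listed, so the Hugging Face RoBERTa default is assumed; note this materializes the full fp32 model in CPU memory.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Build the "10B" model from the config above.
# vocab_size is not given in the issue, so the library default is used here.
config = RobertaConfig(
    max_position_embeddings=512,
    type_vocab_size=1,
    num_attention_heads=40,
    num_hidden_layers=30,
    hidden_size=5120,
    intermediate_size=20480,
)
model = RobertaForMaskedLM(config)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.1f}B parameters")  # should land near 10B
```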
Hey @tjruwase, we have published a PR with optimizations and some other improvements here: https://github.com/microsoft/DeepSpeed/pull/1453
Quick note: c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::allgather_coalesced is stubbed out in PyTorch (https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L1312-L1318). @zarzen and I both have DeepSpeed/Python implementations of this, but it could benefit everyone to move the implementation to PyTorch/C++ at some point.
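For anyone following along, here is a rough Python-level sketch of what such a coalesced all-gather can look like (my own illustration, not the DeepSpeed or PyTorch implementation): pack every local shard into one buffer, issue a single all_gather, then unpack.

```python
import torch
import torch.distributed as dist

def allgather_coalesced(shards, group=None):
    """Gather a list of per-rank parameter shards with one NCCL call.

    shards[i] is this rank's equally-sized slice of parameter i. Returns the
    fully materialized (flattened) parameters in the same order.
    """
    world_size = dist.get_world_size(group)
    # 1. Pack all local shards into one contiguous buffer -> a single all_gather.
    flat = torch.cat([s.reshape(-1) for s in shards])
    gathered = [torch.empty_like(flat) for _ in range(world_size)]
    dist.all_gather(gathered, flat, group=group)
    # 2. Unpack: parameter i is the concatenation of its slice from every rank.
    full_params, offset = [], 0
    for s in shards:
        n = s.numel()
        full_params.append(torch.cat([g[offset:offset + n] for g in gathered]))
        offset += n
    return full_params
```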
Thanks @zarzen for the insight and the idea to batch the all-gather calls, it's really promising. I think it would still be good to have an async batched all-gather. I'm working on some ideas for optimizing prefetching; still working on testing/analysis and fixing some issues, but so far performance seems pretty good (~37 TFLOPS/GPU compared to the ~16 TFLOPS/GPU I started with).
https://github.com/jfc4050/DeepSpeed/commit/319e2af4d4516c38ed7513d3f5c9456a0f1fb046 (EDIT: I broke this link by rebasing; use this instead: https://github.com/jfc4050/DeepSpeed/tree/stage3)
The main changes: removed some torch.cuda.synchronize() calls, and made it so each module fetch can be blocked on individually. The parameters in each module are fetched in a single all_gather_coalesced call, which is similar to the batched all-gather you implemented but can be async. I'm missing some of the optimizations you proposed related to avoiding memcpys.
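A minimal sketch of that per-module async idea (my illustration using the plain torch.distributed handle API, not the branch code): start the gather early and only make it a dependency when that module's parameters are actually needed.

```python
import torch
import torch.distributed as dist

def begin_fetch(flat_shard: torch.Tensor):
    """Kick off the all-gather for one module without blocking."""
    world_size = dist.get_world_size()
    outputs = [torch.empty_like(flat_shard) for _ in range(world_size)]
    handle = dist.all_gather(outputs, flat_shard, async_op=True)
    return handle, outputs

def finish_fetch(handle):
    """Called right before the module runs. For NCCL this enqueues a
    dependency on the current CUDA stream; it does not block the host
    unless NCCL_BLOCKING_WAIT=1 (see the discussion further down)."""
    handle.wait()
```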
@jfc4050 and @zarzen, thanks for sharing your insights, these are very valuable. I will resume this line next week, as I am off this week.
@yukw777, since the original issue is so old, could you please open a new ticket and share steps to repro? Thanks!
This is awesome to hear. Thanks for sharing these exciting updates.
A few updates:
In the multi-node (p3dn) runs, I was seeing that the cudaMemcpyAsync API calls were taking a lot of time, which results in the Python process becoming the bottleneck. The cudaMemcpyAsyncs are called during the torch all-gather call, which launches a memcpy for each rank (16 in this case), but typically only one of them takes forever while the rest have more reasonable latencies.
The memcpy was DtoD and actually asynchronous; the memcpy itself has reasonable latency, it is just the CUDA API call that takes very long.
On a suggestion from @zarzen (thank you!!), I tried using the new torch.distributed._all_gather_base API, which has gotten rid of the memcpy overhead. Now we are clearly communication bound. I need to take a closer look at the prefetching order and make sure it's optimal. We can also try to fiddle with various configuration options to amortize communication cost over more computation or make the communication more efficient (NCCL configuration mostly). Also, there are still some bugs with different model architectures; I am working with @zarzen to iron those out.
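For reference, a sketch of the difference (my illustration; in newer PyTorch releases _all_gather_base became all_gather_into_tensor):

```python
import torch
import torch.distributed as dist

def gather_flat(flat_shard: torch.Tensor) -> torch.Tensor:
    """Gather a flattened shard from all ranks into one output buffer."""
    world_size = dist.get_world_size()

    # List-based all_gather: the backend gathers and then issues a DtoD copy
    # per output tensor (the cudaMemcpyAsync launches seen in the profile).
    #   outputs = [torch.empty_like(flat_shard) for _ in range(world_size)]
    #   dist.all_gather(outputs, flat_shard)

    # _all_gather_base: every rank writes directly into one pre-allocated
    # buffer, so those per-rank copies (and their launch overhead) go away.
    out = torch.empty(world_size * flat_shard.numel(),
                      dtype=flat_shard.dtype, device=flat_shard.device)
    dist._all_gather_base(out, flat_shard)
    return out
```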
@zarzen and @jfc4050, yes it would be great to coordinate your efforts. Additionally, @zarzen, do you need me to commit #1170 before your PR goes in?
Hey @zarzen, thanks for the analysis. It seems like you uncovered an edge case related to modules re-appearing in the forward pass (I don't know why it only seems to happen when partitioning activations). I didn't look too far into it because I was changing it back to not batch by module, in order to better saturate communication bandwidth. (Just finished; the newest changes are pushed to https://github.com/jfc4050/DeepSpeed/tree/stage3.)
The code is still in a pretty rough state and not very thoroughly tested, so you may still find bugs.
@tjruwase, based on @zarzen's analysis it's a bug I introduced in my stage3 branch, so we should be good in master.
BTW, I have finished a rough version of inplace_allgather to replace torch.distributed.all_gather. The inplace_allgather avoids the memcpy step here: memcpy of torch.distributed.all_gather. Besides, it allows you to specify the CUDA stream for launching the communications, and it returns a wrapper of two CUDA events for you to query/sync the communication task. #1188
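Not the actual #1188 implementation, but a sketch of the stream/event pattern it describes: launch the gather on a dedicated communication stream, then use a CUDA event to query or synchronize on completion.

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()  # dedicated stream for communication

def launch_gather(out: torch.Tensor, flat_shard: torch.Tensor) -> torch.cuda.Event:
    # Make the comm stream wait for any pending work that produced the inputs.
    comm_stream.wait_stream(torch.cuda.current_stream())
    done = torch.cuda.Event()
    with torch.cuda.stream(comm_stream):
        handle = dist._all_gather_base(out, flat_shard, async_op=True)
        handle.wait()              # ties completion to comm_stream
        done.record(comm_stream)   # fires once the gathered data is ready
    return done

# Later: done.query() for a non-blocking check, done.synchronize() to block
# the host, or torch.cuda.current_stream().wait_event(done) before compute.
```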
I guess the ipc-handle is different from the handle returned from the torch.distributed... APIs, because the returned handle is a wrapper for the class ProcessGroup::Work. In general:
The handle.wait is only working to block for completion of the job when we set the environment variable NCCL_BLOCKING_WAIT=1.
The underlying function of handle.wait() calls this function: https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L316-L320, which checks a condition on the variable blockingWait_, which is False by default. Thus, it basically skips all the following condition checks/synchronizations. To enable this branch, you have to set NCCL_BLOCKING_WAIT=1.
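In other words, roughly the following (sketch; the variable has to be set before the process group is created, since blockingWait_ is read at construction time):

```python
import os
import torch.distributed as dist

# Must be set before init_process_group so ProcessGroupNCCL picks it up.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
dist.init_process_group(backend="nccl")

# Now an async collective's handle.wait() actually blocks the host until
# the collective completes (or hits the configured timeout):
#   handle = dist.all_gather(outputs, flat_shard, async_op=True)
#   handle.wait()
```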
@jfc4050, thanks for your response and help.
Yes, you have read the PR correctly. I have not removed the synchronization points as they require a bit more analysis/testing. Please share what you find out about reproducing the original perf that you saw. Anyways, I consider the PR to be a work-in-progress at this point. I hope that with help from you and @stephenrawls, we can get a better understanding of the perf blockers in zero-3 relative to zero-2.
By the way, I added some timers for the zero-3 fetching and prefetching, which you can enable by setting wall_clock_breakdown to true. Unfortunately, the timers themselves introduce sync points and so affect timing somewhat.
Never mind, I see these snippets above. I noticed the profiler was not counting the number of parameters correctly for zero3 (394 instead of 2020M).
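For reference, the wall_clock_breakdown flag mentioned above is a top-level key in the DeepSpeed config, e.g.:

```python
ds_config = {
    "wall_clock_breakdown": True,  # enables the fetch/prefetch timers mentioned above
    "zero_optimization": {"stage": 3, "overlap_comm": True},
}
```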
Looking at this random subsection of a train run, it seems like there is a lot of overhead from the pre- and post-submodule hooks, similar to what @stephenrawls saw. I'm not seeing the same overhead from the non-blocking prefetch call though. The fetch operation taking some time seems reasonable, but we also spend quite a lot of time doing what appears to be releasing parameters.
The GPU isn't all that busy during this time, so hopefully there's room to push the utilization up a bit.
Wrapping up for the day, but tomorrow I will spend some time familiarizing myself with the code and looking more closely at the profile.
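For anyone who wants to reproduce this kind of breakdown, here is a sketch of capturing a few steps with torch.profiler (the engine and dataloader arguments are placeholders for your own training objects):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def profile_steps(model_engine, dataloader, num_steps=5, logdir="./ds_profile"):
    """Capture a few training steps; the trace shows the sub-module hooks,
    the all-gathers, and how busy the GPU actually is."""
    prof = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(logdir),
        with_stack=True,
    )
    with prof:
        for step, batch in enumerate(dataloader):
            loss = model_engine(batch)   # assumes the engine returns the loss
            model_engine.backward(loss)
            model_engine.step()
            prof.step()
            if step >= num_steps:
                break
```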