DeepSpeed: [BUG] Fused Adam optimizer requires more memory than unfused optimizer
Describe the bug: FusedAdam requires more GPU memory than the non-fused Adam optimizer.
To Reproduce Steps to reproduce the behavior:
- I train with pipeline parallelism.
- The relevant fp16/optimizer sections of my DeepSpeed config are as follows (a usage sketch follows the config):
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-5,
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 4e-5
}
},
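For context, here is a minimal sketch of how a config like the above is consumed for pipeline-parallel training. The model, layer sizes, and the ds_config.json file name are illustrative assumptions, not the actual train.py; the script is meant to run under the deepspeed launcher so that torch.distributed is initialized.

    import torch.nn as nn
    import deepspeed
    from deepspeed.pipe import PipelineModule

    # Toy stand-in for the real model: a list of layers split across pipeline stages.
    layers = [nn.Linear(1024, 1024) for _ in range(8)]
    model = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.MSELoss())

    # config points at the JSON containing the fp16/optimizer sections above;
    # DeepSpeed builds the AdamW optimizer (fused by default) from it.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
        config="ds_config.json",
    )

    # Pipeline training then proceeds via train_batch on a data iterator:
    # loss = engine.train_batch(data_iter)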
I can launch training, but memory usage is noticeably smaller when I change this line: https://github.com/microsoft/DeepSpeed/blob/0c75f4a3f937febc8c15610fcab7b81466b216c7/deepspeed/runtime/engine.py#L1342 to

    if False:
    # if isinstance(optimizer, fused_opts) \
    #         or self.optimizer_name() in [ONEBIT_ADAM_OPTIMIZER, ZERO_ONE_ADAM_OPTIMIZER]:

which means we always take the “unfused” optimizer path. Training speed is roughly the same with and without this change.
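An alternative to patching engine.py, assuming it behaves the same under the fp16/pipeline path (which I have not verified), is the torch_adam flag that DeepSpeed accepts in the optimizer params to request the plain torch implementation instead of FusedAdam. As a Python dict config equivalent to the JSON above:

    ds_config = {
        "fp16": {
            "enabled": True,
            "auto_cast": False,
            "loss_scale": 0,
            "initial_scale_power": 16,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 1e-5,
                "betas": [0.9, 0.999],
                "eps": 1e-8,
                "weight_decay": 4e-5,
                # Assumption: requests torch.optim.AdamW instead of FusedAdam,
                # so the engine should take the unfused fp16 optimizer path.
                "torch_adam": True,
            },
        },
    }
    # Pass it via deepspeed.initialize(..., config=ds_config) instead of the JSON file.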
Expected behavior: The fused Adam optimizer (the original implementation) is expected to require less GPU memory than the unfused one.
ds_report output
[2023-08-10 19:19:35,777] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/.conda/envs/py39/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/.conda/envs/py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
System info:
- OS: Ubuntu 18.04
- GPU count and types: one machine with 8x A100s
- Python version: 3.9.16
Launcher context: launched with the deepspeed launcher, with the command deepspeed train.py
Docker context: No
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 23 (23 by maintainers)
@CoinCheung, thanks for the feedback. I will clean up the PR for merging. However, for now I will keep fused as the default. The reason is to avoid breaking backward compatibility until we are able to conduct more extensive testing of the unfused path to ascertain functional equivalence. I will also consult internally about this. Thanks.
@tjruwase This is the memory usage on my platform. For “fused” AdamW, the memory usage on the 8 GPUs is:
For “unfused” AdamW, the memory usage is:
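Since the screenshots are not reproduced here, below is a sketch of one way to collect comparable per-GPU peak-memory numbers from inside the training loop. Note that torch.cuda reports caching-allocator memory, which can differ from the totals nvidia-smi shows.

    import torch
    import torch.distributed as dist

    def log_peak_memory(tag: str) -> None:
        # Peak memory seen by the CUDA caching allocator on this rank, in GiB.
        rank = dist.get_rank() if dist.is_initialized() else 0
        allocated = torch.cuda.max_memory_allocated() / 2**30
        reserved = torch.cuda.max_memory_reserved() / 2**30
        print(f"[rank {rank}] {tag}: max_allocated={allocated:.2f} GiB, "
              f"max_reserved={reserved:.2f} GiB")

    # Call after a few warm-up steps in both runs, e.g.:
    # log_peak_memory("fused AdamW")
    # log_peak_memory("unfused AdamW")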