DeepSpeed: [BUG] Fused Adam optimizer requires more memory than unfused optimizer
Describe the bug: FusedAdam requires more GPU memory than the non-fused Adam optimizer.
To Reproduce Steps to reproduce the behavior:
- I train with pipeline parallelism.
- The relevant fp16/optimizer sections of my DeepSpeed config are as follows (a usage sketch follows the config):
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-5,
"betas": [
0.9,
0.999
],
"eps": 1e-8,
"weight_decay": 4e-5
}
},
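For context, here is a minimal sketch of how a config like the above is consumed for pipeline-parallel training. The model, layer sizes, and the ds_config.json file name are illustrative assumptions, not the actual train.py; the script is meant to run under the deepspeed launcher so that torch.distributed is initialized.

    import torch.nn as nn
    import deepspeed
    from deepspeed.pipe import PipelineModule

    # Toy stand-in for the real model: a list of layers split across pipeline stages.
    layers = [nn.Linear(1024, 1024) for _ in range(8)]
    model = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.MSELoss())

    # config points at the JSON containing the fp16/optimizer sections above;
    # DeepSpeed builds the AdamW optimizer (fused by default) from it.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
        config="ds_config.json",
    )

    # Pipeline training then proceeds via train_batch on a data iterator:
    # loss = engine.train_batch(data_iter)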
I can launch training, but memory usage is noticeably smaller when I change this line: https://github.com/microsoft/DeepSpeed/blob/0c75f4a3f937febc8c15610fcab7b81466b216c7/deepspeed/runtime/engine.py#L1342 to

    if False:
    # if isinstance(optimizer, fused_opts) \
    #         or self.optimizer_name() in [ONEBIT_ADAM_OPTIMIZER, ZERO_ONE_ADAM_OPTIMIZER]:

which means we always take the “unfused” optimizer path. Training speed is roughly the same with and without this change.
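An alternative to patching engine.py, assuming it behaves the same under the fp16/pipeline path (which I have not verified), is the torch_adam flag that DeepSpeed accepts in the optimizer params to request the plain torch implementation instead of FusedAdam. As a Python dict config equivalent to the JSON above:

    ds_config = {
        "fp16": {
            "enabled": True,
            "auto_cast": False,
            "loss_scale": 0,
            "initial_scale_power": 16,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 1e-5,
                "betas": [0.9, 0.999],
                "eps": 1e-8,
                "weight_decay": 4e-5,
                # Assumption: requests torch.optim.AdamW instead of FusedAdam,
                # so the engine should take the unfused fp16 optimizer path.
                "torch_adam": True,
            },
        },
    }
    # Pass it via deepspeed.initialize(..., config=ds_config) instead of the JSON file.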
Expected behavior: The fused Adam optimizer (the original implementation) is expected to require less GPU memory than the unfused one.
ds_report output
[2023-08-10 19:19:35,777] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/.conda/envs/py39/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/.conda/envs/py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
System info:
- OS: Ubuntu 18.04
- GPU count and types: one machine with 8x A100s
- Python version: 3.9.16
Launcher context: launched with the deepspeed launcher, with the command deepspeed train.py
Docker context: No
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 23 (23 by maintainers)
@CoinCheung, thanks for the feedback. I will clean up the PR for merging. However, for now I will keep fused as the default. The reason is to avoid breaking backward compatibility until we are able to conduct more extensive testing of the unfused path to ascertain functional equivalence. I will also consult internally about this. Thanks.
@tjruwase This is the memory usage on my platform. For “fused” AdamW, the memory usage on the 8 GPUs is:
For “unfused” AdamW, the memory usage is:
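Since the screenshots are not reproduced here, below is a sketch of one way to collect comparable per-GPU peak-memory numbers from inside the training loop. Note that torch.cuda reports caching-allocator memory, which can differ from the totals nvidia-smi shows.

    import torch
    import torch.distributed as dist

    def log_peak_memory(tag: str) -> None:
        # Peak memory seen by the CUDA caching allocator on this rank, in GiB.
        rank = dist.get_rank() if dist.is_initialized() else 0
        allocated = torch.cuda.max_memory_allocated() / 2**30
        reserved = torch.cuda.max_memory_reserved() / 2**30
        print(f"[rank {rank}] {tag}: max_allocated={allocated:.2f} GiB, "
              f"max_reserved={reserved:.2f} GiB")

    # Call after a few warm-up steps in both runs, e.g.:
    # log_peak_memory("fused AdamW")
    # log_peak_memory("unfused AdamW")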