DeepSpeed: why doesn't cpu_checkpointing work?
I have partition_activations and cpu_checkpointing enabled, but activations still seem to stay on the GPU. I have just one GPU, so I can't use model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (effectively the same as model parallelism of degree 1) offload all of its checkpoints to the CPU? My CPU memory is sufficient. Config:
{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
    "wall_clock_breakdown": False,
}
Environment: Python 3.6, torch 1.6.0, deepspeed 0.3.7
About this issue
- State: open
- Created 4 years ago
- Comments: 18 (11 by maintainers)
@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the recomputation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
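A minimal sketch of how that import is typically used inside a model's forward pass, replacing torch.utils.checkpoint; the layer stack and names here are placeholders, not from the original comment:

    import torch
    from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

    class Block(torch.nn.Module):  # hypothetical transformer-style block
        def __init__(self, dim=8):
            super().__init__()
            self.linear = torch.nn.Linear(dim, dim)

        def forward(self, x):
            return torch.relu(self.linear(x))

    class Net(torch.nn.Module):
        def __init__(self, num_layers=4, dim=8):
            super().__init__()
            self.layers = torch.nn.ModuleList(Block(dim) for _ in range(num_layers))

        def forward(self, hidden_states):
            for layer in self.layers:
                # DeepSpeed's checkpoint has the same call shape as
                # torch.utils.checkpoint.checkpoint, but it is the one that the
                # "activation_checkpointing" config section controls.
                hidden_states = checkpoint(layer, hidden_states)
            return hidden_states

Activations checkpointed through torch.utils.checkpoint bypass DeepSpeed's settings entirely, which may explain why partition_activations and cpu_checkpointing appear to have no effect.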