DeepSpeed: why doesn't cpu_checkpointing work?

I have partition_activations and cpu_checkpointing enabled, but activations still seem to stay on the GPU. I only have one GPU, so I can't use model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (the same as model parallelism of degree 1) offload all of its checkpoints to the CPU? My CPU memory is sufficient. Config:

{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
    "wall_clock_breakdown": False,
}
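
For reference, here is a minimal sketch (not part of the original issue) of how this config is typically wired into model code. The "activation_checkpointing" section above only sets the policy; it takes effect only for layers that are run through DeepSpeed's own checkpoint() API rather than torch.utils.checkpoint. Block, Net, and ds_config.json are hypothetical names, and in some setups configure() is called explicitly as shown.

import torch
from deepspeed.runtime.activation_checkpointing.checkpointing import (
    checkpoint,
    configure,
)

class Block(torch.nn.Module):
    """Hypothetical layer whose activations should be checkpointed."""
    def __init__(self, dim):
        super().__init__()
        self.linear = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Net(torch.nn.Module):
    def __init__(self, dim=1024, depth=12):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # DeepSpeed's checkpoint(), not torch.utils.checkpoint.checkpoint,
            # is what honors partition_activations / cpu_checkpointing.
            x = checkpoint(block, x)
        return x

# Point the checkpointing module at the same config passed to
# deepspeed.initialize (mpu_=None on a single GPU / no model parallelism).
configure(mpu_=None, deepspeed_config="ds_config.json")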

Environment: Python 3.6, torch 1.6.0, deepspeed 0.3.7

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:

  1. Switch the flag here to True.
  2. Replace this import with from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint (see the sketch after this list).
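
A hedged sketch of step 2 follows; the exact file and flag from step 1 are only identified by links in the original issue, so they are not reproduced here. The call sites themselves do not change.

# Before: PyTorch's built-in activation checkpointing
# from torch.utils.checkpoint import checkpoint

# After: DeepSpeed's implementation, which reads the
# "activation_checkpointing" section of the DeepSpeed config
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

# Existing call sites stay the same, e.g. (hypothetical):
# hidden_states = checkpoint(custom_forward, hidden_states, attention_mask)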