DeepSpeed: why doesn't cpu_checkpointing work?
I have partition_activations and cpu_checkpointing enabled, but activations still seem to stay on the GPU. I have just one GPU, so I can't use model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (effectively the same as model parallelism of degree 1) offload all of its checkpoints to the CPU? My CPU memory is sufficient. Config:
{
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "contiguous_gradients": True,
    },
    "train_batch_size": 2,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
    },
    "wall_clock_breakdown": False,
}
Environment: Python 3.6, torch 1.6.0, deepspeed 0.3.7
About this issue
- State: open
- Created 4 years ago
- Comments: 18 (11 by maintainers)
@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the recomputation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
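A minimal sketch of how that import is typically used inside a model's forward pass, replacing torch.utils.checkpoint; the layer stack and names here are placeholders, not from the original comment:

    import torch
    from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

    class Block(torch.nn.Module):  # hypothetical transformer-style block
        def __init__(self, dim=8):
            super().__init__()
            self.linear = torch.nn.Linear(dim, dim)

        def forward(self, x):
            return torch.relu(self.linear(x))

    class Net(torch.nn.Module):
        def __init__(self, num_layers=4, dim=8):
            super().__init__()
            self.layers = torch.nn.ModuleList(Block(dim) for _ in range(num_layers))

        def forward(self, hidden_states):
            for layer in self.layers:
                # DeepSpeed's checkpoint has the same call shape as
                # torch.utils.checkpoint.checkpoint, but it is the one that the
                # "activation_checkpointing" config section controls.
                hidden_states = checkpoint(layer, hidden_states)
            return hidden_states

Activations checkpointed through torch.utils.checkpoint bypass DeepSpeed's settings entirely, which may explain why partition_activations and cpu_checkpointing appear to have no effect.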