DeepSpeed: [deepspeed checkpointing] AttributeError: 'NoneType' object has no attribute 'numel'
So I took a public GPT-2 class implementation (not Megatron-LM) and added DeepSpeed activation checkpointing to all 48 of its layers. In my training script for this class, I added the following line:
deepspeed.checkpointing.configure(mpu_=None, deepspeed_config=args.deepspeed_config)
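For context, the blocks themselves are checkpointed along these lines (a minimal sketch of the pattern with toy layer sizes, not my actual modeling_gpt2.py):

```python
import torch
import deepspeed


class ToyBlock(torch.nn.Module):
    """Stand-in for one GPT-2 transformer block."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.ln = torch.nn.LayerNorm(hidden_size)
        self.mlp = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, attention_mask=None):
        out = self.mlp(self.ln(hidden_states))
        if attention_mask is not None:
            out = out + attention_mask
        return out


class ToyTransformer(torch.nn.Module):
    """Stand-in for the GPT-2 model with all 48 blocks checkpointed."""

    def __init__(self, num_layers=48, hidden_size=768):
        super().__init__()
        self.h = torch.nn.ModuleList(ToyBlock(hidden_size) for _ in range(num_layers))

    def forward(self, hidden_states, attention_mask=None):
        for block in self.h:
            # Recompute this block's activations during backward instead of storing them.
            hidden_states = deepspeed.checkpointing.checkpoint(
                block, hidden_states, attention_mask)
        return hidden_states
```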
My deepspeed config JSON is as follows:
{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "optimizer": {
    "type": "adam",
    "params": {
      "lr": 6.25e-5
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": false,
    "allgather_bucket_size": 500000000
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 48,
    "cpu_checkpointing": true
  }
}
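For reference, the training-script side looks roughly like this (illustrative names such as `ToyTransformer` come from the sketch above, not my exact script); the JSON is passed via `--deepspeed_config` and consumed by both `deepspeed.initialize()` and `deepspeed.checkpointing.configure()`:

```python
import argparse
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
# Adds the standard --deepspeed / --deepspeed_config flags.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = ToyTransformer()  # stand-in for the GPT-2 model with checkpointed blocks
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

# Route activation checkpointing through the same JSON config (no model parallelism).
deepspeed.checkpointing.configure(mpu_=None, deepspeed_config=args.deepspeed_config)
```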
When I try running my script, I get the following error:
File "/path/to/my/modeling_gpt2.py", line 221, in forward
encoder_attention_mask)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in checkpoint
return CheckpointFunction.apply(function, *args)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 376, in forward
partition_size = get_partition_size(item)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 275, in get_partition_size
size = item.numel()
AttributeError: 'NoneType' object has no attribute 'numel'
Any ideas what’s going on?
About this issue
- State: open
- Created 4 years ago
- Comments: 23 (10 by maintainers)
@ShadenSmith thanks for the fix! Will test this out later this week and revert back to you!
Hey @g-karthik , happy new year and thanks for the ping! I hope you had a nice holiday. I spent some time away to focus on python scalability 😃.
Thanks for the deep dive on debugging this issue. I think you’re right. DeepSpeed’s activation checkpointing should work as a drop-in replacement in the simple case without partitioning, etc.
I don’t know if we’ve done any benchmarking beyond large-scale runs with model parallelism, where it’s critical for huge activations. Maybe @samyam knows more.
Ah @ShadenSmith I think I know what the issue is, but correct me if I’m wrong. The default `torch.utils.checkpoint.checkpoint` implementation does not assume anything about the availability of all `*args`, i.e., the `CheckpointFunction`’s `forward()` method allows some arguments to be `None`. However, the corresponding DeepSpeed `CheckpointFunction`’s `forward()` method seems to assume in multiple places that all passed `args` are not `None`.

Perhaps these enumerations need to be modified and made more generic to account for cases where a layer being checkpointed could have some args as `None`, depending on the consumer’s choice?

EDIT: Yep, I ran a test where I simply replaced `deepspeed.checkpointing.checkpoint` with `torch.utils.checkpoint.checkpoint` (all else equal), and I can confirm the job runs successfully with the replacement. So there definitely need to be some updates to the DeepSpeed checkpointing implementation.
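To make that concrete, here is roughly the kind of minimal repro I mean (toy layer, not my actual model; it assumes `deepspeed.checkpointing.configure()` has already been called with `partition_activations` enabled, as in the config above):

```python
import torch
import torch.utils.checkpoint
import deepspeed

# Assumes deepspeed.checkpointing.configure(...) already ran with the JSON above
# (partition_activations enabled), so the partitioning code path is exercised.


def layer(hidden_states, attention_mask=None):
    # A checkpointed layer whose optional argument may legitimately be None.
    if attention_mask is not None:
        hidden_states = hidden_states * attention_mask
    return hidden_states * 2.0


x = torch.randn(2, 4, requires_grad=True)

# Stock PyTorch checkpointing tolerates the None positional argument.
torch.utils.checkpoint.checkpoint(layer, x, None).sum().backward()

# The DeepSpeed path iterates over *args (e.g. get_partition_size(item) calls
# item.numel()), so the None argument raises
# AttributeError: 'NoneType' object has no attribute 'numel'.
deepspeed.checkpointing.checkpoint(layer, x, None).sum().backward()
```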
Also, do you happen to have performed any side-by-side benchmarking of `torch.utils.checkpoint.checkpoint` and `deepspeed.checkpointing.checkpoint`? I’d love to know more about it if y’all have done it! 😃