DeepSpeed: [deepspeed checkpointing] AttributeError: 'NoneType' object has no attribute 'numel'

So I took a public GPT-2 class implementation (not Megatron-LM) and added DeepSpeed activation checkpointing to all 48 of its layers. In the training script for this class, I added the following line:

deepspeed.checkpointing.configure(mpu_=None, deepspeed_config=args.deepspeed_config)
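
For reference, the per-layer wrapping in the model's forward pass looks roughly like this (a simplified sketch; self.h, hidden_states, head_mask, etc. are placeholders for my actual GPT-2 blocks and inputs, some of which can legitimately be None):

import deepspeed

# Simplified sketch: each of the 48 blocks goes through DeepSpeed's
# activation checkpointing instead of being called directly.
for i, block in enumerate(self.h):
    outputs = deepspeed.checkpointing.checkpoint(
        block,
        hidden_states,
        attention_mask,
        head_mask[i],
        encoder_hidden_states,   # None in a decoder-only setup
        encoder_attention_mask,  # also None here
    )
    # the block returns a tuple whose first element is the new hidden_states
    hidden_states = outputs[0]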

My deepspeed config JSON is as follows:

{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "optimizer": {
    "type": "adam",
    "params": {
      "lr": 6.25e-5
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": false,
    "allgather_bucket_size": 500000000
  },

  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 48,
    "cpu_checkpointing": true
  }

}

When I try running my script, I get the following error:

  File "/path/to/my/modeling_gpt2.py", line 221, in forward
    encoder_attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in checkpoint
    return CheckpointFunction.apply(function, *args)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 376, in forward
    partition_size = get_partition_size(item)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 275, in get_partition_size
    size = item.numel()
AttributeError: 'NoneType' object has no attribute 'numel'

Any ideas what’s going on?

@tjruwase @ShadenSmith

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 23 (10 by maintainers)

Most upvoted comments

@ShadenSmith thanks for the fix! I will test this out later this week and get back to you!

Hey @g-karthik, happy new year and thanks for the ping! I hope you had a nice holiday. I spent some time away to focus on Python scalability 😃.

Thanks for the deep dive on debugging this issue. I think you’re right. DeepSpeed’s activation checkpointing should work as a drop-in replacement in the simple case without partitioning, etc.

I don’t know if we’ve done any benchmarking beyond the large-scale setting with model parallelism, where checkpointing is critical for huge activations. Maybe @samyam knows more.

Ah @ShadenSmith I think I know what the issue is, but correct me if I’m wrong. The default torch.utils.checkpoint.checkpoint implementation does not assume that all *args are tensors, i.e., its CheckpointFunction.forward() allows some arguments to be None.

However, the corresponding DeepSpeed CheckpointFunction.forward() seems to assume in multiple places that every passed arg is a non-None tensor.

Perhaps these enumerations over the args need to be made more generic, to account for cases where a layer being checkpointed has some args as None, depending on the consumer’s choice?
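
Something like the following is what I have in mind (purely a hypothetical sketch on my side, not the actual DeepSpeed code): only tensor args should be measured and partitioned, and everything else (None, ints, bools) should just be passed through.

import torch

# Hypothetical guard: non-tensor args (e.g. None) contribute nothing to the
# partition size instead of blowing up on .numel().
def get_partition_size_safe(item, mp_size=1):
    if not torch.is_tensor(item):
        return 0
    return item.numel() // mp_size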

EDIT: Yep, I ran a test where I simply replaced deepspeed.checkpointing.checkpoint with torch.utils.checkpoint.checkpoint (all else equal), and I can confirm the job runs successfully with the replacement. So the DeepSpeed checkpointing implementation definitely needs some updates.
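
Here is a stripped-down illustration of the behavioral difference (not my actual model code; block is just a toy stand-in for a checkpointed layer with an optional argument):

import torch
from torch.utils.checkpoint import checkpoint

def block(x, mask):
    # mask is optional, just like encoder_attention_mask in my model
    if mask is not None:
        x = x * mask
    return x * 2

x = torch.randn(4, 8, requires_grad=True)
out = checkpoint(block, x, None)   # torch passes the None arg straight through
out.sum().backward()

# The analogous deepspeed.checkpointing.checkpoint(block, x, None) call is the
# one that ends up in get_partition_size() and fails with
# AttributeError: 'NoneType' object has no attribute 'numel'.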

Also, do you happen to have performed any side-by-side benchmarking of torch.utils.checkpoint.checkpoint and deepspeed.checkpointing.checkpoint? I’d love to know more about it if y’all have done it! 😃