dolly: CUDA out of memory. Can this be run on a p3.16xlarge?

I am training the EleutherAI/pythia-2.8b model on a p3.16xlarge. I tried the instructions for training on smaller instances, but still got a CUDA out of memory error at epoch 0.14. I also wasn't able to use a batch size of 3 and had to reduce it to 1. I'm training on all 8 GPUs.

Here is my DeepSpeed config:

{
  "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Any advice?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 23

Most upvoted comments

See guidance here: https://github.com/databrickslabs/dolly#v100-gpus-1

The p3.16xlarge has 16GB V100s; a p3dn.24xlarge (32GB V100s) would be better. You may be able to make it work by configuring optimizer offload and turning down the batch size.
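To make "turning down the batch size" concrete: instead of leaving the batch settings on "auto", you can pin them in the DeepSpeed config. The numbers below are only an illustration for 8 GPUs, not values tested on this setup:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "train_batch_size": 64
}

DeepSpeed requires train_batch_size = micro batch per GPU × gradient accumulation steps × number of GPUs (1 × 8 × 8 = 64 here), so adjust all three together.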

Thanks for your help. Successfully got it working. Closing this issue now.

Thanks Sean as well for all your help and great support for the project!

16GB is small for training, yeah. You can try param offload too, but then it'll be slower. You really want bigger GPUs here - maybe g5 instances (A10s)?
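For anyone trying param offload: it's an extra offload_param block next to offload_optimizer in the ZeRO stage 3 section. A rough sketch of how the zero_optimization section of the config above would look (untested here; it trades GPU memory for CPU RAM and extra PCIe traffic, hence the slowdown):

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }

Just make sure the box has enough CPU RAM to hold the offloaded parameters and optimizer states.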