dolly: CUDA out of memory. Can this be run on a p3.16xlarge?

I am training the EleutherAI/pythia-2.8b model on a p3.16xlarge. I tried the instructions for training on smaller instances, but still got a CUDA out of memory error at epoch 0.14. I also wasn't able to use a batch size of 3 and had to reduce it to 1. I'm training on all 8 GPUs.

Here is my DeepSpeed config:

{
  "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Any advice?

About this issue

  • State: closed
  • Created a year ago
  • Comments: 23

Most upvoted comments

See guidance here: https://github.com/databrickslabs/dolly#v100-gpus-1

The p3.16xlarge has 16GB V100s; a p3dn.24xlarge (32GB V100s) would be better. You may be able to make it work by configuring optimizer offload and turning down the batch size.
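To make "turning down the batch size" concrete: instead of leaving the batch settings on "auto", you can pin them in the DeepSpeed config. The numbers below are only an illustration for 8 GPUs, not values tested on this setup:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "train_batch_size": 64
}

DeepSpeed requires train_batch_size = micro batch per GPU × gradient accumulation steps × number of GPUs (1 × 8 × 8 = 64 here), so adjust all three together.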

Thanks for your help. Successfully got it working. Closing this issue now.

Thanks Sean as well for all your help and great support for the project!

16GB is small for training, yeah. You can try param offload too, but then it'll be slower. You really want bigger GPUs here - maybe g5 instances (A10s)?
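For anyone trying param offload: it's an extra offload_param block next to offload_optimizer in the ZeRO stage 3 section. A rough sketch of how the zero_optimization section of the config above would look (untested here; it trades GPU memory for CPU RAM and extra PCIe traffic, hence the slowdown):

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }

Just make sure the box has enough CPU RAM to hold the offloaded parameters and optimizer states.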