accelerate: OOM Error on fine-tuning gpt-j-6b

I am trying to fine-tune gpt-j-6b on the wikitext dataset using the run_clm.py script provided here (https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling). I am launching it as follows:

accelerate launch run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

I am trying to use FSDP to run this on NVIDIA A100 GPUs, each with 40 GB of memory. I tried running on both 2 and 4 GPUs, but it still gives an out-of-memory error, even after reducing per_device_train_batch_size and per_device_eval_batch_size to 1. I am not sure whether this can be run with the accelerate library at all; I thought 4x 40 GB A100 GPUs should be enough.
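
For completeness, the FSDP launch I am attempting looks roughly like this (a sketch only: the exact accelerate launch flag names can differ between accelerate versions, the same options can instead be set via accelerate config, and wrapping GPTJBlock is my assumption for GPT-J):

# FSDP flags are assumed; equivalent settings can go in the accelerate config file
accelerate launch \
    --num_processes 4 \
    --use_fsdp \
    --fsdp_sharding_strategy 1 \
    --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
    --fsdp_transformer_layer_cls_to_wrap GPTJBlock \
    --mixed_precision bf16 \
    run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm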

About this issue

  • State: closed
  • Created a year ago
  • Comments: 29

Most upvoted comments

Also, with accelerate launch you must use run_clm_no_trainer.py, not run_clm.py. run_clm.py uses the Trainer's FSDP integration, with the related args described here: https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel
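
If you stay with run_clm.py, a rough sketch of the Trainer-based FSDP launch would look like the following (not verified; the --fsdp values and GPTJBlock as the transformer layer class to wrap are assumptions, and newer transformers versions move these options into --fsdp_config):

torchrun --nproc_per_node=4 run_clm.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --bf16 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap GPTJBlock \
    --output_dir /tmp/test-clm

For the accelerate route, the same idea applies but with run_clm_no_trainer.py as the script and the FSDP options supplied through accelerate config (or the corresponding accelerate launch flags) rather than as script arguments.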