accelerate: OOM Error on fine-tuning gpt-j-6b
I am trying to fine-tune gpt-j-6b on the wikitext dataset using the run_clm.py script provided here: https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
I am launching it as follows:
accelerate launch run_clm.py \
--model_name_or_path EleutherAI/gpt-j-6b \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--do_eval \
--output_dir /tmp/test-clm
I am trying to use FSDP to run this on NVIDIA A100 GPUs, each with 40 GB of memory. I tried running on 2 and on 4 GPUs, and it still gives an Out Of Memory error. I also tried reducing per_device_train_batch_size and per_device_eval_batch_size to 1. I am not sure whether I can run this with the accelerate library or not; I thought 4x 40 GB A100 GPUs should be enough.
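For context, an FSDP-enabled accelerate config for this kind of setup might look roughly like the sketch below (key names vary between accelerate versions, and GPTJBlock as the transformer layer class to wrap is an assumption for GPT-J):
# Sketch of an FSDP-enabled accelerate config, e.g. the file written by `accelerate config`
# (often ~/.cache/huggingface/accelerate/default_config.yaml). Exact key names depend on the
# accelerate version; GPTJBlock is an assumed wrap class for GPT-J.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16          # bf16 roughly halves parameter/activation memory on A100s
num_machines: 1
num_processes: 4               # one process per GPU
fsdp_config:
  fsdp_sharding_strategy: 1    # 1 = FULL_SHARD (shard params, grads and optimizer state)
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: GPTJBlock
  fsdp_offload_params: true    # offload sharded params to CPU if GPU memory is still tight
  fsdp_state_dict_type: FULL_STATE_DICT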
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 29
By forums, I meant https://discuss.huggingface.co/c/accelerate/18
Also, with accelerate launch you must use run_clm_no_trainer.py and not run_clm.py. run_clm.py uses Trainer's integration of FSDP with the related args mentioned here: https://huggingface.co/docs/transformers/main_classes/trainer#pytorch-fully-sharded-data-parallel
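For illustration, the two launch paths might look roughly like this (a sketch; --fsdp and --fsdp_transformer_layer_cls_to_wrap are Trainer arguments whose exact names depend on the transformers version, and GPTJBlock is an assumed wrap class for GPT-J):
# Path 1: accelerate launch with run_clm_no_trainer.py (FSDP settings come from the accelerate config)
accelerate launch run_clm_no_trainer.py \
--model_name_or_path EleutherAI/gpt-j-6b \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 1 \
--output_dir /tmp/test-clm

# Path 2: run_clm.py with Trainer's own FSDP integration (launched with torchrun, not accelerate launch)
torchrun --nproc_per_node 4 run_clm.py \
--model_name_or_path EleutherAI/gpt-j-6b \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap GPTJBlock \
--bf16 \
--output_dir /tmp/test-clm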