accelerate: Script hangs on `set_seed` and `accelerator.backward(loss)` with A100 GPU
System Info
- `Accelerate` version: 0.12.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- downcast_bf16: no
I also tested with Accelerate version 0.13.2 and can confirm the issue is still present.
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.10
Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.19.0-20-cloud-amd64-x86_64-with-debian-10.13
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
I’m using the Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3 M98 image on Google Cloud Compute.
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
pip install git+https://github.com/ShivamShrirao/diffusers.git
wget https://raw.githubusercontent.com/ShivamShrirao/diffusers/main/examples/dreambooth/requirements.txt
pip install -r requirements.txt
huggingface-cli login
accelerate config # Single GPU on Google Cloud Compute, No CPU, No half precision.
wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
# Optional xformers
# pip install -q https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl
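As an optional sanity check before launching, the accelerate CLI can print the detected platform and the default config (the System Info block above follows this format):
accelerate env  # prints the Accelerate version, platform, PyTorch/CUDA availability, and default config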
Set up the dataset and then run a bash script pointing to it:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export VAE_NAME="stabilityai/sd-vae-ft-mse"
export INSTANCE_DIR="concept_images"
export CLASS_DIR="class_reg_images"
export OUTPUT_DIR="path-to-save-model"
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_name_or_path=$VAE_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks <concept>" \
--class_prompt="<concept>" \
--save_sample_prompt="photo of sks <concept>" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=1290 \
--save_interval=500 \
--max_train_steps=25000 \
--train_text_encoder \
--mixed_precision="no" \
--not_cache_latents
Pass --seed=1337 to trigger the hang in the set_seed function.
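To isolate the hang from the rest of the training script, a stripped-down check along these lines (my own sketch, not part of the original report) should be enough; on the affected image the expectation is that it never prints:
# Minimal sketch: calls accelerate's set_seed directly; on the affected VM image this hangs instead of printing "seed set"
python -c "from accelerate.utils import set_seed; set_seed(1337); print('seed set')"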
Expected behavior
It should not hang.
Issue originally reported here: https://github.com/ShivamShrirao/diffusers/issues/60
About this issue
- State: closed
- Created 2 years ago
- Comments: 19
Hi, yes. It worked for me after uninstalling torch-xla.

I faced this same issue with pytorch-lightning as well. In pytorch-lightning, it was trying to pick TPU since xla was installed; I guess something similar might be happening here as well. Uninstalling the package fixed it.
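A hedged sketch of that workaround, assuming the stock pip environment on the VM image ships a torch-xla wheel (the exact distribution name may vary):
# Check whether an XLA build of PyTorch is present in the active environment
pip list | grep -i xla
# If so, uninstall it; several commenters report this resolves the hang
pip uninstall -y torch-xla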
Hi, I had the same issue, but it worked after uninstalling torch-xla.

In case it's helpful, creating a new conda environment from scratch fixed this for me. I followed this comment, but I changed to the official diffusers repo, set cuda-toolkit=11.6 to match the image, and didn't install xformers.

Hey @seeM (and others 😃) I'll be looking at this during the coming week 😃
Hey @muellerzr 😄 I'm having this issue with a Tesla T4 on GCP, with the default PyTorch image, VM created today: Google, Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3, M103, Deep Learning VM Image with PyTorch 1.13 and fast.ai preinstalled.

I just had it occur with StableTuner (which uses accelerate) on the c2-deeplearning-pytorch-1-12-cu113-v20230126-debian-10 image. I had to use a conda environment to solve the issue.
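A rough sketch of that fresh-environment workaround; the environment name, Python version, and package pins below are illustrative assumptions, with the CUDA 11.6 build chosen to match the cuda-toolkit=11.6 mentioned in the earlier comment:
# Hypothetical clean environment that avoids the image's preinstalled torch-xla
conda create -n dreambooth-env python=3.9 -y
conda activate dreambooth-env
# CUDA 11.6 build of PyTorch, matching the toolkit version the earlier comment pinned
pip install torch==1.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install accelerate diffusers transformers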
I have the same issue on GCP. It's solved for now by using a custom Docker image, pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, instead of GCP's own image.
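If following that route, something along these lines should work, assuming the NVIDIA container toolkit is already set up on the VM (the bind mount path is illustrative):
# Run training inside the stock PyTorch container instead of the GCP image's own environment
docker run --gpus all -it --rm -v "$PWD":/workspace pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash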