accelerate: Script hangs on `set_seed` and `accelerator.backward(loss)` with A100 GPU

System Info

- `Accelerate` version: 0.12.0
- Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - downcast_bf16: no

I also tested with Accelerate version 0.13.2 and can confirm the issue is still present.

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.10

Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.19.0-20-cloud-amd64-x86_64-with-debian-10.13
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to: 
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

I’m using the Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3, M98 image on Google Cloud Compute.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

pip install git+https://github.com/ShivamShrirao/diffusers.git
wget https://raw.githubusercontent.com/ShivamShrirao/diffusers/main/examples/dreambooth/requirements.txt
pip install -r requirements.txt

huggingface-cli login

accelerate config # Single GPU on Google Cloud Compute, No CPU, No half precision.

wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py

# Optional xformers
# pip install -q https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl

Set up the dataset and then run a bash script pointing to it:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export VAE_NAME="stabilityai/sd-vae-ft-mse"
export INSTANCE_DIR="concept_images"
export CLASS_DIR="class_reg_images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_name_or_path=$VAE_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks <concept>" \
  --class_prompt="<concept>" \
  --save_sample_prompt="photo of sks <concept>" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=1290 \
  --save_interval=500 \
  --max_train_steps=25000 \
  --train_text_encoder \
  --mixed_precision="no" \
  --not_cache_latents

Pass --seed=1337 to trigger the hang at the set_seed function.
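
For what it’s worth, the hang should be isolatable to a much smaller script on the affected VM. A minimal sketch (assuming the problem really is in set_seed itself rather than the surrounding DreamBooth code; the file name minimal_hang_repro.py is made up for illustration):

# minimal_hang_repro.py -- sketch: isolate the hang outside train_dreambooth.py
from accelerate.utils import set_seed

print("calling set_seed(1337)...")
set_seed(1337)                # on the affected environment this never returns
print("set_seed returned")    # not reached while the bug triggers

Running it with plain python (rather than accelerate launch) would also help rule out the launcher.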

Expected behavior

The script should not hang; training should proceed through set_seed and accelerator.backward(loss) as normal.

Issue originally reported here: https://github.com/ShivamShrirao/diffusers/issues/60

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19

Most upvoted comments

Can others verify whether they had torch-xla installed on their machines? That would be amazing if so. Nice catch @josephenguehard, I’ll check this out today as well on my A100 setup.
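
A quick way to check (just a sketch; nothing Accelerate-specific is assumed) whether torch-xla is present in the environment that hangs:

# check_torch_xla.py -- sketch: report whether torch_xla is installed at all
import importlib.util

spec = importlib.util.find_spec("torch_xla")
print("torch_xla installed:", spec is not None)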

Hi, yes. It worked for me after uninstalling torch-xla.

I faced this same issue with pytorch-lightning as well. In pytorch-lightning it was trying to pick the TPU accelerator since torch-xla was installed; I guess something similar might be happening here as well.

Uninstalling the package fixed it.
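
That would also explain the behaviour here. As far as I can tell (rough paraphrase from memory, not the actual Accelerate source), set_seed takes its TPU branch whenever torch_xla is merely importable, regardless of whether an XLA device is actually attached, and the XLA call then blocks waiting for one. Roughly:

# Rough paraphrase (from memory, not the actual Accelerate 0.12 source) of why
# an installed-but-unused torch-xla could hang set_seed on a GPU-only machine.
import random
import numpy as np
import torch

try:
    import torch_xla.core.xla_model as xm
    XLA_IMPORTABLE = True      # the import succeeds even with no TPU attached
except ImportError:
    XLA_IMPORTABLE = False

def set_seed_sketch(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if XLA_IMPORTABLE:         # importability alone triggers the TPU branch...
        xm.set_rng_state(seed) # ...and this blocks waiting for an XLA device

If that reading is right, the same device detection would presumably also explain the later hang at accelerator.backward(loss).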

Hi,

I had the same issue, but it worked after uninstalling torch-xla:

pip uninstall torch-xla

In case it’s helpful, creating a new conda environment from scratch fixed this for me. I followed this comment, but I changed to the official diffusers repo, set cuda-toolkit=11.6 to match the image, and didn’t install xformers.

Hey @seeM (and others 😃 ) I’ll be looking at this during the coming week 😃

Hey @muellerzr 😄 I’m having this issue with a Tesla T4 on GCP, with the default PyTorch image, on a VM created today: Google, Debian 10 based Deep Learning VM for PyTorch CPU/GPU with CUDA 11.3, M103 (Deep Learning VM Image with PyTorch 1.13 and fast.ai preinstalled).

I just had it occur with StableTuner (which uses accelerate) on the c2-deeplearning-pytorch-1-12-cu113-v20230126-debian-10 image.

I had to use a conda environment to solve the issue.

I have the same issue on GCP. For now it’s solved by using a custom Docker image, pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, instead of GCP’s own image.