diffusers: RuntimeError: CUDA error: invalid argument when using xformers
Describe the bug
When trying to run train_dreambooth.py with --enable_xformers_memory_efficient_attention the process exits with this error:
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Steps: 0%| | 0/400 [00:07<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/*****/anaconda3/envs/sd-gpu/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/accelerate_c │
│ li.py:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/launch.py:11 │
│ 04 in launch_command │
│ │
│ 1101 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 1102 │ │ sagemaker_launcher(defaults, args) │
│ 1103 │ else: │
│ ❱ 1104 │ │ simple_launcher(args) │
│ 1105 │
│ 1106 │
│ 1107 def main(): │
│ │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/launch.py:56 │
│ 7 in simple_launcher │
│ │
│ 564 │ process = subprocess.Popen(cmd, env=current_env) │
│ 565 │ process.wait() │
│ 566 │ if process.returncode != 0: │
│ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 568 │
│ 569 │
│ 570 def multi_gpu_launcher(args): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
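As the error message suggests, setting `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the reported stack trace points at the actual failing call instead of a later API call. It has to be set before any CUDA context is created; a minimal sketch (setting it in the shell before `accelerate launch` works just as well):

```python
import os

# Must run before torch (or anything else) initializes CUDA,
# i.e. at the very top of the training script or in the launching shell.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```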
Reproduction
accelerate launch train_dreambooth.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --instance_data_dir=./inputs --output_dir=./outputs --instance_prompt="a photo of sks dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=5e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=400 --enable_xformers_memory_efficient_attention
Logs
No response
System Info
- diffusers version: 0.12.0.dev0
- Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.8
- PyTorch version (GPU?): 1.13.0 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 0.15.0
- Accelerate version: not installed
- xFormers version: 0.0.15.dev395+git.7e05e2c
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: single GPU
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 19 (13 by maintainers)
While I’m no longer getting an error, it looks like the model doesn’t learn anymore: the images generated after training are the same as the ones generated before it.
However, I’ve found an older version of xformers which works just fine: https://github.com/facebookresearch/xformers/commit/0bad001ddd56c080524d37c84ff58d9cd030ebfd. This seems to be the last commit that works for me, as far as I can tell from a few tests using later commits.
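Pinning to that specific commit can be done with a pip VCS install (a sketch; the hash is the one quoted above, and building from source requires `nvcc` and can take a while):

```shell
# Install xformers from the last commit reported to work.
XFORMERS_GOOD_COMMIT=0bad001ddd56c080524d37c84ff58d9cd030ebfd
pip install "git+https://github.com/facebookresearch/xformers.git@${XFORMERS_GOOD_COMMIT}"
```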
Here’s my environment and installation process.
GPU: 3060 CUDA version: 11.8 Python version: 3.10 OS: Arch Linux
Installation:
If `nvcc` is not on `$PATH` (like on Arch Linux), you can change the last line and specify the path to CUDA like this:

Some details about versions:
- `ninja` is installed to build xformers faster.
- `bitsandbytes` must be 0.35 because of this. Also, training with 0.35.4 makes the model generate blue noise for me, while 0.35.1 works fine.
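The version notes above translate to pins along these lines (a sketch; the exact `bitsandbytes` patch level follows the comment above):

```shell
pip install ninja                 # build helper: speeds up compiling xformers from source
pip install bitsandbytes==0.35.1  # 0.35.4 reportedly produced blue-noise outputs; 0.35.1 worked
```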
Edit: seems to work with both torch 1.12.1 and 1.13.1, updated the version information.
Those are the two where it definitely works 😃. The archs I know have issues are SM8x except SM80 (so 30xx and 40xx mostly).
(Although it looks like there’s a bit more action in the xformers repo, so this might actually get fixed upstream at some point now.)
I tried this and the 0.17 pre-release.
I’ll report in xformers, but I believe I found a related issue there already.
Best,
Evan Jones (www.ea-jones.com)
On Wed, Feb 1, 2023 at 3:44 AM Suraj Patil @.***> wrote:
Could be an issue with the `xformers` version. I have been using the `xformers` pre-release and it seems to be working without any issues: https://pypi.org/project/xformers/#history

@davidpfahler in the meantime, using this to enable xformers instead of the built-in `enable_xformers_memory_efficient_attention` method should work:
https://github.com/cloneofsimo/lora/blob/master/lora_diffusion/xformers_utils.py#L42
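A minimal, hedged sketch of the kind of guarded enabling such a workaround performs (the `unet` argument is any object exposing diffusers' `enable_xformers_memory_efficient_attention`; the helper name `try_enable_xformers` is illustrative, not part of any library):

```python
def try_enable_xformers(unet):
    """Try to switch `unet` to xformers attention; fall back gracefully on failure.

    Returns True if xformers attention was enabled, False otherwise.
    """
    try:
        unet.enable_xformers_memory_efficient_attention()
        return True
    except Exception as exc:
        # e.g. "RuntimeError: CUDA error: invalid argument" on unsupported arches
        print(f"xformers unavailable, using default attention: {exc}")
        return False
```

With a guard like this, training still runs (just without the memory savings) instead of crashing inside the first attention call.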
This might be an upstream bug in xformers https://github.com/facebookresearch/xformers/issues/563