accelerate: ValueError: Query/Key/Value should either all have the same dtype

Training with --enable_xformers_memory_efficient_attention fails with the following error when run with accelerate:

File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/xformers/ops/fmha/__init__.py", line 348, in _memory_efficient_attention_forward_requires_grad
    inp.validate_inputs()
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/xformers/ops/fmha/common.py", line 121, in validate_inputs
    raise ValueError(
ValueError: Query/Key/Value should either all have the same dtype, or (in the quantized case) Key/Value should have dtype torch.int32
  query.dtype: torch.float32
  key.dtype  : torch.float16
  value.dtype: torch.float16
Steps:   0%|                                           | 0/1000 [00:02<?, ?it/s]
[2023-11-25 14:03:17,898] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5264) of binary: /anaconda/envs/diffusers-ikin/bin/python
Traceback (most recent call last):
  File "/anaconda/envs/diffusers-ikin/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/anaconda/envs/diffusers-ikin/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py FAILED
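
For context, the failing check lives in xformers' input validation and can be reproduced in isolation. A minimal sketch (the shapes are made up; only the dtype split matters, and a CUDA device is assumed):

import torch
import xformers.ops as xops

# Hypothetical tensors; only the dtypes matter here. xformers validates that
# query, key, and value share a dtype before dispatching a kernel.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float32)  # query left in fp32
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

xops.memory_efficient_attention(q, k, v)              # raises the ValueError above
xops.memory_efficient_attention(q.to(k.dtype), k, v)  # passes: all three are fp16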

I am running accelerate as follows:

accelerate launch diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py \
      --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
      --instance_data_dir={input_dir} \
      --output_dir={output_dir} \
      --instance_prompt=instance_prompt \
      --mixed_precision="fp16" \
      --resolution=1024 \
      --train_batch_size=1 \
      --gradient_accumulation_steps=4 \
      --learning_rate=1e-4 \
      --lr_scheduler="constant" \
      --lr_warmup_steps=0 \
      --checkpointing_steps=500 \
      --max_train_steps=1000 \
      --seed="0" \
      --checkpoints_total_limit=5 \
      --enable_xformers_memory_efficient_attention
Accelerate config:
{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

Versions:

xformers==0.0.23.dev687
accelerate==0.24.1
torch==2.1.0
torchvision==0.16.1

Most upvoted comments

Thanks for looking into it. I’ll keep --enable_xformers_memory_efficient_attention enabled to reap its benefits though. 🙂

If it helps, training works if I remove mixed precision, like this:

accelerate launch $SCRIPT_PATH \
  --enable_xformers_memory_efficient_attention \
  # ... same arguments as before, but without --mixed_precision="fp16"

Still, I’d like to use mixed_precision="fp16" for the speed boost.

If I run accelerate without the --enable_xformers_memory_efficient_attention flag, training works fine. It looks like the query tensor is somehow upcast to float32 before it reaches xformers.
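
One way to test that hypothesis is to cast the query back down before the call. A hand-rolled sketch, assuming a wrapper of your own rather than anything shipped by diffusers:

import torch
import xformers.ops as xops

def matched_dtype_attention(query, key, value, attn_bias=None):
    # Assumed workaround: force Q/K/V to a common dtype (the key's, fp16 under
    # --mixed_precision="fp16") so xformers' validate_inputs check passes.
    dtype = key.dtype
    return xops.memory_efficient_attention(
        query.to(dtype), key.to(dtype), value.to(dtype), attn_bias=attn_bias
    )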

The ValueError: Attempting to unscale FP16 gradients. error means that the trainable model weights themselves are in fp16. See this thread for more information.
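
For reference, the usual remedy for that unscale error is to keep only the trainable parameters in fp32 while the frozen base model stays in fp16, since GradScaler refuses to unscale fp16 gradients. A sketch, where unet and device are assumed names from a diffusers-style training script:

import torch

unet.to(device, dtype=torch.float16)       # frozen base weights in half precision
for param in unet.parameters():
    if param.requires_grad:                # only the trainable (LoRA) parameters
        param.data = param.data.to(torch.float32)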

As for the other issue, it seems that compiling both PyTorch and xformers with the same CUDA version doesn't work. I will try to debug that.

I think the problem is that xformers and PyTorch were compiled against different CUDA versions: xformers is compiled with CUDA 11.8 or higher, but your PyTorch build uses CUDA 11.7 or lower. The solution is to download the xformers source code and compile it in your own environment.
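
Before recompiling anything, it is worth checking whether the builds actually disagree:

import torch
print(torch.__version__)    # e.g. 2.1.0+cu118; the suffix hints at the CUDA build
print(torch.version.cuda)   # CUDA version PyTorch was compiled against

# xformers reports its own build info, including its CUDA version, via:
#   python -m xformers.info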

Still the same issue here. Thanks for the workaround, @JakobLS.