diffusers: Custom Diffusion: RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::Half != float
Describe the bug
When running Custom Diffusion on my repository of 20 photos, I run into the error below, which appears to be caused by a data type (dtype) mismatch…
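For context, the same RuntimeError can be reproduced in isolation by mixing fp16 activations with fp32 weights in a single linear layer (a minimal sketch, not the actual training code):

import torch
import torch.nn.functional as F

x = torch.randn(2, 4, dtype=torch.float16)  # activations in fp16, as under mixed precision
w = torch.randn(8, 4, dtype=torch.float32)  # weights left in fp32
F.linear(x, w)  # RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::Half != float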
Reproduction
!accelerate launch train_custom_diffusion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --class_data_dir=$class_data_dir \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --class_prompt="person" \
  --num_class_images=200 \
  --instance_prompt="photo of a <new1> person" \
  --resolution=512 \
  --train_batch_size=2 \
  --learning_rate=5e-6 \
  --lr_warmup_steps=0 \
  --max_train_steps=1200 \
  --freeze_model=crossattn \
  --scale_lr \
  --hflip \
  --use_8bit_adam \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --modifier_token "<new1>" \
  --validation_prompt="<new1> person sitting in a bucket"
Logs
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
02/06/2024 23:50:18 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'sample_max_value', 'timestep_spacing', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'scaling_factor', 'force_upcast'} was not found in config. Values will be initialized to default values.
{'mid_block_only_cross_attention', 'cross_attention_norm', 'encoder_hid_dim', 'encoder_hid_dim_type', 'reverse_transformer_layers_per_block', 'attention_type', 'time_embedding_act_fn', 'projection_class_embeddings_input_dim', 'time_embedding_dim', 'mid_block_type', 'transformer_layers_per_block', 'class_embed_type', 'conv_out_kernel', 'class_embeddings_concat', 'addition_time_embed_dim', 'addition_embed_type', 'addition_embed_type_num_heads', 'conv_in_kernel', 'time_embedding_type', 'resnet_skip_time_act', 'num_attention_heads', 'resnet_out_scale_factor', 'resnet_time_scale_shift', 'time_cond_proj_dim', 'timestep_post_act', 'dropout'} was not found in config. Values will be initialized to default values.
[42170]
02/06/2024 23:52:45 - INFO - __main__ - ***** Running training *****
02/06/2024 23:52:45 - INFO - __main__ - Num examples = 200
02/06/2024 23:52:45 - INFO - __main__ - Num batches each epoch = 100
02/06/2024 23:52:45 - INFO - __main__ - Num Epochs = 12
02/06/2024 23:52:45 - INFO - __main__ - Instantaneous batch size per device = 2
02/06/2024 23:52:45 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
02/06/2024 23:52:45 - INFO - __main__ - Gradient Accumulation steps = 1
02/06/2024 23:52:45 - INFO - __main__ - Total optimization steps = 1200
Steps: 0%| | 0/1200 [00:00<?, ?it/s]/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
File "/home/anasrezklinux/test_pycharm_link/diffusers/examples/custom_diffusion/train_custom_diffusion.py", line 1350, in <module>
main(args)
File "/home/anasrezklinux/test_pycharm_link/diffusers/examples/custom_diffusion/train_custom_diffusion.py", line 1131, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1121, in forward
sample, res_samples = downsample_block(
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 1189, in forward
hidden_states = attn(
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 379, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
ret = function(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 374, in custom_forward
return module(*inputs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/attention.py", line 366, in forward
attn_output = self.attn2(
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 512, in forward
return self.processor(
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1429, in __call__
query = self.to_q_custom_diffusion(hidden_states).to(attn.to_q.weight.dtype)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::Half != float
Steps: 0%| | 0/1200 [00:22<?, ?it/s]
Traceback (most recent call last):
File "/home/anasrezklinux/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_custom_diffusion.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1', '--instance_data_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/Anas', '--output_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/custom_diffusion_anas', '--class_data_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/custom_diffusion_anas/class_prior', '--with_prior_preservation', '--prior_loss_weight=1.0', '--class_prompt=person', '--num_class_images=200', '--instance_prompt=photo of a <new1> person', '--resolution=512', '--train_batch_size=2', '--learning_rate=5e-6', '--lr_warmup_steps=0', '--max_train_steps=1200', '--freeze_model=crossattn', '--scale_lr', '--hflip', '--use_8bit_adam', '--gradient_checkpointing', '--enable_xformers_memory_efficient_attention', '--modifier_token', '<new1>', '--validation_prompt=<new1> person sitting in a bucket']' returned non-zero exit status 1.
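The failing frame is attention_processor.py line 1429, which casts only the output of to_q_custom_diffusion back to the UNet's dtype, while the fp16 hidden states are fed straight into the fp32 custom-diffusion projection. A hypothetical local patch (an untested sketch, not the library's fix) would cast the input as well:

# sketch: cast the fp16 hidden states to the custom layer's fp32 dtype before the matmul
query = self.to_q_custom_diffusion(
    hidden_states.to(self.to_q_custom_diffusion.weight.dtype)
).to(attn.to_q.weight.dtype)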
System Info
- diffusers version: 0.26.1
- Platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Huggingface_hub version: 0.20.3
- Transformers version: 4.37.0
- Accelerate version: 0.25.0
- xFormers version: 0.0.24
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help?
@rezkanas @shinnosukeono, sorry for the delayed response. I believe the error might be caused by float16 training. Without it, I am able to train with --freeze_model="crossattn" as well. I will see if I can update the code to support float16 with full crossattn fine-tuning too; in the meantime, disabling that should work. Thanks.

Will try to reproduce.
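Until a fix lands, a minimal sketch of the suggested workaround is to launch without mixed precision (assuming your accelerate config otherwise defaults to fp16):

accelerate launch --mixed_precision="no" train_custom_diffusion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --freeze_model=crossattn \
  ...  # remaining flags as in the Reproduction section above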