diffusers: [BUG] train_text_to_image_lora.py does not support multi-node or multi-GPU training.
In `train_text_to_image_lora.py`, I notice that the LoRA parameters are extracted into an `AttnProcsLayers` class (line 518):

```python
lora_layers = AttnProcsLayers(unet.attn_processors)
```

And it is only the `lora_layers` that is wrapped by DistributedDataParallel in the following code (line 670):

```python
lora_layers, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    lora_layers, optimizer, train_dataloader, lr_scheduler
)
```

In the training process, it seems that the `lora_layers` are not explicitly used; only the `unet` is called (line 776):

```python
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
```
My question is: when using multiple GPUs or multiple machines, will the gradients be successfully averaged across all processes in the above setup? It is true that in each process the gradients are backpropagated to `unet.attn_processors`, and these gradients are shared with `lora_layers`, so we can use the `optimizer` to update the weights. However, since the forward pass actually goes through `unet.attn_processors` and not through the wrapped `lora_layers`, can the gradients be correctly averaged? From here, it seems that a wrapped module has a different forward compared to its original forward operation.
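To make my concern concrete, here is a toy sketch of how I understand the DDP wrapper (this is only my own illustration, not code from the training script): as far as I know, gradients are only averaged across ranks when the forward pass goes through the object returned by the wrapper.

```python
# Toy sketch (not the diffusers script) of why calling the DDP wrapper's
# forward matters. Launch with e.g. `torchrun --nproc_per_node=2 ddp_demo.py`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group("gloo")      # CPU-friendly backend for the demo
    rank = dist.get_rank()

    model = nn.Linear(4, 1)              # stand-in for the trainable layers
    ddp_model = DDP(model)               # DDP broadcasts weights from rank 0

    torch.manual_seed(rank)              # make each rank see different data
    x, y = torch.randn(8, 4), torch.randn(8, 1)

    # Forward through the DDP wrapper, so backward() averages gradients
    # across ranks and every rank ends up with the same .grad values.
    loss = nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()
    print(rank, model.weight.grad.sum().item())  # same value on every rank

    # If we instead ran `model(x)` (the unwrapped module), DDP's reducer is
    # never prepared for the backward pass, so, as I understand it, each rank
    # keeps its own gradients -- the failure mode asked about above.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```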
I am not very familiar with the `torch.nn.parallel.DistributedDataParallel` wrapper, and I worry that the current code in `train_text_to_image_lora.py` will lead to different LoRA weights in different processes (if the gradients fail to be synchronized across processes).
Hope to find some help here, thank you.
About this issue
- State: closed
- Created a year ago
- Comments: 31 (17 by maintainers)
I got it!

This should be fixed by passing multiple models with `accelerator.accumulate`, yes @hkunzhe.

I have a solution that is not elegant but works: wrapping everything in another network. Only use this module with accelerate, and in the training loop call this module instead of the original code. This originated as a solution for training the text encoder and unet simultaneously for dreambooth. You could look here: https://github.com/huggingface/accelerate/issues/668#issuecomment-1614043548
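A rough sketch of that wrapper idea (the class name and forward signature are my own, not taken from the linked thread): bundle every model that needs gradients into one `nn.Module`, prepare that single module with accelerate, and call it in the training loop.

```python
# Sketch of the wrapper workaround described above; names are illustrative,
# not the code from the linked issue.
import torch.nn as nn


class TrainableModels(nn.Module):
    """Bundles every trainable model so accelerate wraps a single object."""

    def __init__(self, unet, text_encoder):
        super().__init__()
        self.unet = unet
        self.text_encoder = text_encoder

    def forward(self, noisy_latents, timesteps, input_ids):
        # Because the whole forward pass runs through this module (the object
        # returned by accelerator.prepare), DDP's gradient hooks are triggered.
        encoder_hidden_states = self.text_encoder(input_ids)[0]
        return self.unet(noisy_latents, timesteps, encoder_hidden_states).sample


# In the training script, roughly:
#   wrapper = TrainableModels(unet, text_encoder)
#   wrapper, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
#       wrapper, optimizer, train_dataloader, lr_scheduler
#   )
#   model_pred = wrapper(noisy_latents, timesteps, batch["input_ids"])
```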
All of them should be put into `prepare`, specifically all the ones that expect to have their gradients updated. Those same ones should then also be passed to `accumulate`. You can send both into `prepare` at the same time (see the sketch at the end of this thread).

Hi, yes @WindVChen, this is correct. The issue is that `accelerator.prepare` works by wrapping the passed class in DDP, and we are then supposed to call the returned DDP class. Similarly, accelerate's mixed precision works by monkey-patching the forward method of the passed-in class.
Any script where we use the `AttnProcsLayers` class will not work properly with accelerate, because that class just holds the given parameters but isn't actually used as part of the model.

I fixed this for the dreambooth lora script here: https://github.com/huggingface/diffusers/pull/3778

We should really remove the `AttnProcsLayers` class and always pass the top-level model to `accelerate.prepare`. I'm going to open an issue documenting this better, but unfortunately I can't get to it right away, as these cross-training-script refactors are relatively involved.
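For completeness, here is a minimal runnable sketch of the pattern recommended in the comments above, with toy modules standing in for the unet and text encoder (none of this is taken from PR #3778; it only illustrates passing every trained top-level model to both `prepare` and `accumulate`):

```python
# Minimal sketch of the recommended pattern: prepare the top-level models whose
# parameters receive gradients, and pass those same models to accumulate.
import torch
import torch.nn as nn
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=2)

unet_like = nn.Linear(8, 8)          # stand-in for the top-level unet
text_encoder_like = nn.Linear(8, 8)  # stand-in for the text encoder
params = list(unet_like.parameters()) + list(text_encoder_like.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)

unet_like, text_encoder_like, optimizer, dataloader = accelerator.prepare(
    unet_like, text_encoder_like, optimizer, dataloader
)

for batch in dataloader:
    # Recent accelerate versions accept multiple models in accumulate(); with
    # older versions, a single wrapper module is the safer route.
    with accelerator.accumulate(unet_like, text_encoder_like):
        pred = unet_like(text_encoder_like(batch))
        loss = pred.pow(2).mean()        # placeholder loss for the toy demo
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```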