diffusers: [BUG] train_text_to_image_lora.py does not support multi-node or multi-GPU training.

In train_text_to_image_lora.py, I noticed that the LoRA parameters are extracted into an AttnProcsLayers class:

518    lora_layers = AttnProcsLayers(unet.attn_processors)

And it is only the lora_layers that is wrapped by DistributedDataParallel in the following code:

670    lora_layers, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
           lora_layers, optimizer, train_dataloader, lr_scheduler
       )

In the training loop, however, lora_layers is never called explicitly; only the unet is used for the forward pass:

776    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

My question is: when training on multiple GPUs or multiple machines, will the gradients be correctly averaged across all processes in this setup?

It is true that in each process the gradients are backpropagated into unet.attn_processors, and because lora_layers shares those parameters, the optimizer can still update the weights. However, the forward pass goes through unet.attn_processors rather than the DDP-wrapped lora_layers, so can the gradients still be correctly averaged? From what I have read, a wrapped module has a different forward than the original module's forward, and I believe the cross-process gradient synchronization is driven from the wrapper's forward.
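To make the concern concrete, here is a minimal sketch (not from the script; a plain nn.Linear stands in for the real model, and it assumes torch.distributed has already been initialized in each process) of the difference between calling the DDP wrapper and calling the inner module directly:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) was already called and
# each process uses its own GPU.
model = nn.Linear(8, 8).cuda()
ddp_model = DDP(model)

x = torch.randn(4, 8).cuda()

# Path 1: forward through the DDP wrapper. DDP's own forward sets up the
# cross-process gradient averaging that fires during backward().
ddp_model(x).sum().backward()

# Path 2: forward through the inner module, bypassing DDP's forward. This is
# effectively what the script does: the unwrapped unet runs the forward while
# only lora_layers was wrapped.
model(x).sum().backward()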

I am not very familiar with the torch.nn.parallel.DistributedDataParallel wrapper, but I worry that the current code in train_text_to_image_lora.py may end up with different LoRA weights in different processes if the gradients are never synchronized across them.

Hope to find some help here, thank you.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 31 (17 by maintainers)

Most upvoted comments

This should be fixed by passing multiple models to accelerator.accumulate, yes @hkunzhe
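For reference, a hedged sketch of that pattern (assuming a recent accelerate version where accumulate accepts several models, and assuming both the unet and the text encoder are trained, as in the DreamBooth variant discussed below; compute_loss is a hypothetical placeholder):

unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler
)

for batch in train_dataloader:
    # Every model that receives gradient updates goes into accumulate as well.
    with accelerator.accumulate(unet, text_encoder):
        loss = compute_loss(batch)  # hypothetical stand-in for the script's forward/loss code
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()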

I got it!

I have a solution that is not elegant but works: wrap everything in another module.

import torch


class SuperNet(torch.nn.ModuleDict):
    # Register the trainable submodules in this ModuleDict and route the whole
    # forward pass through it, so accelerate/DDP wraps a module that is actually
    # called in the training loop.
    def forward(self, text_encoder, unet, batch, class_labels, noisy_model_input, timesteps):
        # Get the text embedding for conditioning (encode_prompt is the helper
        # defined in the training script).
        encoder_hidden_states = encode_prompt(
            text_encoder,
            batch["input_ids"],
            None,
        )
        # Predict the noise residual.
        return unet(
            noisy_model_input, timesteps, encoder_hidden_states, class_labels=class_labels
        ).sample

Pass only this module to accelerate, and in the training loop call this module instead of the original unet/text encoder calls.
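A hedged usage sketch (the variable names besides SuperNet are taken from or assumed after the DreamBooth script, not exact code):

# Register the trainable submodules so DDP sees their parameters.
supernet = SuperNet({"unet": unet, "text_encoder": text_encoder})

# Prepare the wrapper instead of the bare unet/text encoder.
supernet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    supernet, optimizer, train_dataloader, lr_scheduler
)

for batch in train_dataloader:
    with accelerator.accumulate(supernet):
        # The forward now goes through the accelerate/DDP-wrapped module.
        model_pred = supernet(
            text_encoder, unet, batch, class_labels, noisy_model_input, timesteps
        )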

This originated as a solution for training the text encoder and the unet simultaneously for DreamBooth. You can take a look here: https://github.com/huggingface/accelerate/issues/668#issuecomment-1614043548

All of the models that expect to have their gradients updated should be passed to prepare, and those same models should then also be passed to accumulate.

You can send both into prepare at the same time.

Hi, yes, @WindVChen, this is correct. The issue is that accelerate's prepare works by wrapping the passed module in DDP, and we are then supposed to call the returned DDP module. Similarly, accelerate's mixed precision works by monkey-patching the forward method of the passed-in module.

Any script where we use the AttnProcsLayers class will not work properly with accelerate, because that class only holds the given parameters; it is never actually called as part of the model's forward pass.

I fixed this for the dreambooth lora script here: https://github.com/huggingface/diffusers/pull/3778

We should really remove the AttnProcsLayers class and always pass the top-level model to accelerator.prepare. I'm going to open an issue documenting this better, but unfortunately I can't get to it right away, as these cross-training-script refactors are relatively involved.
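For illustration, a simplified, hedged sketch of that direction (not verbatim from the PR): freeze the unet, install the LoRA attention processors, and pass the whole unet to prepare so the forward call in the loop goes through the wrapper.

# Only the freshly created LoRA parameters require gradients.
unet.requires_grad_(False)
unet.set_attn_processor(lora_attn_procs)  # LoRA processors built as in the script

lora_params = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=args.learning_rate)

# Prepare the top-level unet rather than an AttnProcsLayers holder.
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    unet, optimizer, train_dataloader, lr_scheduler
)

# In the training loop, this call now runs through the accelerate/DDP wrapper.
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample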