diffusers: train_dreambooth_lora.py fails when launched with accelerate

Describe the bug

Thanks for this awesome project! When I run the script "train_dreambooth_lora.py" directly, without accelerate, it works fine. But when I launch it with accelerate, it fails as soon as the step count reaches "checkpointing_steps". I am running the script in a Docker container with 4 * 3090 vGPUs, and accelerate test succeeds. I am new to this and would appreciate any guidance or suggestions you can offer.

Reproduction

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="/diffusers/examples/dreambooth/dunhuang512"
export OUTPUT_DIR="path-to-save-model"
cd /diffusers/examples/dreambooth/
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --logging_dir='./logs' \
  --instance_prompt="dhstyle_test" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="dhstyle_test" \
  --validation_epochs=50 \
  --seed="0"\
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam

Logs


  File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1093, in <module>
    main(args)
  File "/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 972, in main
    LoraLoaderMixin.save_lora_weights(
  File "/diffusers/src/diffusers/loaders.py", line 1111, in save_lora_weights
    for module_name, param in unet_lora_layers.state_dict().items()
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in state_dict
    hook_result = hook(self, destination, prefix, local_metadata)
  File "/diffusers/src/diffusers/loaders.py", line 74, in map_to
    num = int(key.split(".")[1])  # 0 is always "layers"
ValueError: invalid literal for int() with base 10: 'layers'
Steps:  20%|████████████████████▊                                                                                   | 100/500 [03:35<14:20,  2.15s/it, loss=0.217, lr=0.0001]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63642 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63643 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 63644 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 63641) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_lora.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-29_00:59:00
  host      : sd-5b564dfd58-7v76h
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 63641)
  error_file: <N/A>

System Info

  • diffusers version: 0.17.0.dev0

  • Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31

  • Python version: 3.10.9

  • PyTorch version (GPU?): 2.0.0+cu117 (True)

  • Huggingface_hub version: 0.14.0

  • Transformers version: 4.25.1

  • Accelerate version: 0.18.0

  • xFormers version: 0.0.19

  • Using GPU in script?: Yes (4 * 3090 vGPUs in Docker, see above)

  • Using distributed or parallel set-up in script?: Yes (accelerate multi-GPU launch)

  • Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 23 (8 by maintainers)

Most upvoted comments

I'm not sure why the error only occurs with accelerate, but it can be fixed by modifying the following lines here (the map_to hook in AttnProcsLayers, src/diffusers/loaders.py):

class AttnProcsLayers(torch.nn.Module):
    def __init__(self, state_dict: Dict[str, torch.Tensor]):
        super().__init__()
        self.layers = torch.nn.ModuleList(state_dict.values())
        self.mapping = dict(enumerate(state_dict.keys()))
        self.rev_mapping = {v: k for k, v in enumerate(state_dict.keys())}

        # we add a hook to state_dict() and load_state_dict() so that the
        # naming fits with `unet.attn_processors`
        def map_to(module, state_dict, *args, **kwargs):
            new_state_dict = {}
            for key, value in state_dict.items():
                layer_index = 2 if 'module' in key else 1  # <-- add this line
                num = int(key.split(".")[layer_index])  # the layer number follows "layers" (and "module." when DDP-wrapped)
                new_key = key.replace(f"layers.{num}", module.mapping[num])
                new_state_dict[new_key] = value

            return new_state_dict

This is because the state_dict keys look like module.layers.0.to_q_lora.down.weight rather than layers.0.to_q_lora.down.weight, so the second element of the key is the string "layers" and cannot be converted to an int. I guess the latest version does not account for this when running with accelerate.
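
A minimal single-process sketch of where the prefix comes from. It does not use real DistributedDataParallel, just a stand-in wrapper that registers the inner model under an attribute named module, which is all it takes to change the keys:

import torch

# Stand-in for the DDP wrapper accelerate puts around the prepared model:
# registering the inner module under an attribute named "module" is enough
# to prefix every state_dict key with "module.", which is what makes
# int(key.split(".")[1]) blow up on the string "layers".
class DDPLike(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.module = inner

lora_layers = torch.nn.Module()
lora_layers.layers = torch.nn.ModuleList([torch.nn.Linear(4, 4)])

print(list(lora_layers.state_dict().keys()))
# ['layers.0.weight', 'layers.0.bias']
print(list(DDPLike(lora_layers).state_dict().keys()))
# ['module.layers.0.weight', 'module.layers.0.bias']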

This hook is triggered from here when we run the code with accelerate (the checkpointing block in train_dreambooth_lora.py):

                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        # We combine the text encoder and UNet LoRA parameters with a simple
                        # custom logic. `accelerator.save_state()` won't know that. So,
                        # use `LoraLoaderMixin.save_lora_weights()`.
                        LoraLoaderMixin.save_lora_weights(
                            save_directory=save_path,
                            unet_lora_layers=unet_lora_layers,
                            text_encoder_lora_layers=text_encoder_lora_layers,
                        )
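
(As an aside, a possible workaround at this call site, sketched on the assumption that unwrapping the accelerate-prepared layers before saving removes the DDP wrapper and hence the module. prefix. accelerator.unwrap_model is a standard accelerate call, and the variable names are the ones already used in train_dreambooth_lora.py, so this is a drop-in replacement for the block above, not standalone code:)

                    if accelerator.is_main_process:
                        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
                        # Sketch: unwrap before saving so the state_dict keys
                        # never carry the "module." prefix added by DDP.
                        # unwrap_model is a no-op for objects that were never wrapped.
                        LoraLoaderMixin.save_lora_weights(
                            save_directory=save_path,
                            unet_lora_layers=accelerator.unwrap_model(unet_lora_layers),
                            text_encoder_lora_layers=(
                                accelerator.unwrap_model(text_encoder_lora_layers)
                                if text_encoder_lora_layers is not None
                                else None
                            ),
                        )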

So, to also strip the module. prefix from the rewritten keys, the map_to hook should be:

        def map_to(module, state_dict, *args, **kwargs):
            new_state_dict = {}
            for key, value in state_dict.items():
                # num = int(key.split(".")[layer_index])  # 0 is always "layers"
                # new_key = key.replace(f"layers.{num}", module.mapping[num])
                if 'module' in key:
                    num = int(key.split(".")[2]) 
                    replace_key = f"module.layers.{num}"
                else: 
                    num = int(key.split(".")[1]) 
                    replace_key = f"layers.{num}"
                new_key = key.replace(replace_key, module.mapping[num])
                new_state_dict[new_key] = value
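
A quick standalone check of what the patched rewriting does; the mapping entry below is made up for illustration (in AttnProcsLayers it maps the layer index to the attention-processor name from unet.attn_processors):

# Toy reproduction of the patched map_to logic on a single key.
mapping = {0: "down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor"}

def rewrite(key):
    if "module" in key:
        num = int(key.split(".")[2])
        replace_key = f"module.layers.{num}"
    else:
        num = int(key.split(".")[1])
        replace_key = f"layers.{num}"
    return key.replace(replace_key, mapping[num])

print(rewrite("layers.0.to_q_lora.down.weight"))
print(rewrite("module.layers.0.to_q_lora.down.weight"))
# Both print the same remapped key:
# down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor.to_q_lora.down.weight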

With that change the keys are mapped back to their original names, so you can load pytorch_lora_weights.bin correctly.
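
For completeness, a usage sketch for loading such a checkpoint back for inference. It assumes the pipeline-level load_lora_weights counterpart of the save_lora_weights call above, and reuses the paths and prompt from the reproduction command:

import torch
from diffusers import StableDiffusionPipeline

# Load the saved LoRA checkpoint back onto the base model. The directory
# name follows the --output_dir and --checkpointing_steps values used
# above; adjust it to your own run.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path-to-save-model/checkpoint-100")

image = pipe("dhstyle_test", num_inference_steps=30).images[0]
image.save("dhstyle_test.png")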

Unfortunately, I don't know what's behind it. I'm not using DeepSpeed, and never have. I've also tried a large mix of accelerate and diffusers versions, and am currently on the latest source builds of both. I'm using the latest dreambooth-lora script, and it fails as soon as it tries to save a checkpoint.

I can add that the previous fix above didn’t actually work for me this time, and I had to expand it to this:

def map_to(module, state_dict, *args, **kwargs):
    new_state_dict = {}
    for key, value in state_dict.items():
        print("key:" + key)
        key_parts = key.split(".")
        if key_parts[0] == '_orig_mod':
            num = int(key_parts[2])
            replace_key = f"{key_parts[0]}.{key_parts[1]}.{num}"
        elif 'module' in key:
            num = int(key_parts[2])
            replace_key = f"{key_parts[0]}.{key_parts[1]}.{num}"
        else:
            num = int(key_parts[1])
            replace_key = f"{key_parts[0]}.{num}"
        print("replace_key:" + replace_key)
        new_key = key.replace(replace_key, module.mapping[num])
        print("new_key:" + new_key)
        new_state_dict[new_key] = value

So it's related to these key mappings, but why I hit it and someone else doesn't, I have no clue. As mentioned, it doesn't happen if I don't run with accelerate.
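
If it helps narrow it down: the extra _orig_mod prefix in that last case looks like what torch.compile's OptimizedModule wrapper adds to state_dict keys, in the same way DDP adds module.. A minimal sketch (assuming PyTorch 2.0+):

import torch

# torch.compile returns an OptimizedModule that keeps the original module
# under the attribute _orig_mod, so every state_dict key gains an
# "_orig_mod." prefix -- the case the expanded map_to above handles.
lin = torch.nn.Linear(4, 4)
compiled = torch.compile(lin)

print(list(lin.state_dict().keys()))       # ['weight', 'bias']
print(list(compiled.state_dict().keys()))  # ['_orig_mod.weight', '_orig_mod.bias']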