diffusers: save_state gets stuck with the DeepSpeed backend when training train_text_to_image_lora
Describe the bug
When using the DeepSpeed backend, training runs fine but the process gets stuck in `accelerator.save_state(save_path)`. With the MULTI_GPU backend, everything works as expected.
The training command is:
accelerate launch train_text_to_image_lora.py \
--pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/" \
--dataset_name="lambdalabs/pokemon-blip-captions" \
--output_dir="sd-pokemon-model-lora" \
--resolution=512 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=100 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_epochs=50 \
--seed="0" \
--checkpointing_steps 50 \
--train_batch_size=1 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention
Reproduction
MULTI_GPU backend (xx/accelerate/default_config.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
Logs:
03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ - Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ - Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:31, 4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps: 20%|████████████████▌ | 100/500 [00:22<01:21, 4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100
DeepSpeed backend (xx/accelerate/default_config.yaml):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false
Note that I have commented out `self._checkpoint_tag_validation(tag)` in DeepSpeed's runtime/engine.py, otherwise the process gets stuck at that call instead.
With that line commented out, the logs are:
03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ - Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ - Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ - Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ - Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ - Total optimization steps = 500
Steps: 10%|████████▎ | 50/500 [00:11<01:36, 4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...
It then gets stuck in deepspeed/runtime/engine.py:
# save_checkpoint
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12
if self.save_zero_checkpoint:
    self._create_zero_checkpoint_files(save_dir, tag)
    self._save_zero_checkpoint(save_dir, tag)
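As a minimal illustration of what appears to be happening (this is illustrative code, not the DeepSpeed source): both the checkpoint tag validation and the ZeRO optimizer-state save involve cross-rank communication, so when `accelerator.save_state` is wrapped in `if accelerator.is_main_process`, rank 0 enters a collective call that the other ranks never reach and waits forever. The file name and launch command in the comments are arbitrary, chosen only for this sketch.

```python
# Minimal sketch of the deadlock (not DeepSpeed code): a collective operation
# that only rank 0 reaches blocks until the other ranks post the matching
# call, which they never do. This mirrors what happens when only the main
# process calls accelerator.save_state with the DeepSpeed backend.
# Launch with e.g. `torchrun --nproc_per_node=2 deadlock_demo.py`.
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="gloo")  # gloo so it also runs without GPUs
    rank = dist.get_rank()
    flag = torch.ones(1)

    if rank == 0:
        # Rank 0 waits here indefinitely; ranks 1..N never call all_reduce.
        dist.all_reduce(flag)

    # With more than one process, rank 0 never reaches this line.
    print(f"rank {rank} finished")


if __name__ == "__main__":
    main()
```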
Logs
No response
System Info
- OS: Ubuntu 20.04
- GPU: Nvidia GTX 3090
- CUDA version: 11.7
- Torch: 1.13.1
- Diffusers: 0.15.0.dev0
- DeepSpeed: 0.8.1
- xformers: 0.0.17.dev466
- accelerate: 0.16.0
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 4
- Comments: 15 (5 by maintainers)
@better629 Thank you for your solution. When I comment out the `accelerator.is_main_process` check, I meet another error in the `save_model_hook` function. Do you have any ideas?

Traceback (most recent call last):
  File "train_text_to_image_dit.py", line 857, in <module>
    main()
  File "train_text_to_image_dit.py", line 824, in main
    accelerator.save_state(save_path)
  File "/new_share/dengjincan/conda/envs/diffusers2/lib/python3.8/site-packages/accelerate/accelerator.py", line 2026, in save_state
    hook(self._models, weights, output_dir)
  File "train_text_to_image_dit.py", line 494, in save_model_hook
    weights.pop()
IndexError: pop from empty list
@patrickvonplaten I have checked how DeepSpeed uses `save_state` in its own examples, and found that `accelerator.save_state` does not need to be restricted to the main process with `if accelerator.is_main_process`.
So I updated the save code in train_text_to_image_lora.py to drop the `is_main_process` check (a sketch of the change is shown below), and training and saving then succeeded with DeepSpeed ZeRO stage 0 or 2.
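A sketch of that change, assuming the usual checkpointing block in train_text_to_image_lora.py (identifiers such as `os`, `args`, `global_step`, `logger`, and `accelerator` come from the training script, and the exact surrounding code may differ from the version used above):

```python
# Before: only the main process saves, which deadlocks with DeepSpeed because
# the other ranks never join the engine's collective checkpoint calls.
if global_step % args.checkpointing_steps == 0:
    if accelerator.is_main_process:
        save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
        accelerator.save_state(save_path)
        logger.info(f"Saved state to {save_path}")

# After: every process calls save_state; Accelerate/DeepSpeed decide which
# rank writes which file.
if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(save_path)
    logger.info(f"Saved state to {save_path}")
```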
It's a little confusing, because the usual advice is to save the model only from the main process in distributed training mode, and DeepSpeed handles this differently.

I ran into the same `IndexError: pop from empty list` issue with the provided example script train_dreambooth.py under the ./examples/dreambooth directory in this repository. All I ended up doing to fix it was to add a check that `weights` is not empty before popping it, and everything ran great after that. So basically the `save_model_hook` function in train_dreambooth.py changed as sketched below.
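A simplified sketch of that fix (the loop body stands in for whatever your copy of `save_model_hook` already does to save each model; only the `pop` changes):

```python
def save_model_hook(models, weights, output_dir):
    for model in models:
        # ... save `model` into `output_dir` exactly as the script already does ...

        # `weights` can be empty (for example when the DeepSpeed engine owns
        # the state dicts), so popping unconditionally raises IndexError.
        # Before: weights.pop()
        # After: only pop when there is something to pop.
        if weights:
            weights.pop()
```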
Hope this may help anyone else running into the same issue 😃
@patrickvonplaten the DeepSpeed saving-training-checkpoints documentation mentions that "all processes must call this method and not just the process with rank 0", so there is no need to guard the save with `accelerator.is_main_process` before saving state.

@kzwang001 Have you solved it? I met the same problem as well.