diffusers: Training with train_text_to_image_lora.py on Colab with a V100 GPU fails with ValueError: Attempting to unscale FP16 gradients.
Describe the bug
12/07/2023 07:37:24 - INFO - __main__ - ***** Running training *****
12/07/2023 07:37:24 - INFO - __main__ - Num examples = 833
12/07/2023 07:37:24 - INFO - __main__ - Num Epochs = 72
12/07/2023 07:37:24 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - __main__ - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - __main__ - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
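The last frame of the first traceback comes from PyTorch's GradScaler, which refuses to unscale gradients that are stored in FP16. That usually means some trainable parameters are themselves held in half precision. A common workaround is to keep only the trainable (for example LoRA) parameters in FP32 while the frozen weights stay in FP16. The helper below is a minimal sketch of that idea, not the fix that was merged for this issue; `upcast_trainable_params` is a hypothetical name, and it relies only on the `parameters()`, `requires_grad` and `.data.to(dtype)` interface that PyTorch modules and tensors expose.

```python
def upcast_trainable_params(models, dtype):
    """Cast every trainable parameter of `models` to `dtype`, in place.

    `models` is a single module or a list of modules exposing `parameters()`
    (as torch.nn.Module does). Frozen parameters are left untouched.
    Returns the number of parameters that were cast.
    Hypothetical helper, for illustration only.
    """
    if not isinstance(models, (list, tuple)):
        models = [models]
    cast_count = 0
    for model in models:
        for param in model.parameters():
            if param.requires_grad:
                # Replace the underlying tensor so gradients are produced in `dtype`.
                param.data = param.data.to(dtype)
                cast_count += 1
    return cast_count


# Assumed usage in a training script, before the optimizer is created
# (`unet` and `args` are the script's own objects; torch must be imported):
#
#     if args.mixed_precision == "fp16":
#         upcast_trainable_params(unet, torch.float32)
```

Casting only the parameters with `requires_grad=True` keeps the memory savings of FP16 for the frozen base model while giving the scaler FP32 gradients to unscale.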
Reproduction
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb
from huggingface_hub import HfFolder, login
# Log in with the Hugging Face API key
login(token='hf_tlt---------BRqMBjwdi')
# Set the WandB API key
import wandb
wandb.login(key='b6a210-------------7f543c')
# Run the training script
!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
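The dashes and quotes in a long command like this are easy to mangle when it is copied between a notebook and a web page. As a hedged alternative, the same invocation can be assembled in Python and handed to subprocess as a list, which needs no shell quoting or line continuations. `build_launch_command` is a hypothetical helper; the flags and values are a subset of those in the command above, shortened for brevity.

```python
import shlex


def build_launch_command(script, options, mixed_precision="fp16"):
    """Return an argv list for `accelerate launch`.

    `options` maps flag names (without dashes) to values; a value of True
    produces a bare flag such as --center_crop.
    """
    cmd = ["accelerate", "launch", f"--mixed_precision={mixed_precision}", script]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        else:
            cmd.append(f"--{name}={value}")
    return cmd


options = {
    "pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
    "dataset_name": "lambdalabs/pokemon-blip-captions",
    "resolution": 512,
    "center_crop": True,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "max_train_steps": 15000,
    "validation_prompt": "A pokemon with blue eyes.",
    "seed": 1337,
}
cmd = build_launch_command("train_text_to_image_lora.py", options)

# Shell-safe rendering, useful for checking the flags before running.
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, check=True)
```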
Logs
|Timestamp|Level|Message|
|---|---|---|
|Dec 7, 2023, 3:42:20 PM|INFO|Kernel started: 27fdce74-a69a-40c5-989e-8877ec3aa3d0, name: python3|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.2:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.12:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:04 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:04 PM|WARNING| /root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING| /etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:03 PM|WARNING| /root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING| /etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.975 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.899 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.881 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.877 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.872 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.861 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
System Info
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel® Xeon® CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Who can help?
About this issue
- State: closed
- Created 7 months ago
- Reactions: 1
- Comments: 21 (8 by maintainers)
Commits related to this issue
- fix: unscale fp16 gradient problem & potential error (#6086) — committed to lvzii/diffusers by lvzii 6 months ago
- fix: unscale fp16 gradient problem & potential error (#6086) (#6231) Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> — committed to huggingface/diffusers by lvzii 6 months ago
- fix: unscale fp16 gradient problem & potential error (#6086) (#6231) Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> — committed to donhardman/diffusers by lvzii 6 months ago
That is a separate script and you should report a separate issue for that 😃
Please tag @linoytsaban there.
I ran the script on 4 * A800 GPUs with PyTorch 2.1.1 and CUDA 12.1, and it produced the following error in DiffusionPipeline:
RuntimeError: Input type (c10::Half) and bias type (float) should be the same.
The cause seems consistent with what was pointed out in #4796, which modified the SDXL pipelines so that vae.dtype and latents.dtype match. It works for me to change Line 861 to StableDiffusionPipeline and to modify pipeline_stable_diffusion.py Lines 957-978 to follow StableDiffusionXLPipeline.
I'm not sure whether this error can be addressed without modifying the source code.
This fixed it.
Make sure to also uninstall peft, otherwise it raises "AttributeError: 'Linear' object has no attribute 'set_lora_layer'".
Could not get past this issue.
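The peft comment above reports that the AttributeError depends on whether peft is installed. A quick, non-authoritative way to see whether peft is importable in the current environment, using only the standard library (`is_installed` is a hypothetical helper name):

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("peft"):
    print("peft is installed; the comment above suggests uninstalling it.")
else:
    print("peft is not installed.")
```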
I used the same Colab for this. The error occurs because load_dataset() from the datasets library is not given the caption file.
@haofanwang do let me know if this is not an error for you. How did you get past this? Can you share your Colab link?
This error still persists.
Also, the dataset creation does not take data_files as an input, so the caption file is missing when it is passed as metadata.jsonl.
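For a local image folder, the datasets library's imagefolder loader reads captions from a metadata.jsonl file placed next to the images, with one JSON object per line and a file_name key. The sketch below writes such a file. The caption key "text" and the example file names are assumptions; the key should match whatever caption column the training script is told to use.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_dir, captions, caption_key="text"):
    """Write metadata.jsonl into `image_dir`.

    `captions` maps image file names (relative to `image_dir`) to captions.
    Returns the path of the written file.
    """
    path = Path(image_dir) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for file_name, caption in captions.items():
            record = {"file_name": file_name, caption_key: caption}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Example with placeholder file names (the images themselves are not created here).
demo_dir = tempfile.mkdtemp()
meta_path = write_metadata(demo_dir, {
    "0001.png": "A pokemon with blue eyes.",
    "0002.png": "A green pokemon with a leaf on its head.",
})
print(meta_path.read_text(encoding="utf-8"))
```

A folder prepared this way can then be loaded with load_dataset("imagefolder", data_dir=...), provided the image files listed in metadata.jsonl actually exist in that folder.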
Here in this code in "train_dreambooth_lora_sdxl.py"
I ran into the same issue (but on SDXL, making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55fb36acc862254cf935826d54349b0fcd8c).