diffusers: Training with train_text_to_image_lora.py on Colab with a V100 GPU fails with ValueError: Attempting to unscale FP16 gradients.

Describe the bug

12/07/2023 07:37:24 - INFO - __main__ - ***** Running training *****
12/07/2023 07:37:24 - INFO - __main__ - Num examples = 833
12/07/2023 07:37:24 - INFO - __main__ - Num Epochs = 72
12/07/2023 07:37:24 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - __main__ - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - __main__ - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
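
For context, PyTorch's GradScaler raises this ValueError when it is asked to unscale gradients that are themselves stored in float16, which typically happens when the trainable parameters have been cast to half precision. The following is a minimal sketch of one common workaround, assuming the trainable (for example LoRA) parameters can be kept in float32 while the frozen weights stay in float16. The helper name `upcast_trainable_params` and the toy two-layer model are hypothetical and are not taken from the training script.

```python
import torch
from torch import nn


def upcast_trainable_params(model: nn.Module, dtype: torch.dtype = torch.float32) -> int:
    """Cast only parameters with requires_grad=True to `dtype`.

    Returns the number of parameters that were cast. Frozen parameters
    are left in their current dtype.
    """
    count = 0
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            param.data = param.data.to(dtype)
            count += 1
    return count


# Toy stand-in (an assumption for illustration): a frozen fp16 "base"
# layer followed by a trainable fp16 "adapter" layer.
base = nn.Linear(4, 4).half()
base.requires_grad_(False)
adapter = nn.Linear(4, 4).half()
model = nn.Sequential(base, adapter)

n_cast = upcast_trainable_params(model)
print(n_cast, adapter.weight.dtype, base.weight.dtype)
```

In a real script, a call like this would probably sit after the trainable layers are added and before the optimizer is created, so that the optimizer holds the float32 parameters.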

Reproduction

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb

from huggingface_hub import HfFolder, login

Log in with the Hugging Face API key

Set the WandB API key

import wandb
wandb.login(key='b6a210-------------7f543c')

Run the training script

!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--dataset_name="lambdalabs/pokemon-blip-captions" \
--dataloader_num_workers=8 \
--resolution=512 \
--center_crop \
--random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--max_grad_norm=1 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--output_dir="/sddata/finetune/lora/pokemon" \
--push_to_hub \
--hub_model_id="pokemon-lora" \
--report_to=wandb \
--checkpointing_steps=500 \
--validation_prompt="A pokemon with blue eyes." \
--seed=1337
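
When a long multi-line shell command is awkward to maintain in a notebook cell, the same arguments can be assembled in Python and handed to `subprocess`. The sketch below is only an illustration: `build_launch_command` is a hypothetical helper, and it covers a subset of the flags from the command above.

```python
import shlex


def build_launch_command(script: str, mixed_precision: str, **options) -> list[str]:
    """Build an `accelerate launch` argv list.

    Options set to True become bare flags (e.g. --center_crop); options
    set to None or False are skipped; everything else becomes --name=value.
    """
    argv = ["accelerate", "launch", f"--mixed_precision={mixed_precision}", script]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is not None and value is not False:
            argv.append(f"--{name}={value}")
    return argv


command = build_launch_command(
    "train_text_to_image_lora.py",
    "fp16",
    pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5",
    dataset_name="lambdalabs/pokemon-blip-captions",
    resolution=512,
    center_crop=True,
    train_batch_size=1,
    gradient_accumulation_steps=4,
    max_grad_norm=1,
    validation_prompt="A pokemon with blue eyes.",
    seed=1337,
)

# Print a shell-quoted form for inspection; subprocess.run(command) would execute it.
print(shlex.join(command))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with values that contain spaces, such as the validation prompt.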

Logs

|Timestamp|Level|Message|
|---|---|---|
|Dec 7, 2023, 3:42:20 PM|INFO|Kernel started: 27fdce74-a69a-40c5-989e-8877ec3aa3d0, name: python3|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.2:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.12:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:04 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.975 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.899 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.881 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.877 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.872 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.861 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|

System Info

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel® Xeon® CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Who can help?

@sayakpaul @patrickvonplaten

About this issue

State: closed
Created 7 months ago
Reactions: 1
Comments: 21 (8 by maintainers)

Commits related to this issue

fix: unscale fp16 gradient problem & potential error (#6086) — committed to lvzii/diffusers by lvzii 6 months ago
fix: unscale fp16 gradient problem & potential error (#6086) (#6231) Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> — committed to huggingface/diffusers by lvzii 6 months ago
fix: unscale fp16 gradient problem & potential error (#6086) (#6231) Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> — committed to donhardman/diffusers by lvzii 6 months ago

Most upvoted comments

That is a separate script and you should report a separate issue for that 😃

Please tag @linoytsaban there.

sayakpaul on Jan 6, 2024

#6119 should fix it.

I ran the script on 4 × A800 GPUs with PyTorch 2.1.1 and CUDA 12.1, and it produced the following error in DiffusionPipeline:

RuntimeError: Input type (c10::Half) and bias type (float) should be the same.

The cause seems consistent with what was pointed out in #4796, which modified the SDXL pipelines so that vae.dtype and latents.dtype match. What worked for me was changing Line 861 to use StableDiffusionPipeline and modifying Lines 957-978 of pipeline_stable_diffusion.py to follow StableDiffusionXLPipeline:

        if not output_type == "latent":
            # image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
            #     0
            # ]
            # ==== script from StableDiffusionXLPipeline  ====
            # make sure the VAE is in float32 mode, as it overflows in float16
            needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast

            if needs_upcasting:
                # self.upcast_vae()
                latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)

            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]

            # cast back to fp16 if needed
            # if needs_upcasting:
            #     self.vae.to(dtype=torch.float16)
            # ==== script from StableDiffusionXLPipeline  ====

            image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
        else:
            image = latents
            has_nsfw_concept = None

I’m not sure whether this error can be fixed without modifying the source code.
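
The core of the change above is casting the latents to whatever dtype the VAE's weights use before decoding. A minimal, self-contained sketch of that idea follows; the `nn.Conv2d` layer is only a toy stand-in for a VAE decoder kept in float32, and the tensor shapes are assumptions chosen for illustration.

```python
import torch
from torch import nn

# Toy stand-in for a VAE decoder whose weights are kept in float32
decoder = nn.Conv2d(in_channels=4, out_channels=3, kernel_size=1)

# Latents as they might come out of a half-precision denoising loop
latents = torch.randn(1, 4, 8, 8).half()

# Match the latents to the dtype of the decoder's parameters before decoding,
# so the input and the weights/bias share one dtype
target_dtype = next(iter(decoder.parameters())).dtype
latents = latents.to(target_dtype)

with torch.no_grad():
    image = decoder(latents)

print(latents.dtype, image.dtype, tuple(image.shape))
```

Reading the dtype from the module's own parameters, rather than hard-coding float32, keeps the cast correct whichever precision the decoder happens to be in.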

ancient-blue on Dec 13, 2023

I ran into the same issue (but on SDXL, making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55f).

This fixed it.

Make sure to also uninstall peft, otherwise it raises "AttributeError: 'Linear' object has no attribute 'set_lora_layer'".

yaneq on Dec 12, 2023

I could not get past this issue. [Screenshot 2023-12-12 at 3 31 12 PM] I used the same Colab for this. The cause is that the datasets library is not given the caption file when load_dataset() is called.

@haofanwang do let me know if this is not an error for you. How did you get past this? Can you share your Colab link?

yashveer08 on Dec 12, 2023

This error still occurs.

Also, the dataset creation does not accept data_files as an input, so the caption file is missing when it is supplied as metadata.jsonl.

Here is the code in question, in "train_dreambooth_lora_sdxl.py":

dataset = load_dataset(
    args.dataset_name,
    args.dataset_config_name,
    cache_dir=args.cache_dir,
)

It shows the error:

ValueError: `--caption_column` value 'text' not found in dataset columns. Dataset columns are: image.

@sayakpaul can you check this? I tried your latest fix branch code too, and it fails here with the above error because the datasets library is unable to pick up the metadata.jsonl file.

Correct me if I have misunderstood.

Tried branch: fix/lora-training
@sayakpaul @bekkblando
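
The error message says the loaded dataset only has an image column. For a local image folder, the datasets library's imagefolder loader reads extra columns from a metadata.jsonl file placed next to the images, where each line is a JSON object with a file_name key. As a sketch under that assumption, the snippet below writes such a file with a text caption column. The helper `write_metadata`, the file names, and the captions are hypothetical; whether the training script picks the file up depends on how the dataset is passed to load_dataset, which this sketch does not address.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_dir: Path, captions: dict[str, str], caption_column: str = "text") -> Path:
    """Write one JSON object per image, linking file_name to its caption."""
    metadata_path = image_dir / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as handle:
        for file_name, caption in captions.items():
            record = {"file_name": file_name, caption_column: caption}
            handle.write(json.dumps(record) + "\n")
    return metadata_path


# Hypothetical file names and captions, for illustration only
captions = {
    "0001.png": "a blue creature with wings",
    "0002.png": "a red creature with a tail",
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(Path(tmp), captions)
    # Read the file back to confirm every record carries the caption column
    records = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

print(records)
```

The caption column name written here should match the value passed as `--caption_column`, which is 'text' in the error above.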

yashveer08 on Dec 12, 2023

I ran into the same issue (but on SDXL, making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55fb36acc862254cf935826d54349b0fcd8c).

bekkblando on Dec 7, 2023