diffusers: Running train_text_to_image_lora.py on Colab with a V100 GPU raises ValueError: Attempting to unscale FP16 gradients.

Describe the bug

12/07/2023 07:37:24 - INFO - main - ***** Running training *****
12/07/2023 07:37:24 - INFO - main - Num examples = 833
12/07/2023 07:37:24 - INFO - main - Num Epochs = 72
12/07/2023 07:37:24 - INFO - main - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - main - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - main - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self.unscale_grads(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in unscale_grads
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.

Reproduction

!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb

from huggingface_hub import HfFolder, login

# Log in with the Hugging Face API key
login(token='hf_tlt---------BRqMBjwdi')

# Set the WandB API key
import wandb
wandb.login(key='b6a210-------------7f543c')

# Run the training script
!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
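
Since long flag lists are easy to mangle when pasted (dashes and line continuations in particular), one option is to assemble the same command in Python and print it. This is only a minimal sketch: the helper name `build_launch_command` is hypothetical, only a subset of the flags from the command above is shown, and every flag name is taken from that command.

```python
import shlex


def build_launch_command(script, mixed_precision, options):
    """Assemble an `accelerate launch` command line as a list of arguments.

    `options` maps flag names (without the leading dashes) to values.
    A value of True emits a bare switch such as --center_crop.
    """
    cmd = ["accelerate", "launch", f"--mixed_precision={mixed_precision}", script]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        else:
            cmd.append(f"--{name}={value}")
    return cmd


command = build_launch_command(
    "train_text_to_image_lora.py",
    "fp16",
    {
        "pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
        "dataset_name": "lambdalabs/pokemon-blip-captions",
        "resolution": 512,
        "center_crop": True,
        "train_batch_size": 1,
        "gradient_accumulation_steps": 4,
        "max_grad_norm": 1,
        "validation_prompt": "A pokemon with blue eyes.",
        "seed": 1337,
    },
)

# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a shell (prefix it with "!" in a Colab cell).
print(shlex.join(command))
```

Keeping the arguments in a list also means the command can be handed to `subprocess.run(command)` without any shell quoting.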

Logs

|Timestamp|Level|Message|
|---|---|---|
|Dec 7, 2023, 3:42:20 PM|INFO|Kernel started: 27fdce74-a69a-40c5-989e-8877ec3aa3d0, name: python3|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.2:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:07 PM|INFO|Use Control-C to stop this server and shut down all kernels \(twice to skip confirmation\)\.|
|Dec 7, 2023, 3:42:07 PM|INFO|http://172\.28\.0\.12:9000/|
|Dec 7, 2023, 3:42:07 PM|INFO|Jupyter Notebook 6\.5\.5 is running at:|
|Dec 7, 2023, 3:42:07 PM|INFO|Serving notebooks from local directory: /|
|Dec 7, 2023, 3:42:04 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:04 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:04 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|INFO|google\.colab serverextension initialized\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Authentication of /metrics is OFF, since other authentication is disabled\.|
|Dec 7, 2023, 3:42:03 PM|INFO|Writing notebook server cookie secret to /root/\.local/share/jupyter/runtime/notebook\_cookie\_secret|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/root/\.local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/usr/local/etc/jupyter/jupyter\_notebook\_config\.d/panel-client-jupyter\.json|
|Dec 7, 2023, 3:42:03 PM|WARNING|    	/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.975 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.974 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.973 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.972 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.971 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.970 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.899 NotebookApp\] Loaded config file: /root/\.jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.894 NotebookApp\] Loaded config file: /usr/local/etc/jupyter/jupyter\_notebook\_config\.json|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Looking for jupyter\_notebook\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.890 NotebookApp\] Loaded config file: /etc/jupyter/jupyter\_notebook\_config\.py|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.881 NotebookApp\] Looking for jupyter\_notebook\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /root/\.local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.880 NotebookApp\] Looking for jupyter\_config in /usr/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.877 NotebookApp\] Looking for jupyter\_config in /usr/local/etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.872 NotebookApp\] Looking for jupyter\_config in /etc/jupyter|
|Dec 7, 2023, 3:42:02 PM|WARNING|\[D 07:42:02\.861 NotebookApp\] Searching \['/root/\.jupyter', '/root/\.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'\] for config files|

System Info

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel® Xeon® CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Who can help?

@sayakpaul @patrickvonplaten

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 1
  • Comments: 21 (8 by maintainers)

Most upvoted comments

That is a separate script and you should report a separate issue for that 😃

Please tag @linoytsaban there.

#6119 should fix it.
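
For context, PyTorch's GradScaler raises "Attempting to unscale FP16 gradients" when the gradients it is asked to unscale are themselves FP16, which happens when the trainable parameters are stored in half precision. A workaround often used in mixed-precision fine-tuning is to leave frozen weights in FP16 and upcast only the trainable parameters to FP32 before creating the optimizer. The sketch below illustrates that idea only; the helper name `upcast_trainable_params` and the toy `torch.nn.Linear` model are assumptions for illustration, and this is not necessarily the change made in #6119.

```python
import torch


def upcast_trainable_params(model, dtype=torch.float32):
    """Cast only parameters that require gradients to `dtype`, in place."""
    for param in model.parameters():
        if param.requires_grad and param.dtype != dtype:
            param.data = param.data.to(dtype)


# Toy stand-in for a half-precision model with a small trainable part.
model = torch.nn.Linear(4, 4).half()
model.weight.requires_grad_(False)  # frozen base weight stays in FP16

upcast_trainable_params(model)

print(model.weight.dtype)  # frozen parameter
print(model.bias.dtype)    # trainable parameter
```

After this step the optimizer only sees FP32 parameters, so their gradients are FP32 as well. Forward passes that mix FP16 and FP32 tensors generally still rely on autocast being active.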

I ran the script on 4 × A800 GPUs with PyTorch 2.1.1 and CUDA 12.1, and it produced the following error in DiffusionPipeline:

RuntimeError: Input type (c10::Half) and bias type (float) should be the same.

The cause seems consistent with what was pointed out in #4796, which modified the SDXL pipelines so that vae.dtype and latents.dtype match. What worked for me was changing Line 861 to StableDiffusionPipeline and modifying Lines 957-978 of pipeline_stable_diffusion.py to follow StableDiffusionXLPipeline:

        if not output_type == "latent":
            # image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
            #     0
            # ]
            # ==== script from StableDiffusionXLPipeline  ====
            # make sure the VAE is in float32 mode, as it overflows in float16
            needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast

            if needs_upcasting:
                # self.upcast_vae()
                latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)

            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]

            # cast back to fp16 if needed
            # if needs_upcasting:
            #     self.vae.to(dtype=torch.float16)
            # ==== script from StableDiffusionXLPipeline  ====

            image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
        else:
            image = latents
            has_nsfw_concept = None

I’m not sure whether this error can be addressed without modifying the source code.

I ran into the same issue (but on SDXL, training a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55fb36acc862254cf935826d54349b0fcd8c).

This fixed it.

Make sure to also uninstall peft, otherwise it raises "AttributeError: 'Linear' object has no attribute 'set_lora_layer'".

I could not get past this issue. (Screenshot, 2023-12-12 at 3:31:12 PM.) I used the same Colab for this. It happens because load_dataset() from the datasets library is not given the caption file.

@haofanwang do let me know if this is not an error for you. How did you get past this? Can you share your Colab link?

This error still persists.

Also, the dataset creation does not accept data_files as an input, so the caption file is missing when it is passed as metadata.jsonl.

Here is the code in question, in "train_dreambooth_lora_sdxl.py":

dataset = load_dataset(
    args.dataset_name,
    args.dataset_config_name,
    cache_dir=args.cache_dir,
)

It shows the error:

ValueError: `--caption_column` value 'text' not found in dataset columns. Dataset columns are: image.

@sayakpaul can you check this? I tried your latest fix branch too, and it fails here with the above error because the datasets library is unable to pick up the metadata.jsonl file.

Correct me if I have misunderstood.

Tried branch: fix/lora-training
@sayakpaul @bekkblando
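
For a local image folder, the `imagefolder` loader in the datasets library can pick up captions from a `metadata.jsonl` file placed next to the images, with a `file_name` column plus extra columns such as `text`. The sketch below writes such a file using only the standard library. The helper name `write_metadata`, the temporary folder and the captions are made up for illustration, and whether the training script ends up loading the folder this way depends on how the dataset arguments are passed.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_dir, captions, caption_column="text"):
    """Write metadata.jsonl mapping each image file name to its caption."""
    metadata_path = Path(image_dir) / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for file_name, caption in captions.items():
            row = {"file_name": file_name, caption_column: caption}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return metadata_path


# Hypothetical folder and captions, for illustration only.
image_dir = Path(tempfile.mkdtemp())
captions = {
    "0001.png": "a photo of a blue toy robot",
    "0002.png": "a photo of a red toy robot",
}
metadata_path = write_metadata(image_dir, captions)

# Read the file back to confirm each row carries the caption column.
rows = [
    json.loads(line)
    for line in metadata_path.read_text(encoding="utf-8").splitlines()
]
print(sorted(rows[0].keys()))
```

The `caption_column` default of `text` matches the column name that the error message above reports as missing.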
