fast-stable-diffusion: Error when running a training session.
I'm trying to resume network training based on v2.1 768px. Almost immediately I get an error.
Resuming Training...
Training the UNet...
'########:'########:::::'###::::'####:'##::: ##:'####:'##::: ##::'######:::
... ##..:: ##.... ##:::'## ##:::. ##:: ###:: ##:. ##:: ###:: ##:'##... ##::
::: ##:::: ##:::: ##::'##:. ##::: ##:: ####: ##:: ##:: ####: ##: ##:::..:::
::: ##:::: ########::'##:::. ##:: ##:: ## ## ##:: ##:: ## ## ##: ##::'####:
::: ##:::: ##.. ##::: #########:: ##:: ##. ####:: ##:: ##. ####: ##::: ##::
::: ##:::: ##::. ##:: ##.... ##:: ##:: ##:. ###:: ##:: ##:. ###: ##::: ##::
::: ##:::: ##:::. ##: ##:::: ##:'####: ##::. ##:'####: ##::. ##:. ######:::
:::..:::::..:::::..::..:::::..::....::..::::..::....::..::::..:::......::::
2023-02-19 10:19:49.512117: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-19 10:19:53.707580: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708348: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
0% 0/3000 [00:00<?, ?it/s] JrCr JrCr
Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
main()
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 676, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 507, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 339, in forward
sample, res_samples = downsample_block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 637, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 630, in custom_forward
return module(*inputs, return_dict=return_dict)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 213, in forward
hidden_states = self.proj_in(hidden_states)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 4D
0% 0/3000 [00:12<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
Update: training on v1.5 runs without errors.
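The last frames of the traceback show `F.linear` failing inside `proj_in` because `t()` was called on a 4D tensor. `F.linear` transposes the layer's weight, so one plausible reading (an assumption, not confirmed in this thread) is that the loaded checkpoint stores the attention projections as 1x1 convolution weights of shape (out, in, 1, 1) while the model was built with linear projections. A minimal sketch for spotting such weights from a name-to-shape mapping follows; the function name and the sample parameter names are hypothetical.

```python
from typing import List, Mapping, Tuple

Shape = Tuple[int, ...]


def find_conv_shaped_projections(shapes: Mapping[str, Shape]) -> List[str]:
    """Return names of proj_in/proj_out weights that are 4D with a 1x1 kernel."""
    suspects = []
    for name, shape in shapes.items():
        is_projection = name.endswith(("proj_in.weight", "proj_out.weight"))
        if is_projection and len(shape) == 4 and tuple(shape[2:]) == (1, 1):
            suspects.append(name)
    return sorted(suspects)


# Hypothetical parameter names and shapes, for illustration only.
# With PyTorch, a real mapping could be built as:
#   shapes = {k: tuple(v.shape) for k, v in unet.state_dict().items()}
example_shapes = {
    "down_blocks.0.attentions.0.proj_in.weight": (320, 320, 1, 1),
    "down_blocks.0.attentions.0.proj_out.weight": (320, 320),
    "down_blocks.0.attentions.0.norm.weight": (320,),
}

suspects = find_conv_shaped_projections(example_shapes)
print(suspects)
```

Any name it returns points at a weight that an `nn.Linear` layer cannot use as-is, which would match the `t()` error above.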
About this issue
- State: open
- Created a year ago
- Comments: 17 (9 by maintainers)
Training
Training the UNet...
Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
main()
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 436, in main
accelerator = Accelerator(
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 286, in __init__
raise ValueError(err.format(mode="fp16", requirement="a GPU"))
ValueError: fp16 mixed precision requires a GPU
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--image_captions_filename', '--train_only_unet', '--save_starting_step=500', '--save_n_steps=0', '--Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330', '--pretrained_model_name_or_path=/content/stable-diffusion-v1-5', '--instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/instance_images', '--output_dir=/content/models/PuliDADA02241330', '--captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/captions', '--instance_prompt=', '--seed=959221', '--resolution=512', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=5e-06', '--lr_scheduler=linear', '--lr_warmup_steps=0', '--max_train_steps=1500']' returned non-zero exit status 1.
Something went wrong
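The `ValueError` in this second log is a different failure from the `t()` error: `accelerate` was asked for `--mixed_precision=fp16` but did not see a GPU in the runtime. A small pre-flight check, sketched below, can confirm that before launching. The helper names are hypothetical; `torch.cuda.is_available()` is the standard PyTorch call, and `"no"` is the value `accelerate` uses for disabled mixed precision.

```python
def gpu_available() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def choose_mixed_precision(requested: str, has_gpu: bool) -> str:
    """Fall back to "no" when fp16 is requested but no GPU is visible."""
    if requested == "fp16" and not has_gpu:
        return "no"
    return requested


if __name__ == "__main__":
    has_gpu = gpu_available()
    print("GPU visible:", has_gpu)
    print("mixed_precision:", choose_mixed_precision("fp16", has_gpu))
```

If the check reports no GPU, the more useful step is usually to attach a GPU to the runtime rather than to train without mixed precision, since the rest of the command (for example `--use_8bit_adam`) also assumes one.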