fast-stable-diffusion: Error when running a training session.

Trying to resume network training based on v2.1 768px. Trying to resume network training based on v2.1 768px. Almost immediately I get an error.

Resuming Training...
Training the UNet...
'########:'########:::::'###::::'####:'##::: ##:'####:'##::: ##::'######:::
... ##..:: ##.... ##:::'## ##:::. ##:: ###:: ##:. ##:: ###:: ##:'##... ##::
::: ##:::: ##:::: ##::'##:. ##::: ##:: ####: ##:: ##:: ####: ##: ##:::..:::
::: ##:::: ########::'##:::. ##:: ##:: ## ## ##:: ##:: ## ## ##: ##::'####:
::: ##:::: ##.. ##::: #########:: ##:: ##. ####:: ##:: ##. ####: ##::: ##::
::: ##:::: ##::. ##:: ##.... ##:: ##:: ##:. ###:: ##:: ##:. ###: ##::: ##::
::: ##:::: ##:::. ##: ##:::: ##:'####: ##::. ##:'####: ##::. ##:. ######:::
:::..:::::..:::::..::..:::::..::....::..::::..::....::..::::..:::......::::

2023-02-19 10:19:49.512117: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-19 10:19:53.707580: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708294: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-02-19 10:19:53.708348: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
  0% 0/3000 [00:00<?, ?it/s] JrCr   JrCr  Traceback (most recent call last):
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 789, in <module>
    main()
  File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 676, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 339, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 637, in forward
    hidden_states = torch.utils.checkpoint.checkpoint(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 630, in custom_forward
    return module(*inputs, return_dict=return_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 213, in forward
    hidden_states = self.proj_in(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 4D
  0% 0/3000 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

upd: There are no errors during training on v1.5.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Training

Training the UNet… Traceback (most recent call last): File “/content/diffusers/examples/dreambooth/train_dreambooth.py”, line 789, in <module> main() File “/content/diffusers/examples/dreambooth/train_dreambooth.py”, line 436, in main accelerator = Accelerator( File “/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py”, line 286, in init raise ValueError(err.format(mode=“fp16”, requirement=“a GPU”)) ValueError: fp16 mixed precision requires a GPU Traceback (most recent call last): File “/usr/local/bin/accelerate”, line 8, in <module> sys.exit(main()) File “/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py”, line 43, in main args.func(args) File “/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py”, line 837, in launch_command simple_launcher(args) File “/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py”, line 354, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command ‘[’/usr/bin/python3’, ‘/content/diffusers/examples/dreambooth/train_dreambooth.py’, ‘–image_captions_filename’, ‘–train_only_unet’, ‘–save_starting_step=500’, ‘–save_n_steps=0’, ‘–Session_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330’, ‘–pretrained_model_name_or_path=/content/stable-diffusion-v1-5’, ‘–instance_data_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/instance_images’, ‘–output_dir=/content/models/PuliDADA02241330’, ‘–captions_dir=/content/gdrive/MyDrive/Fast-Dreambooth/Sessions/PuliDADA02241330/captions’, ‘–instance_prompt=’, ‘–seed=959221’, ‘–resolution=512’, ‘–mixed_precision=fp16’, ‘–train_batch_size=1’, ‘–gradient_accumulation_steps=1’, ‘–use_8bit_adam’, ‘–learning_rate=5e-06’, ‘–lr_scheduler=linear’, ‘–lr_warmup_steps=0’, ‘–max_train_steps=1500’]’ returned non-zero exit status 1. Something went wrong