diffusers: CUDA out of memory and invalid value encountered in cast with train_text_to_image_lora_sdxl.py
Describe the bug
I encountered two distinct issues while running train_text_to_image_lora_sdxl.py on the lambdalabs/pokemon-blip-captions example dataset, on an RTX 4090 with bf16 mixed precision.
Problem 1: RuntimeWarning during image processing
During validation image generation, the following warning is raised:
RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
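For context, this warning fires when the decoded image array contains NaN (or inf) values at the point of the uint8 cast, which suggests the VAE is producing non-finite outputs during validation. A minimal sketch that reproduces the same warning with recent NumPy versions (the array here is made up, not taken from the script):

```python
import numpy as np

# A single NaN in the decoded images is enough to trigger the warning on the uint8 cast.
images = np.array([[0.5, np.nan, 1.0]], dtype=np.float32)
out = (images * 255).round().astype("uint8")  # RuntimeWarning: invalid value encountered in cast
```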
Problem 2: CUDA out-of-memory error
Although GPU memory usage stays at roughly 67% throughout training, I hit a CUDA out-of-memory error after training concludes, during the final inference. The error message is as follows:
hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
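As a possible mitigation (untested on my side), the allocator option mentioned in the error message can be set before the first CUDA allocation; a minimal sketch, with an arbitrary split size:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when PyTorch's caching allocator initializes,
# so this has to run before the first CUDA allocation (e.g. at the top of the script
# or exported in the shell before `accelerate launch`). 512 is just a value to try.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```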
Hypothesis:
I suspect that GPU memory is not being fully released before the final test inference step.
I intend to investigate this further on my own and will post updates here. If anyone finds a solution before I do, please share it here as well.
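If that turns out to be the case, explicitly dropping references and clearing PyTorch's CUDA cache between training and the final inference might be enough; a minimal sketch of what I would try (the helper name is mine, not from the script):

```python
import gc
import torch

def report_and_release_vram() -> None:
    """Collect unreachable Python objects and release PyTorch's cached CUDA blocks."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        free, total = torch.cuda.mem_get_info()
        print(f"VRAM free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")

# Would be called after training ends, before building the final inference pipeline.
report_and_release_vram()
```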
Reproduction
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--caption_column="text" \
--resolution=1024 \
--random_flip \
--train_batch_size=1 \
--num_train_epochs=2 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=500 \
--learning_rate=1e-04 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--dataloader_num_workers=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora-sdxl-txt" \
--train_text_encoder \
--validation_prompt="cute dragon creature" \
--report_to="wandb" \
--mixed_precision="bf16" \
--rank=4
Logs
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--caption_column="text" \
--resolution=1024 \
--random_flip \
--train_batch_size=1 \
--num_train_epochs=2 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=500 \
--learning_rate=1e-04 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--dataloader_num_workers=0 \
--seed=42 \
--output_dir="sd-pokemon-model-lora-sdxl-txt" \
--train_text_encoder \
--validation_prompt="cute dragon creature" \
--report_to="wandb" \
--mixed_precision="bf16" \
--rank=4
08/23/2023 11:18:30 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type'} was not found in config. Values will be initialized to default values.
wandb: Currently logged in as: mnslarcher. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in /home/mnslarcher/ai/hands/wandb/run-20230823_111845-ngknp8t5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run bumbling-brook-7
wandb: ⭐️ View project at https://wandb.ai/mnslarcher/text2image-fine-tune
wandb: 🚀 View run at https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
08/23/2023 11:18:49 - INFO - __main__ - ***** Running training *****
08/23/2023 11:18:49 - INFO - __main__ - Num examples = 833
08/23/2023 11:18:49 - INFO - __main__ - Num Epochs = 2
08/23/2023 11:18:49 - INFO - __main__ - Instantaneous batch size per device = 1
08/23/2023 11:18:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
08/23/2023 11:18:49 - INFO - __main__ - Gradient Accumulation steps = 1
08/23/2023 11:18:49 - INFO - __main__ - Total optimization steps = 1666
Steps: 30%|████████████████████████████████████████▌ | 500/1666 [08:05<19:20, 1.00it/s, lr=0.0001, step_loss=0.0274]08/23/2023 11:26:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/pytorch_lora_weights.safetensors
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/optimizer.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/scheduler.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/random_states_0.pkl
08/23/2023 11:26:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Steps: 50%|████████████████████████████████████████████████████████████████████ | 833/1666 [13:28<13:20, 1.04it/s, lr=0.0001, step_loss=0.134]08/23/2023 11:32:18 - INFO - __main__ - Running validation...
Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 0/7 [00:00<?, ?it/s]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 3/7 [00:00<00:00, 27.36it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 62.86it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
Steps: 60%|████████████████████████████████████████████████████████████████████████████████▍ | 1000/1666 [17:06<10:46, 1.03it/s, lr=0.0001, step_loss=0.0518]08/23/2023 11:35:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/pytorch_lora_weights.safetensors
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/optimizer.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/scheduler.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/random_states_0.pkl
08/23/2023 11:35:56 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Steps: 90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 1500/1666 [25:13<02:39, 1.04it/s, lr=0.0001, step_loss=0.0561]08/23/2023 11:44:02 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/pytorch_lora_weights.safetensors
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/optimizer.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/scheduler.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/random_states_0.pkl
08/23/2023 11:44:03 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1666/1666 [27:55<00:00, 1.04it/s, lr=0.0001, step_loss=0.0427]08/23/2023 11:46:45 - INFO - __main__ - Running validation...
Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 0/7 [00:00<?, ?it/s]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 3/7 [00:00<00:00, 27.61it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 63.49it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
Model weights saved in sd-pokemon-model-lora-sdxl-txt/pytorch_lora_weights.safetensors
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 0/7 [00:00<?, ?it/s]
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 1/7 [00:00<00:05, 1.13it/s]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
{'attention_type'} was not found in config. Values will be initialized to default values.
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.██████▎ | 4/7 [00:03<00:02, 1.05it/s]
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.71it/s]
Loading unet.ine components...: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 6/7 [00:04<00:00, 1.66it/s]
Loading text_encoder.
Loading text_encoder_2.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00, 4.31it/s]
Traceback (most recent call last):█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00, 4.31it/s]
File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1505, in <module>
main(args)
File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1458, in main
images = [
File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1459, in <listcomp>
pipeline(
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py", line 845, in __call__
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 270, in decode
decoded = self._decode(z).sample
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 257, in _decode
dec = self.decoder(z)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/vae.py", line 271, in forward
sample = up_block(sample, latent_embeds)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2334, in forward
hidden_states = upsampler(hidden_states)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/resnet.py", line 164, in forward
hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: | 0.042 MB of 0.042 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: train_loss ▂▆▂▁▄▃▅▄▂▃▁▂▁▁▄▂▁▄▁▃▅▁▂▆▁▁▅▄▃▁▄▆▄█▅▁▇▂▅▁
wandb:
wandb: Run summary:
wandb: train_loss 0.04268
wandb:
wandb: 🚀 View run bumbling-brook-7 at: https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
wandb: Synced 6 W&B file(s), 2 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230823_111845-ngknp8t5/logs
Traceback (most recent call last):
File "/home/mnslarcher/anaconda3/envs/hands/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
simple_launcher(args)
File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mnslarcher/anaconda3/envs/hands/bin/python', 'train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--dataset_name=lambdalabs/pokemon-blip-captions', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--num_train_epochs=2', '--gradient_accumulation_steps=1', '--checkpointing_steps=500', '--learning_rate=1e-04', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--dataloader_num_workers=0', '--seed=42', '--output_dir=sd-pokemon-model-lora-sdxl-txt', '--train_text_encoder', '--validation_prompt=cute dragon creature', '--report_to=wandb', '--mixed_precision=bf16', '--rank=4']' returned non-zero exit status 1.
System Info
OS: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090
diffusers-cli env:
- `diffusers` version: 0.21.0.dev0
- Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Huggingface_hub version: 0.16.4
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: NO
environment.yml (conda):
name: myenv
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
      - accelerate==0.21.0
      - "black[jupyter]==23.7.0"
      - datasets==2.14.4
      - git+https://github.com/huggingface/diffusers
      - ftfy==6.1.1
      - gradio==3.40.1
      - isort==5.12.0
      - Jinja2==3.1.2
      - tensorboard==2.14.0
      - torch==2.0.1
      - torchvision==0.15.2
      - transformers==4.31.0
      - wandb==0.15.8
Who can help?
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 20 (18 by maintainers)
Thanks for sharing. Ccing @muellerzr for https://github.com/huggingface/diffusers/issues/4736#issuecomment-1690786925.