diffusers: Kandinsky 3.0 "CUDA Out of memory" error
Describe the bug
Kandinsky 3.0 fails with a CUDA "out of memory" error as soon as the pipeline starts running.
Other models such as SDXL work fine, and calls like pipe.to('cuda') succeed without problems, but with Kandinsky 3 the pipeline runs out of memory.
GPU: 1x T4 (Google Colab)
Reproduction
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
prompt = "Any prompt"
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0] # <-- the OutOfMemoryError is raised here
image.save('1.png')
Logs
Loading pipeline components...: 100%
5/5 [00:02<00:00, 1.75it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%
5/5 [00:01<00:00, 3.13it/s]
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
<ipython-input-1-7c6f4c265399> in <cell line: 10>()
8
9 generator = torch.Generator(device="cpu").manual_seed(0)
---> 10 image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
18 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
1156 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1157 non_blocking, memory_format=convert_to_format)
-> 1158 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
1159
1160 return self._apply(convert)
OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 24.81 MiB is free. Process 79636 has 14.72 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 1.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
System Info
- diffusers version: 0.25.0.dev0
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35 (Google Colab)
- Python version: 3.10.12
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- Huggingface_hub version: 0.19.4
- Transformers version: 4.35.2
- Accelerate version: 0.25.0
- xFormers version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
About this issue
- State: closed
- Created 7 months ago
- Comments: 15 (5 by maintainers)
@standardAI, oh, it finally works when I use this method. Thanks for the help!
Well, that's interesting: it seems having bitsandbytes installed was the reason it was so slow for me. If I uninstall it and restart the runtime, I get a similar speed to your Colab.
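In a Colab notebook that would be something like the cell below, followed by a runtime restart so the change takes effect:

!pip uninstall -y bitsandbytes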
I tested it on a 3090 with enable_model_cpu_offload() and it uses 21 GB of VRAM for inference, going up to 28 GB when decoding the latents, so without pipe.enable_sequential_cpu_offload() you'll need at least a V100 GPU.
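For anyone else hitting this on a 16 GB card like the Colab T4, here is a minimal sketch of the reproduction script with enable_sequential_cpu_offload() swapped in for enable_model_cpu_offload(). This is presumably the method that made it work above, but I'm assuming that rather than confirming it:

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)
# Keep weights on the CPU and move submodules to the GPU one at a time
# (instead of whole models, as enable_model_cpu_offload() does); much slower,
# but peak VRAM stays far below the 21-28 GB reported above.
pipe.enable_sequential_cpu_offload()

prompt = "Any prompt"
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
image.save("1.png")

The trade-off is speed: sequential offload shuttles weights to the GPU layer by layer, so each inference step takes noticeably longer than with model-level offload.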