diffusers: Kandinsky 3.0 "CUDA Out of memory" error

Describe the bug

Kandinsky 3.0 fails with a CUDA "Out of memory" error as soon as the pipeline starts running.

Other models such as SDXL run without problems, and lines like pipe.to("cuda") work fine, but with Kandinsky 3 the pipeline runs out of memory.

GPU: 1x T4 GPU (Google colab)

Reproduction

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
        
prompt = "Any prompt"

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]  # <-- OutOfMemoryError is raised here

image.save('1.png')

Logs

Loading pipeline components...: 100%
5/5 [00:02<00:00, 1.75it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%
5/5 [00:01<00:00, 3.13it/s]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-1-7c6f4c265399> in <cell line: 10>()
      8 
      9 generator = torch.Generator(device="cpu").manual_seed(0)
---> 10 image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

18 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in convert(t)
   1156                 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1157                             non_blocking, memory_format=convert_to_format)
-> 1158             return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
   1159 
   1160         return self._apply(convert)

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 24.81 MiB is free. Process 79636 has 14.72 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 1.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
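
For what it's worth, the allocator hint at the end of the traceback can be followed by setting PYTORCH_CUDA_ALLOC_CONF before torch initializes CUDA. This is only a fragmentation mitigation (a sketch, with an arbitrary example value of 128); it will not make a model fit that needs more than the T4's ~15 GiB:

import os

# Allocator tuning suggested by the error message; must be set before torch
# initializes CUDA. This only reduces fragmentation, it does not free VRAM.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch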

System Info

  • diffusers version: 0.25.0.dev0
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35 (Google Colab)
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Huggingface_hub version: 0.19.4
  • Transformers version: 4.35.2
  • Accelerate version: 0.25.0
  • xFormers version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@yiyixuxu @patrickvonplaten

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

Hi @SunSual. You can use pipe.enable_sequential_cpu_offload()

@standardAI , Oh, it finally works when I use this method. Thanks for the help!
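
For reference, a minimal sketch of the working T4 setup, assuming the same kandinsky-community/kandinsky-3 checkpoint as in the reproduction above (sequential offload keeps only the currently active submodule on the GPU, trading speed for memory):

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
)

# Offload submodules one at a time instead of whole models; this keeps the
# GPU footprint low enough for a 16 GB T4, at the cost of slower inference.
pipe.enable_sequential_cpu_offload()

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe("Any prompt", num_inference_steps=25, generator=generator).images[0]
image.save("1.png")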

Well, that's interesting: it seems having bitsandbytes installed was the reason it was so slow for me. If I uninstall it and restart the runtime, I get a similar speed to your Colab.

I tested it on a 3090 with enable_model_cpu_offload(): it uses 21 GB of VRAM for inference and peaks at 28 GB when decoding the latents, so without pipe.enable_sequential_cpu_offload() you'll need at least a V100 GPU.
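
If anyone wants to reproduce those numbers, a quick sketch for measuring the GPU memory peak of a single pipeline call (standard torch.cuda memory stats, assuming a pipe object set up as above):

import torch

# Clear the running peak counter, run one generation, then read the peak.
torch.cuda.reset_peak_memory_stats()
image = pipe("Any prompt", num_inference_steps=25).images[0]

# Peak memory allocated by tensors on the GPU during the call, in GiB.
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")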