llama.cpp: Multi-GPU out-of-memory issue with Vulkan
Running llama.cpp #5832 (9731134296af3a6839cd682e51d9c2109a871de5)
I’m trying to load a model across two GPUs with Vulkan. My GPUs have 20 GB and 11 GB of VRAM.
Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with `-ts "20,11" -c 512` yields:
```
ggml ctx size = 0.62 MiB
offloading 60 repeating layers to GPU
offloading non-repeating layers to GPU
offloaded 61/61 layers to GPU
Vulkan0 buffer size = 17458.44 MiB
Vulkan1 buffer size = 9088.14 MiB
CPU buffer size = 358.90 MiB
Vulkan0 KV buffer size = 80.00 MiB
Vulkan1 KV buffer size = 40.00 MiB
KV self size = 120.00 MiB, K (f16): 60.00 MiB, V (f16): 60.00 MiB
Vulkan_Host input buffer size = 16.01 MiB
Vulkan0 compute buffer size = 113.00 MiB
Vulkan1 compute buffer size = 139.00 MiB
Vulkan_Host compute buffer size = 14.00 MiB
ggml_vulkan: Device memory allocation of size 120422400 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
```
The math doesn’t seem to add up.
A Q5_K_M quant at 22.65 GiB (5.66 BPW) works perfectly fine until I increase the context to 4096.
This can’t possibly be the context, right? When using HIP on smaller models, I have to push them much harder to OOM; I should be fine with 31 GB of VRAM. Any idea why this happens?
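For reference, summing the buffers reported above (these totals don’t include any scratch allocations the backend makes, or whatever the driver and desktop are already using):

```
Vulkan0: 17458.44 + 80.00 + 113.00 = 17651.44 MiB   (model + KV + compute, 20 GB card)
Vulkan1:  9088.14 + 40.00 + 139.00 =  9267.14 MiB   (model + KV + compute, 11 GB card)
Failed allocation: 120422400 bytes ≈ 114.8 MiB
```

On paper that leaves more than a gigabyte of headroom on each card, which is why these numbers alone don’t explain the OOM to me.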
Sorry about that, I didn’t know that depended on who closed the issue.
I think it’s a case of memory fragmentation, and it would work if you ran it without a GUI running on the GPUs. Depending on your setup, though, that might be difficult to try.
I think it’s the 7900 XT that’s running out of memory with q6_k. I added info about which device is allocating to the debug output; can you run q5 and q6 again? No need to let q5 run through, just the prompt processing is enough. We can then figure out how much memory it tried to allocate before running out.
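For illustration, the per-device logging I mean is along these lines (a minimal sketch, not the actual patch; the function and parameter names are made up):

```cpp
#include <cstdio>
#include <vulkan/vulkan.h>

// Sketch: report which physical device an allocation is about to target, so an
// out-of-memory error can be attributed to a specific GPU. Illustrative only,
// not the actual ggml-vulkan change.
static void log_vk_alloc(VkPhysicalDevice physical_device, VkDeviceSize size) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physical_device, &props);
    std::fprintf(stderr, "ggml_vulkan: allocating %llu bytes on %s\n",
                 (unsigned long long) size, props.deviceName);
}
```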
@lastrosade Please check if #6155 fixes your problem.
Apologies for not getting back to you sooner; I was too busy last week. Your logs show that the size of the dequant buffer is the problem here. Because I didn’t have proper matmul dequant shaders for the k-quants yet (and also hadn’t updated the buffer size logic yet), they use quite a bit of VRAM. Too much for your setup with q5_k and q6_k.
The good news is that I have now implemented the k-quant matmul shaders and will update the buffer size logic to take this into account. That should save you a few hundred megabytes of VRAM and hopefully solve this issue. I’ll let you know when you can test this.
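To give a sense of the scale involved: without a matmul shader that can read a quantized format directly, the weight matrix has to be dequantized into a temporary buffer first, so the scratch needed grows with the full matrix size rather than the quantized size. A rough, illustrative calculation (my own example numbers, not the actual buffer-size logic):

```cpp
#include <cstdint>
#include <cstdio>

// Rough illustration only (not the actual ggml-vulkan sizing logic): dequantizing
// a quantized weight matrix into a temporary f16 buffer before a regular matmul
// needs scratch proportional to the full matrix, not to the quantized data.
static size_t dequant_scratch_bytes(int64_t rows, int64_t cols, size_t elem_size) {
    return (size_t) rows * (size_t) cols * elem_size;
}

int main() {
    // Example: one large FFN weight in a Mixtral-style 8x7B expert (14336 x 4096)
    // with 2-byte f16 elements comes to ~112 MiB of scratch for a single matrix.
    double mib = dequant_scratch_bytes(14336, 4096, 2) / (1024.0 * 1024.0);
    std::printf("dequant scratch: %.1f MiB\n", mib);
    return 0;
}
```

With a matmul shader that reads the quantized blocks directly, that temporary buffer shouldn’t be needed at all, which is where most of the savings come from.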
Thank you for taking the time to help me.
Here’s the output from a run with `LLAMA_VULKAN_DEBUG=1`: dbg.txt
And one with both `LLAMA_VULKAN_DEBUG=1` and `LLAMA_VULKAN_VALIDATE=1`: dbg.txt
The model is an 8x7B of size 23.6 GB. It fails to allocate 1.3 GB of VRAM, even though I can easily use cuBLAS on my 1080 Ti to fill it to about 11.8 GB, or HIP on my 7900 XT up to 20 GB, without issues. Or is that additive? I don’t know. As for the custom error message, I have no idea how I would do that.