llama.cpp: Multi-GPU out-of-memory issue with Vulkan
Running llama.cpp #5832 (9731134296af3a6839cd682e51d9c2109a871de5)
I’m trying to load a model across two GPUs with Vulkan. My GPUs have 20 GB and 11 GB of VRAM.
Loading a Q6_K quant of size 26.27 GiB (6.56 BPW) with `-ts "20,11" -c 512` yields:
```
ggml ctx size = 0.62 MiB
offloading 60 repeating layers to GPU
offloading non-repeating layers to GPU
offloaded 61/61 layers to GPU
Vulkan0 buffer size = 17458.44 MiB
Vulkan1 buffer size = 9088.14 MiB
CPU buffer size = 358.90 MiB
Vulkan0 KV buffer size = 80.00 MiB
Vulkan1 KV buffer size = 40.00 MiB
KV self size = 120.00 MiB, K (f16): 60.00 MiB, V (f16): 60.00 MiB
Vulkan_Host input buffer size = 16.01 MiB
Vulkan0 compute buffer size = 113.00 MiB
Vulkan1 compute buffer size = 139.00 MiB
Vulkan_Host compute buffer size = 14.00 MiB
ggml_vulkan: Device memory allocation of size 120422400 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
```
The math doesn’t seem to add up.
A Q5_K_M quant at 22.65 GiB (5.66 BPW) works perfectly fine until I increase the context to 4096.
This can’t possibly be the context, right? When using HIP on smaller models, I have to push them much harder to OOM; I should be fine with 31 GB of VRAM. Any idea why this happens?
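For reference, summing the buffers reported above (these totals don’t include any scratch allocations the backend makes, or whatever the driver and desktop are already using):

```
Vulkan0: 17458.44 + 80.00 + 113.00 = 17651.44 MiB   (model + KV + compute, 20 GB card)
Vulkan1:  9088.14 + 40.00 + 139.00 =  9267.14 MiB   (model + KV + compute, 11 GB card)
Failed allocation: 120422400 bytes ≈ 114.8 MiB
```

On paper that leaves more than a gigabyte of headroom on each card, which is why these numbers alone don’t explain the OOM to me.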
Sorry about that, I didn’t know that depended on who closed the issue.
I think it’s a case of memory fragmentation, and it would work if you ran it without a GUI running on the GPUs. Depending on your setup, though, that might be difficult to try.
I think it’s the 7900 XT that’s running out of memory with q6_k. I added info about which device is allocating to the debug output; can you run q5 and q6 again? No need to let q5 run through, just the prompt processing is enough. We can then figure out how much memory it tried to allocate before running out.
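For illustration, the per-device logging I mean is along these lines (a minimal sketch, not the actual patch; the function and parameter names are made up):

```cpp
#include <cstdio>
#include <vulkan/vulkan.h>

// Sketch: report which physical device an allocation is about to target, so an
// out-of-memory error can be attributed to a specific GPU. Illustrative only,
// not the actual ggml-vulkan change.
static void log_vk_alloc(VkPhysicalDevice physical_device, VkDeviceSize size) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physical_device, &props);
    std::fprintf(stderr, "ggml_vulkan: allocating %llu bytes on %s\n",
                 (unsigned long long) size, props.deviceName);
}
```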
@lastrosade Please check if #6155 fixes your problem.
Apologies for not getting back to you sooner; I was too busy last week. Your logs show that the size of the dequant buffer is the problem here. Because I didn’t have proper matmul dequant shaders for the k-quants yet (and also hadn’t updated the buffer size logic yet), they use quite a bit of VRAM. Too much for your setup with q5_k and q6_k.
The good news is that I have now implemented the k-quant matmul shaders and will update the buffer size logic to take this into account. That should save you a few hundred megabytes of VRAM and hopefully solve this issue. I’ll let you know when you can test this.
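To give a sense of the scale involved: without a matmul shader that can read a quantized format directly, the weight matrix has to be dequantized into a temporary buffer first, so the scratch needed grows with the full matrix size rather than the quantized size. A rough, illustrative calculation (my own example numbers, not the actual buffer-size logic):

```cpp
#include <cstdint>
#include <cstdio>

// Rough illustration only (not the actual ggml-vulkan sizing logic): dequantizing
// a quantized weight matrix into a temporary f16 buffer before a regular matmul
// needs scratch proportional to the full matrix, not to the quantized data.
static size_t dequant_scratch_bytes(int64_t rows, int64_t cols, size_t elem_size) {
    return (size_t) rows * (size_t) cols * elem_size;
}

int main() {
    // Example: one large FFN weight in a Mixtral-style 8x7B expert (14336 x 4096)
    // with 2-byte f16 elements comes to ~112 MiB of scratch for a single matrix.
    double mib = dequant_scratch_bytes(14336, 4096, 2) / (1024.0 * 1024.0);
    std::printf("dequant scratch: %.1f MiB\n", mib);
    return 0;
}
```

With a matmul shader that reads the quantized blocks directly, that temporary buffer shouldn’t be needed at all, which is where most of the savings come from.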
Thank you for taking the time to help me.
Here’s the output from a run with `LLAMA_VULKAN_DEBUG=1`: dbg.txt
And one with both `LLAMA_VULKAN_DEBUG=1` and `LLAMA_VULKAN_VALIDATE=1`: dbg.txt
The model is an 8x7B of size 23.6 GB. It fails to allocate 1.3 GB of VRAM, even though I can easily use cuBLAS on my 1080 Ti to fill it to about 11.8 GB, or HIP on my 7900 XT up to 20 GB, without issues. Or is that additive? I don’t know. As for the custom error message, I have no idea how I would do that.