llama.cpp: CUDA out of memory - but there's plenty of memory

TLDR: When offloading all layers to GPU, RAM usage is the same as if no layers were offloaded. In situations where VRAM is sufficient to load the model but RAM is not, a CUDA out-of-memory error occurs even though there is plenty of VRAM still available.

System specs:
  • OS: Windows + conda
  • CPU: 13900K
  • RAM: 32GB DDR5
  • GPU: 2x RTX 3090 (48GB total VRAM)

When trying to load a 65B ggml 4-bit model, regardless of how many layers I offload to the GPUs, system RAM fills up and I get a CUDA out-of-memory error.

I’ve tried with all 80 layers offloaded to the GPUs and with no layers offloaded at all, and RAM usage doesn’t change in either scenario. There is still about 12GB of total VRAM free when the out-of-memory error is thrown.

Screenshot of RAM / VRAM usage with all layers offloaded to GPUs: https://i.imgur.com/vTl04qL.png
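For anyone trying to reproduce this, here is a quick standalone check (plain CUDA runtime calls, nothing llama.cpp-specific) that confirms how much VRAM each card reports as free around the time the error fires:

// Minimal standalone VRAM check, not part of llama.cpp.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    for (int i = 0; i < n_devices; ++i) {
        cudaSetDevice(i);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);   // free/total bytes on the current GPU
        printf("device %d: %.0f MB free of %.0f MB\n",
               i, free_b / 1048576.0, total_b / 1048576.0);
    }
    return 0;
}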

Interestingly, system RAM usage hits a ceiling while the model is loading, but the error isn’t thrown until the end of the loading sequence. If I had to guess at what’s happening, I’d say llama.cpp isn’t freeing the host-side buffers after their contents have been offloaded, so when CUDA later goes to use some system memory it can’t see any as available and crashes.
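To illustrate the pattern I’d expect (a hypothetical sketch of a loader step, not llama.cpp’s actual code), each offloaded tensor should only need a temporary host staging buffer that is released as soon as the copy to VRAM completes:

// Hypothetical loader step (sketch only, not llama.cpp's real code):
// stage one tensor in host RAM, copy it to VRAM, then let the staging
// buffer be freed before the next tensor is read.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

static void offload_tensor(FILE *f, size_t n_bytes, void **dev_ptr) {
    std::vector<unsigned char> staging(n_bytes);        // temporary host buffer
    fread(staging.data(), 1, n_bytes, f);               // read weights from disk
    cudaMalloc(dev_ptr, n_bytes);                       // allocate VRAM for this tensor
    cudaMemcpy(*dev_ptr, staging.data(), n_bytes,
               cudaMemcpyHostToDevice);                 // push weights to the GPU
    // 'staging' is destroyed here, so peak RAM stays around one tensor
    // rather than the whole model. That release is what seems to be
    // missing or delayed in practice.
}

Full console output from a run with -ngl 80: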

E:\llama.cpp release 254a7a7>main -t 8 -n -1 -ngl 80 --color -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first  -m ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin -i -ins
main: build = 670 (254a7a7)
main: seed  = 1686799791
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
  Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from ../models/ggml-LLaMa-65B-quantized/ggml-LLaMa-65B-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.18 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llama_model_load_internal: mem required  = 10814.46 MB (+ 5120.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 64 layers to GPU
llama_model_load_internal: total VRAM used: 28308 MB
....................................................................................................
llama_init_from_file: kv self size  = 5120.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.200000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

CUDA error 2 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:2342: out of memory

Bonus: Without -ngl set, loading succeeds and I actually get a few tokens’ worth of inference before CUDA error 2 at D:\AI\llama.cpp\ggml-cuda.cu:994: out of memory is thrown. The model needs ~38GB of RAM and I only have 32GB, so I assume it’s hitting the swapfile, but with no layers offloaded it’s odd that the error still comes from CUDA.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 23 (1 by maintainers)

Most upvoted comments

Agreed, it seems counter-intuitive: why would you need RAM if the layers are going to end up in VRAM? Why buffer the entire model in RAM before passing it to the GPU in the first place?
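Something like the streaming pattern below (a hypothetical sketch, not the actual llama.cpp loader) would only ever need a small, reused host buffer regardless of model size:

// Hypothetical streaming upload (sketch, not llama.cpp code): reuse one small
// pinned host buffer for every chunk instead of materialising the whole model
// in system RAM first.
#include <cuda_runtime.h>
#include <cstdio>

static void stream_to_gpu(FILE *f, size_t total_bytes, void *dev_base) {
    const size_t chunk = 64 * 1024 * 1024;         // 64 MB reusable staging buffer
    void *host_buf = nullptr;
    cudaMallocHost(&host_buf, chunk);              // pinned memory for fast host-to-device copies
    size_t done = 0;
    while (done < total_bytes) {
        size_t n = total_bytes - done < chunk ? total_bytes - done : chunk;
        fread(host_buf, 1, n, f);                  // read the next slice from disk
        cudaMemcpy((char *)dev_base + done, host_buf, n,
                   cudaMemcpyHostToDevice);        // copy the slice into VRAM
        done += n;
    }
    cudaFreeHost(host_buf);                        // host high-water mark stays around the chunk size
}

With a fixed-size pinned staging buffer, host RAM usage stays around the chunk size instead of the full ~38GB of the model.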

@LoganDark For something like this please make a separate issue rather than commenting on an existing, unrelated issue.

Saw a new build come through (a09f919) and the issue persists. If I up my RAM to 64GB it runs fine, like you say. But surely, when I have 48GB of VRAM and the model needs 38GB of memory, I shouldn’t need to use any RAM, should I?