gpt4all: Fallback to CPU with OOM even though GPU *should* have more than enough

System Info

version: 1.0.12
platform: Windows
python: 3.11.4
graphics card: NVIDIA RTX 4090 (24 GB)

Information

  • The official example notebooks/scripts
  • My own modified scripts

Reproduction

Run the following code:

from gpt4all import GPT4All
model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu') # device='amd', device='intel'
output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)
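
For completeness, a minimal sketch of an explicit CPU fallback, assuming a failed GPU load raises an exception (on 1.0.12 the library may instead fall back silently, as the issue title suggests):

from gpt4all import GPT4All

MODEL = "wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0"  # same model as above

# Sketch only: try the GPU first, then retry explicitly on CPU.
# Assumes the constructor raises when the Vulkan allocation fails.
try:
    model = GPT4All(MODEL, device='gpu')
except Exception:
    model = GPT4All(MODEL, device='cpu')

output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)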

Expected behavior

The model should load fully into the 4090's 24 GB of VRAM. Instead it fails to load on the GPU with the following output:

Found model file at C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama.cpp: loading model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.73 MB
Error allocating memory ErrorOutOfDeviceMemory
error loading model: Error allocating vulkan memory.
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
LLaMA ERROR: prompt won't work with an unloaded model!

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 24 (5 by maintainers)

Most upvoted comments

I can’t load a Q4_0 into VRAM on either of my 4090s, each with 24 GB.

Just so you’re aware, GPT4All uses a completely different GPU backend than the other LLM apps you’re familiar with - it’s an original implementation based on Vulkan. It’s still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it’s easy to support NVIDIA, AMD, and Intel all with the same code.

exllama2 is great if you have two 4090s - GPT4All in its current state probably isn’t for you, as it definitely doesn’t take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven’t implemented yet, but plan to.
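
For anyone unfamiliar with the term, this is roughly what partial offload looks like in llama-cpp-python (a different binding, not GPT4All's API; the layer count below is just an illustrative guess for this 13B model):

# Illustration only: partial GPU offload as exposed by llama-cpp-python,
# not something GPT4All supports today.
from llama_cpp import Llama

llm = Llama(
    model_path="wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin",
    n_gpu_layers=32,  # hypothetical split: ~32 of the 40 layers on the GPU, the rest on CPU
    n_ctx=2048,
)
print(llm("Write a Tetris game in python scripts", max_tokens=512)["choices"][0]["text"])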

I believe what manyoso is saying is that our Vulkan backend currently requires a contiguous chunk of memory to be available, as it allocates one big chunk instead of smaller chunks like other machine learning frameworks do. This means it would probably work fine if you didn’t have other things using small chunks in the middle of your VRAM. We still intend to fix this issue 😃