gpt4all: Falls back to CPU with an OOM error even though the GPU *should* have more than enough memory
System Info
version: 1.0.12
platform: windows
python: 3.11.4
graphics card: nvidia rtx 4090 24gb
Information
- The official example notebooks/scripts
- My own modified scripts
Reproduction
Run the following code:
from gpt4all import GPT4All
model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu') # device='amd', device='intel'
output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)
Expected behavior
Found model file at C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama.cpp: loading model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 6983.73 MB
Error allocating memory ErrorOutOfDeviceMemory
error loading model: Error allocating vulkan memory.
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from C:\Users\earne\.cache\gpt4all\wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin
LLaMA ERROR: prompt won't work with an unloaded model!
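Until the allocator handles fragmented VRAM, one possible workaround is to fall back to CPU inference explicitly. This is only a sketch against the gpt4all 1.0.12 Python bindings; in this version the GPU failure may surface as a printed error rather than a raised exception, in which case passing device='cpu' directly is the reliable option.

from gpt4all import GPT4All

try:
    # Try the Vulkan GPU backend first.
    model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='gpu')
except Exception:
    # If model loading raises (behavior may vary by version), retry on CPU.
    model = GPT4All("wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0", device='cpu')

output = model.generate("Write a Tetris game in python scripts", max_tokens=4096)
print(output)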
Commits related to this issue
- Fix VRAM leak when model loading fails (#1901) Signed-off-by: Jared Van Bortel <jared@nomic.ai> — committed to nomic-ai/gpt4all by cebtenzzre 5 months ago
Just so you’re aware, GPT4All uses a completely different GPU backend than the other LLM apps you’re familiar with - it’s an original implementation based on Vulkan. It’s still in its early stages (because bugs like this need to be fixed before it can be considered mature), but the main benefit is that it’s easy to support NVIDIA, AMD, and Intel all with the same code.
exllama2 is great if you have two 4090s - GPT4All in its current state probably isn’t for you, as it definitely doesn’t take full advantage of your hardware. But many of our users do not have access to such impressive GPUs (myself included) and benefit from features that llama.cpp makes it relatively easy to support, such as partial GPU offload - which we haven’t implemented yet, but plan to.
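For context, this is roughly what partial GPU offload looks like in the upstream llama-cpp-python bindings (not the GPT4All API, which does not expose an equivalent option yet); the model path and layer count are placeholders you would tune to your VRAM.

# Sketch using the separate llama-cpp-python bindings, NOT GPT4All.
# n_gpu_layers controls how many of the model's layers live in VRAM;
# the remaining layers stay in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="wizardlm-13b-v1.1-superhot-8k.ggmlv3.q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=32,  # offload 32 of this model's 40 layers; lower it if VRAM is tight
)
print(llm("Write a Tetris game in Python", max_tokens=512)["choices"][0]["text"])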
I believe what manyoso is saying is that our Vulkan backend currently requires a contiguous chunk of memory to be available, as it allocates one big chunk instead of smaller chunks like other machine learning frameworks do. This means it would probably work fine if you didn’t have other things using small chunks in the middle of your VRAM. We still intend to fix this issue 😃
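To make the fragmentation point concrete, here is a toy illustration in plain Python (all numbers invented, loosely based on the ~6984 MB ggml ctx size in the log above): the total free VRAM can exceed the request while no single contiguous gap is large enough for a one-chunk allocator.

# Toy model of a fragmented VRAM pool; gap sizes are made up.
free_gaps_mb = [4096, 3072, 2048, 1024]   # ~10 GB free in total, split into gaps
request_mb = 6984                          # roughly the 13B Q4_0 ggml ctx size

total_free = sum(free_gaps_mb)
largest_gap = max(free_gaps_mb)

# A single contiguous allocation fails even though plenty of memory is free overall:
print("single-chunk allocation fits:", largest_gap >= request_mb)   # False
print("chunked allocation fits:     ", total_free >= request_mb)    # True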