ollama: Out of memory error after updating to version 0.1.13 on a model that previously worked fine
I configured a model to run entirely in VRAM using the following Modelfile:
FROM deepseek-coder:33b-instruct-q5_K_S
PARAMETER num_gpu 65
PARAMETER num_ctx 2048
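(For reference, I create and run it roughly like this; the name deepseek-coder-vram is just what I happen to call it locally, and the file above is saved as Modelfile:)
ollama create deepseek-coder-vram -f Modelfile
ollama run deepseek-coder-vram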
I had no issues running this; it would use about 22GB of my 4090's 24GB VRAM and generate responses very quickly, which was very helpful for getting quick answers to short coding queries.
However, yesterday I updated Ollama (to 0.1.13), and now I cannot run the same model. I get an out of memory error, despite the model not needing more than 22.5GB (according to the logs below).
I run Ollama on a headless Linux server, so no other applications are using the GPU.
Was there a change in how Ollama allocates VRAM that makes it need more than before? Is there a way to configure Ollama so that it behaves the same way as it did previously?
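(The only workaround I can think of would be offloading fewer layers so the model no longer has to fit entirely in VRAM, e.g. something like the sketch below with an arbitrarily lower layer count, but that defeats the point of the original setup:)
FROM deepseek-coder:33b-instruct-q5_K_S
PARAMETER num_gpu 60
PARAMETER num_ctx 2048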
EDIT: Reverting to Ollama version 0.1.11 resolves the issue for now.
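(For anyone else hitting this: I downgraded by stopping the service and swapping in the 0.1.11 release binary, roughly as below; the exact release asset name and install path may differ on your setup:)
sudo systemctl stop ollama
sudo curl -L https://github.com/jmorganca/ollama/releases/download/v0.1.11/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
sudo systemctl start ollama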
Error:
Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: offloaded 65/65 layers to GPU
Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: VRAM used: 21741.89 MiB
Dec 04 16:28:23 osm-server ollama[528776]: ....................................................................................................
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: n_ctx = 2048
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_base = 100000.0
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_scale = 0.25
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading v cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading k cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: VRAM kv self = 496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: kv self size = 496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_build_graph: non-view tensors processed: 1430/1430
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: compute buffer total size = 273.07 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"main","line":2917,"message":"HTTP server listening","hostname":"127.0.0.1","port":57264}
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":46990,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 04 16:28:24 osm-server ollama[528776]: 2023/12/04 16:28:24 llama.go:493: llama runner started in 4.401485 seconds
Dec 04 16:28:24 osm-server ollama[528776]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:24 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:25 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:510: llama runner stopped successfully
Dec 04 16:28:25 osm-server ollama[528776]: [GIN] 2023/12/04 - 16:28:25 | 200 | 6.468638351s | 127.0.0.1 | POST "/api/generate"
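(The request that triggers this is just an ordinary generate call to the local API, e.g. with the model name I used above:)
curl http://localhost:11434/api/generate -d '{"model": "deepseek-coder-vram", "prompt": "Hello"}'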
About this issue
- State: open
- Created 7 months ago
- Reactions: 1
- Comments: 21
@madsamjp, I tried it unsuccessfully with the next version up, v0.1.16; v0.1.15 cannot possibly work.
On Fri, Dec 15, 2023 at 4:17 PM Igor Schlumberger wrote:
Thanks for sharing this. We are looking into it. There is a release coming soon, 0.1.14, but I don't think a fix for this will be in there. Will let you know what we find. This is a bit strange.
What OS are you running? How did you install it?