ollama: Delays and slowness when using mixtral
It seems that as the context grows, the delay until the first output gets longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (command line).
Using CUDA on an RTX 3090. Tried out `mixtral:8x7b-instruct-v0.1-q4_K_M` (with CPU offloading) as well as `mixtral:8x7b-instruct-v0.1-q2_K` (completely in VRAM).
As a comparison, I tried `starling-lm:7b-alpha-q4_K_M`, which does not seem to exhibit any of these problems.
Sorry for the imprecise report; I'm running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
About this issue
- State: open
- Created 7 months ago
- Reactions: 3
- Comments: 29
@coder543 I understand that running the 5-bit model will be slow on a 4090 compared to running the 3-bit one. My comment was specifically in response to this point that @confuze made: "So, the only option I have is running this model on a CPU?". I've found that running this model with llama.cpp (via ooba) and partially offloading to the GPU works fine, whereas with Ollama it doesn't work without very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts I'm waiting about 1 minute before I start to get a response. The generation speed itself is not slow for me - about 10 tps.
My understanding of this thread was that Ollama seems to have progressively longer prompt eval times - even for models that fit entirely in VRAM. If this is because of a conscious decision the Ollama team has made, then it makes running Mixtral with Ollama unfeasible.
It seems that perhaps we are discussing separate issues in the same thread which is leading to confusion.
Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)
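In case it helps anyone else try this, here is roughly how I'd expect the build to go. This is only a sketch: it assumes llama.cpp is vendored as a git submodule under `llm/llama.cpp` and that the two PR branches merge cleanly; adjust the paths and refs to your checkout.

```sh
# Sketch: build ollama against the two llama.cpp PRs linked above.
# Assumes the llama.cpp submodule lives at llm/llama.cpp; adjust if not.
git clone https://github.com/jmorganca/ollama.git
cd ollama
git submodule update --init --recursive

# Fetch the PR heads into the vendored llama.cpp and merge them.
cd llm/llama.cpp
git fetch origin pull/4538/head:pr-4538
git fetch origin pull/4553/head:pr-4553
git merge pr-4538 pr-4553   # resolve conflicts here if they appear
cd ../..

# Regenerate the native bindings and build the ollama binary.
go generate ./...
go build .
```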
@djmaze that is strange, since I’m not encountering any unusual problems on my 3090.
Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.
This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.
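(Side note, in case anyone wants to reproduce these numbers: the per-response timing stats I'm quoting, including `prompt eval rate`, come from running ollama in verbose mode. The model tag below is just an example.)

```sh
# Prints timing stats (prompt eval count/rate, eval rate, total duration)
# after each response.
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M --verbose
```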
EDIT: testing `mistral` (instead of `mixtral`), I am seeing this after a similar situation:
The key differentiator is that the `prompt eval rate` is obviously way higher. As someone else linked to a PR which improved the prompt eval rate on the CPU, it isn't crazy to assume that the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.
EDIT 2: yes, using the llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It is processing prompt tokens at the same rate as `ollama`, it is just processing fewer of them because it does not appear to be re-evaluating the entire context window with each new prompt. The other `ollama` models suffer the same problems, they just seem to have a much higher `prompt eval rate` than `mixtral`, which helps to mask it.
The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded to the GPU, which speeds things up dramatically.
Make a `Modelfile`:
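Something like this is what the Modelfile would look like - a minimal sketch, assuming the `q3_K_S` library tag and a `num_gpu` value high enough to offload every layer (Mixtral 8x7B has 32 transformer layers, so 33 should cover all of them; lower it if you run out of VRAM):

```sh
# Sketch: write a Modelfile that fully offloads the q3_K_S quant to the GPU.
# The exact tag and num_gpu value are assumptions; adjust to your setup.
cat > Modelfile <<'EOF'
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33
EOF
```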
Then run `ollama create mixtral_gpu -f ./Modelfile`, and then you can run `ollama run mixtral_gpu` and see how it does.