ollama: Delays and slowness when using mixtral

It seems that as the context grows, the delay until the first output gets longer and longer, exceeding half a minute after a few prompts. Text generation also seems much slower than with the latest llama.cpp (command line).

Using CUDA on an RTX 3090. I tried mixtral:8x7b-instruct-v0.1-q4_K_M (with CPU offloading) as well as mixtral:8x7b-instruct-v0.1-q2_K (completely in VRAM).

For comparison, I tried starling-lm:7b-alpha-q4_K_M, which does not seem to exhibit any of these problems.

Sorry for the imprecise report, I'm running out of time right now. Does anyone have a similar experience with Mixtral, or is this expected behaviour with ollama? (First-time user here.)
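
For reference, the slowdown can be quantified with ollama's per-request timings. I believe passing --verbose to ollama run prints the prompt eval and generation rates after each response (a rough sketch, exact output may differ by version):

ollama run mixtral:8x7b-instruct-v0.1-q4_K_M --verbose
# after each reply, timings like these are printed:
#   prompt eval rate: ... tokens/s   <- time spent (re-)reading the context
#   eval rate:        ... tokens/s   <- generation speed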

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 3
  • Comments: 29

Most upvoted comments

@coder543 I understand that running the 5-bit model will be slow on a 4090 compared to the 3-bit one. My comment was specifically in response to this point that @confuze made: "So, the only option I have is running this model on a cpu?". I've found that running this model with llama.cpp (via ooba) and partially offloading to the GPU works fine, whereas with Ollama I get very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts I'm waiting about 1 minute before a response starts. Generation itself is not slow for me, about 10 tps.

My understanding of this thread was that Ollama shows progressively longer prompt eval times, even for models that fit entirely in VRAM. If that is due to a conscious decision the Ollama team has made, it makes running Mixtral with Ollama infeasible.

It seems we may be discussing separate issues in the same thread, which is leading to confusion.

Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)
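
In case it helps anyone else who wants to try this before the PRs land upstream, here is roughly what the rebuild looked like for me. The submodule path and build commands below are assumptions based on ollama's development docs at the time, so adjust them to your checkout:

git clone https://github.com/ollama/ollama.git && cd ollama
git submodule update --init --recursive
cd llm/llama.cpp                                 # assumed location of the vendored llama.cpp
git fetch origin pull/4538/head:pr-4538 && git merge pr-4538
git fetch origin pull/4553/head:pr-4553 && git merge pr-4553   # optional, for CPU+GPU inference
cd ../..
go generate ./...                                # regenerates the llama.cpp runners with the patches
go build .

Then serve with the locally built binary (./ollama serve) instead of the installed one.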

@djmaze that is strange, since I’m not encountering any unusual problems on my 3090.

total duration:       18.273218027s
prompt eval count:    1180 token(s)
prompt eval duration: 15.833678s
prompt eval rate:     74.52 tokens/s
eval count:           114 token(s)
eval duration:        2.391734s
eval rate:            47.66 tokens/s

Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.

This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.

EDIT: testing mistral (instead of mixtral), I am seeing this after a similar situation:

total duration:       2.244759039s
prompt eval count:    1211 token(s)
prompt eval duration: 421.415ms
prompt eval rate:     2873.65 tokens/s
eval count:           208 token(s)
eval duration:        1.774238s
eval rate:            117.23 tokens/s

The key differentiator is that the prompt eval rate is obviously way higher. Since someone else linked a PR that improved the prompt eval rate on the CPU, it isn't crazy to assume the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.

EDIT 2: yes, using the llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It processes prompt tokens at the same rate as ollama; it just processes fewer of them, because it does not appear to re-evaluate the entire context window with each new prompt. The other ollama models suffer from the same problem; they just have a much higher prompt eval rate than mixtral, which helps to mask it.
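
To illustrate the difference from the client side, here is a minimal sketch of a llama.cpp server request, assuming its /completion endpoint and the cache_prompt option (the model path, flags, and field names are placeholders and may vary between versions). With cache_prompt set, the server keeps the KV cache from the previous request and only evaluates the newly appended tokens instead of the whole transcript:

# llama.cpp server started with e.g.: ./server -m mixtral-q3_K_S.gguf -ngl 33 -c 4096
curl http://localhost:8080/completion -d '{
  "prompt": "<full chat transcript so far>\nUSER: next question\nASSISTANT:",
  "n_predict": 128,
  "cache_prompt": true
}'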

The default mixtral Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the q3_K_S model can be completely offloaded to the GPU, which speeds things up dramatically:

Make a Modelfile:

FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33

Then run ollama create mixtral_gpu -f ./Modelfile

Then you can run ollama run mixtral_gpu and see how it does.
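
As a quick sanity check that the layers actually ended up on the GPU (a sketch, exact output varies by version and setup):

ollama run mixtral_gpu --verbose   # prints prompt eval rate / eval rate after each reply
nvidia-smi                         # VRAM usage should be close to the full q3_K_S model size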