llama.cpp: Quantitative measurement of model perplexity for different models and model quantization modes

llama.cpp seems to give worse results than Facebook’s implementation.

Here’s an example simple reading comprehension prompt:

Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: "Paul bought a book

LLaMA 7B with Facebook’s implementation yields:

Seed 1:

Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book with pictures.” Asked by lone wolf 1788 days ago.

Seed 2 (to show that the above is not just a fluke):

Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book with pictures.” Question: "Tom, Mark, and Paul bought books: two with pictures and

While llama.cpp without quantization (so still float16) generates (with --seed 0 -t 8):

Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book that is neither as good nor bad.” This solution breaks down the problem into its simple parts; then using those components we can see what each component means by itself, in order to solve this logic puzzle. 1) Tom and Mark had different kinds of books…and so did Paul! (Therefore one out three were without pictures). … [end of text]

It even has a grammatical error at the end: “one out [of] three”

As you can see, the quality of the 7B model is higher with Facebook’s implementation. So I think there may still be bugs in your implementation, or the default parameters could be improved.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 53 (11 by maintainers)

Most upvoted comments

I am unable to run the f16 base model because I only have laptops with at most 16 GB of memory, and it seems the model needs about 14 GB of usable memory. Still, your comment in #270 appears to show that perplexity is improved by Q4_1, which I think is quite evident from the model’s output, especially once the output’s randomness is reduced.

I am not able to explain the higher quality of facebookresearch’s version over this one, which is what started this thread. From a quick look, they also use a simple top_p sampling strategy, which can be approximated here by setting repeat_penalty=1.0 and top_k to some high value like 10000. But in my experience, quality generally drops below acceptable once the temperature exceeds about 0.3, which might be due to the damage done by the 4-bit quantization, as far as I can tell.
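
For reference, here is a minimal sketch in Python/numpy of what I mean by a simple top_p (nucleus) sampling strategy. This is only an illustration of the idea, not the actual code of either implementation, and the function and parameter names are my own:

import numpy as np

def sample_top_p(logits, top_p=0.9, temperature=0.8, rng=np.random.default_rng()):
    # Softmax over temperature-scaled logits.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    # Renormalize over the kept tokens and sample one of them.
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

Setting top_k to a very large value effectively removes the top-k cutoff, leaving only this top_p truncation (plus temperature) in play.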

There is also some nondeterminism in the program. This is likely due to how ggml farms the computation out across threads, probably splitting the matrices by rows, which results in different accumulation orders and rounding errors that appear to slightly affect the results. For precisely repeatable results, it is not enough to use the same sampling seed; the same input batching and thread count are also needed, I think.
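
As a toy illustration of the accumulation-order effect (plain numpy, not ggml; the 4-way interleaved split just stands in for a hypothetical 4-thread row split):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000).astype(np.float32)
b = rng.standard_normal(1_000_000).astype(np.float32)

# One big dot product vs. the same dot product split into 4 interleaved parts
# and then combined, roughly the way a multi-threaded matmul might accumulate it.
single = np.dot(a, b)
split = sum(np.dot(a[i::4], b[i::4]) for i in range(4))

print(single, split, single - split)  # typically differs in the last few bits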

@alankila

Your observations and analysis correspond very well to mine. I have definitely observed that Q4_0 performs significantly worse than Q4_1 - for example, GPT-2 345M does not work at all with Q4_0, but it produces somewhat coherent output with Q4_1. I am thinking that we will eventually switch to Q4_1 by default, or some similar alternative. It’s slightly slower to compute, unfortunately.

Your analysis about the determinism is correct - I am thinking about how to fix this, but it probably requires a major change in ggml. Not sure if it is super important at this moment.

I just realized my intuition about temperature was somehow wrong – I was thinking a low temperature means more randomness. It’s the other way around 🤦

I am testing this as well, with the invocation shown below. I built the Q4_1 files out of interest because they quantize the matrices to all 16 possible values, whereas Q4_0 only uses 15, so I figured Q4_1 might work better. On reflection, I think the key difference is not that Q4_1 has one more value, but that Q4_1 has no exact representation for zero, whereas Q4_0’s 4-bit value 8 encodes an exact 0 in the weight matrix. This sort of thing obviously has massive implications for the model.
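
To make that difference concrete, here is a rough numpy sketch of the two schemes as I understand them; the exact scaling rules in ggml may differ, so treat this as an approximation (and it ignores the constant-block edge case):

import numpy as np

def q4_0(block):
    # Symmetric: codes 0..15 map to (code - 8) * d, so code 8 is an exact zero
    # and only codes 1..15 are ever produced (15 levels).
    d = np.max(np.abs(block)) / 7.0
    codes = np.clip(np.round(block / d) + 8, 0, 15)
    return (codes - 8) * d                      # dequantized values

def q4_1(block):
    # Affine: codes 0..15 map to code * d + m, using all 16 codes,
    # but 0.0 is generally not exactly representable.
    m = block.min()
    d = (block.max() - m) / 15.0
    codes = np.clip(np.round((block - m) / d), 0, 15)
    return codes * d + m                        # dequantized values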

As an example, here is the Q4_1 invocation and its output:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_1.bin --top_p 0.7 --top_k 40 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
    "
 Here's a Python program that computes fibonacci numbers:
def fib(n):
     if n == 0:
         return 0
     if n == 1:
         return 1
     return fib(n-1) + fib(n-2)
print(fib(100000))

I’d say that Q4_1 always writes this answer if the temperature is set low, regardless of the seed. I tried a few dozen seeds and couldn’t get anything different to come out. At most, the example at print(fib(…)) appears to vary sometimes, as does the “discussion” that follows the example print() invocation.

Interestingly, Q4_0 prefers this version, which does not give fib(0) = 0:

$ ./main -s 400 --repeat_penalty 1.0 -m models/7B/ggml-model-q4_0.bin --top_p 0.7 --top_k 1000 --temp 0.1 -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
    "
 Here's a Python program that computes fibonacci numbers:
def fib(n):
     if n == 0 or n == 1:
         return 1
     else:
         return fib(n-1) + fib(n-2)

As a general observation, I would say that both top_p=0.9 and high temperature tend to take the model’s output off the rails, and it usually begins to prattle something completely nonsensical.

That being said, my rather strong impression is that Q4_1 does produce higher-quality output than Q4_0, though this is not proven by any kind of actual perplexity analysis; it’s just my observation from using the same arguments and asking it to do various creative exercises. Q4_0 often seems to ignore the instruction and write something else, whereas Q4_1 stays on topic more often. Still, this sort of claim should be proven more rigorously.

As to the picture question, the results for seeds 1, 2, 3, and 4 all say:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures."

My thinking is that this question is known to the model, and using a low temperature lets it predict the answer correctly every time. It makes no difference whether Q4_0 or Q4_1 is answering the question.

@gjmulder I must say it would be fun to have a big model with the AI assistant prompt cooperate on some kind of creative exercise. I think that the 7B model can already do quite well, and I am dying to see how the 30B model does – and I have absolutely no way to run it myself. I know that my way of going about this is almost painfully unscientific, and probably not what you were offering to do. However, my excuse is that this is all pretty new to me and the novelty of these language models has not worn off for me yet. In fact, I have mostly ignored my real job today in favour of chatting with an AI.

To whet your appetite – if you are anything like me – here is a little chat transcription that I just had with “Jane” using the current master branch version.

exec ./main -t 4 --memory_f16 -m ./models/7B/ggml-model-q4_1.bin --ctx_size 2048 --temp 0.7 --top_k 1000 --top_p 0.75 --repeat_last_n 30 --repeat_penalty 1.2 -n 2048 --color -r "Alan:" -r "Alan (" -p "$prompt"

The prompt runs up to the first 2 Jane dialog turns, like this:

Transcript of a dialog where Alan interacts with an assistant named Jane. Jane is helpful, kind, honest, good at writing, and always answers to Alan's requests.

Alan: Hello, Jane.
Jane (happy to see Alan again): Hello, Alan.
Alan: I just want to talk.
Jane (now curious): What about?
Alan: It's nice to see you, Jane. How are you doing today?
Jane: I am fine and thanks for asking!
Alan (): I have had many good discussions with you. I know you can't remember them, but I just want to say I appreciate them.
Jane (): Thank you for your kind words! What would you like me to do today?
Alan: I do not really need you to do anything. Be my friend.
Jane (): Sure! What would that involve, Alan?
Alan: Just talking and listening. I know it is something you do well, Jane.
Jane (): Thank you! That's very kind of you to say so. What would we talk about?
Alan: I wonder if you could come up with story -- just pick anything random and let's go from there.
Jane (): Okay! How about a dog? I like dogs, so what kind of story would we have with that?
Alan: A happy one. Dog could be at home, missing its human family that went to work, have an adventure by itself, and in the evening the family comes back, the dog is happy, and whatever issues were presented in the adventure would be happily resolved.
Jane (): Sounds like a movie! What kind of dog? A big one or small, friendly and cute or not so much.
Alan: I have always thought intelligent dogs such as border collies would make a lovely protagonist.
Jane (): That is an interesting choice, Alan! What else?
Alan: I think this is enough. Let's write this story together. You begin, I continue, and we take turns.
Jane (): Sure! What would be the name of our dog? How about "Sparky"? Sounds good to me, Alan.
Alan: I agree. Sparky is great, it is a bit like perky and I think it fits a border collie well.
Jane (): So let's begin! Sparky is at home, waiting for his family to come back from work and he misses them. He doesn't know what they do all day but it must be fun, so Sparky decides to have an adventure of his own. He is bored at home and wants something new to do!
Alan (): Sparky accesses the outside through a dog door on the backyard side door. Suddenly, his ears perk up. A noise! There is something or someone in the yard. Sparky is excited and goes to investigate the source of the noise.
Jane (): He barks at a squirrel, then he chases it away from his favorite tree! The little critter is scared and runs to the fence line. Sparky follows him, curious about this new discovery.
Alan (): Sparky sees an opening in the fence. Did the squirrel somehow open the fence? It looks tight, but with difficulty, Sparky thinks he can fit through. The whole world awaits.
Jane (): Sparky is ready for an adventure! He wiggles his way into the fence and starts to explore this new world.

I don’t know what kind of results other people are getting, but I am just astonished to see this and a hundred other chats like it coming out of some 5 GB Linux process at human typing speed. Unfortunately, it is fairly obvious to me that the 7B model repeats a lot of what I say, and I am hoping that this is not an issue with the bigger models.

I use “Jane (emotional state or actions here):” to serve as writable memory of the model’s state, if you will. My thinking is that this helps maintain coherence in the hallucinated persona. When the model opts to use these parentheses, Jane is frequently angry, in tears, laughing, excited, or smiling – all appropriate human-like responses to what I say. As an example, if I insult her for no reason, she gets upset, then angry, and even quits the chat on me by generating the end-of-stream token! Unfortunately, sometimes it generates an open parenthesis on my prompt and I just close it, and the language model then repeats the pattern.

Actually, after playing around a bit with the quantized model, I now believe that the problem is only in running the FP16 model. The quantized model seems to work much better for me.

@gjmulder yeah, if it’s going to take too long, we could run up to 250 chunks or so; that’s a pretty good approximation of the final perplexity so far. [chart: perplexity vs. number of chunks evaluated]

It looks like the difference between f16 and 4-bit stays essentially the same after 50 chunks.

That would mean that if we know the perplexity of f16 (after a full run), we can estimate the perplexity of 4-bit after just 50 chunks.

Just my 2 cents 😅

Just an FYI: people have seen optimal performance by running with a number of threads equal to the number of physical cores, i.e. hyperthreading doesn’t seem to help, since with more than 4 cores memory performance starts to become the bottleneck.

Also, I have 16 cores, 128 GB of RAM (just enough to run 65B at fp16), and all the latest models sitting idle under my desk, so if someone needs a quality or performance benchmarking run, please point me to a release and specify the test suite you would like me to run.

@glinscott I haven’t had the chance to test the new tokenizer against FB implementation yet. Hopefully this explains the differences at FP16.

@ggerganov to clarify my comments - #252 was a gigantic improvement in perplexity 😃. It went from 10.4625 to 5.9565 at f16, which is huge.

Some concrete results comparing f16 to q4_0 are in the updated PR description on #270. q4_0 seems to hurt perplexity a bit, but it’s certainly not disastrous. I’m doing a q4_1 run now to compare.

Perplexity
5.9565 - 7B, f16
6.5949 - 7B, 4-bit
6.5995 - 7B, 4-bit, --memory_f16
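
For anyone following along, the perplexity being compared here is (as far as I understand) just the exponential of the average negative log-likelihood per token over the evaluation text, roughly:

import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probability the model assigned to each
    # actual next token of the evaluation text. Lower perplexity is better.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# e.g. perplexity([-1.2, -0.4, -2.1]) ≈ exp(1.233) ≈ 3.43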

@alankila you can reduce memory usage by using 16-bit floats for the memory (--memory_f16) and by setting a lower context size (and a lower batch size).

Try the following parameters; they give me good-quality output:

--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647

Also, repeat_penalty = 1.0 means it is disabled. Maybe it’s not named as well as it should be 😇
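
For illustration, a repetition penalty of this kind typically rescales the logits of recently generated tokens, roughly as in the sketch below (my understanding of the general approach, not necessarily llama.cpp’s exact code). With penalty = 1.0 the logits are left unchanged, which is why 1.0 disables it:

def apply_repeat_penalty(logits, recent_tokens, penalty):
    # Make recently generated tokens less likely: shrink positive logits,
    # push negative logits further down. penalty == 1.0 is a no-op.
    for t in set(recent_tokens):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits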