llama.cpp: Quantitative measurement of model perplexity for different models and model quantization modes
llama.cpp seems to give bad results compared to Facebook’s implementation.
Here’s an example simple reading comprehension prompt:
Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: "Paul bought a book
LLaMA 7B with Facebook’s implementation yields:
Seed 1:
Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book with pictures.” Asked by lone wolf 1788 days ago.
Seed 2 (to show that the above is not just a fluke):
Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book with pictures.” Question: "Tom, Mark, and Paul bought books: two with pictures and
While llama.cpp without quantization (so still float16) generates (with --seed 0 -t 8):
Question: “Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?” Answer: “Paul bought a book that is neither as good nor bad.” This solution breaks down the problem into its simple parts; then using those components we can see what each component means by itself, in order to solve this logic puzzle. 1) Tom and Mark had different kinds of books…and so did Paul! (Therefore one out three were without pictures). … [end of text]
It even has a grammatical error at the end: “one out [of] three”
As you can see, the quality of the 7B model is higher with Facebook’s implementation. So I think there may still be bugs in your implementation, or the default parameters could be improved.
About this issue
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 53 (11 by maintainers)
I am unable to run the f16 base model, since the laptops I have available top out at 16 GB of memory and the model seems to need about 14 GB of usable memory. Still, your comment in #270 appears to show that perplexity is improved by Q4_1, which I think is quite evident from the output of the model, especially once the output’s randomness is reduced.
I am not able to explain the higher quality of facebookresearch’s version over this one, which is what started this thread. Looking at their code, they also use a simple top_p sampling strategy, which can be approximated here by setting repeat_penalty=1.0 and top_k to some high value like 10000. But in my experience, quality generally drops below acceptable once temperature exceeds about 0.3. As far as I can tell, this might be due to the damage done by the 4-bit quantization.
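For reference, top_p (nucleus) sampling roughly works like the sketch below; this is just an illustration of the idea, not the actual sampling code from either implementation:

```c
// Rough sketch of top-p (nucleus) sampling: keep the smallest set of the most
// probable tokens whose cumulative probability reaches top_p, then sample
// within that set.
#include <stdlib.h>

typedef struct { int id; float p; } cand_t;

static int cmp_desc(const void *a, const void *b) {
    float pa = ((const cand_t *)a)->p, pb = ((const cand_t *)b)->p;
    return (pa < pb) - (pa > pb);
}

// probs: softmax distribution over n_vocab tokens; r: uniform random in [0, 1)
int sample_top_p(const float *probs, int n_vocab, float top_p, float r) {
    cand_t *cand = malloc(n_vocab * sizeof(cand_t));
    for (int i = 0; i < n_vocab; i++) { cand[i].id = i; cand[i].p = probs[i]; }
    qsort(cand, n_vocab, sizeof(cand_t), cmp_desc);

    // keep the smallest prefix whose cumulative probability reaches top_p
    float cum = 0.0f;
    int n_keep = n_vocab;
    for (int i = 0; i < n_vocab; i++) {
        cum += cand[i].p;
        if (cum >= top_p) { n_keep = i + 1; break; }
    }

    // sample proportionally within the kept prefix (renormalized by cum)
    float x = r * cum;
    int id = cand[n_keep - 1].id;
    for (int i = 0; i < n_keep; i++) {
        x -= cand[i].p;
        if (x <= 0.0f) { id = cand[i].id; break; }
    }
    free(cand);
    return id;
}
```

With repeat_penalty=1.0 and a very large top_k, the sampling here should reduce to roughly this behaviour.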
There is also some nondeterminism in the program. This is likely due to how ggml farms the computation out across threads, probably splitting the matrices by rows or similar, which results in different accumulation orders and therefore different rounding errors, and that appears to slightly affect the results. For precisely repeatable results, it is not enough to use the same output sampling seed; you also need the same input batching and thread count, I think.
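To illustrate the accumulation point with a standalone toy example (this is not ggml code, just a demonstration that float addition is not associative):

```c
// Summing the same values in a different order, as different thread splits
// would, can give a slightly different float result.
#include <stdio.h>

int main(void) {
    const float v[8] = { 1e8f, 1.0f, -1e8f, 1.0f, 1e8f, 1.0f, -1e8f, 1.0f };

    // one accumulator, left to right (like a single thread)
    float s1 = 0.0f;
    for (int i = 0; i < 8; i++) s1 += v[i];

    // two accumulators over interleaved elements (like two threads), combined at the end
    float a = 0.0f, b = 0.0f;
    for (int i = 0; i < 8; i += 2) { a += v[i]; b += v[i + 1]; }
    const float s2 = a + b;

    // prints different sums: the small 1.0f terms are absorbed differently
    printf("serial: %f, split: %f\n", s1, s2);
    return 0;
}
```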
@alankila
Your observations and analysis correspond very well to mine. I have definitely observed that Q4_0 performs significantly worse than Q4_1 - for example, GPT-2 345M does not work at all with Q4_0, but it sort of produces coherent stuff with Q4_1. I am thinking that we will eventually switch to Q4_1 by default, or some similar alternative. Unfortunately, it is slightly slower to compute.
Your analysis about the determinism is correct - I am thinking about how to fix this, but probably a major change in ggml is needed. Not sure if it is super important at this moment.
I just realized my intuition about temperature was somehow wrong – I was thinking low temp means more random. It’s the other way around 🤦
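For anyone with the same confusion: the sampler divides the logits by the temperature before the softmax, so a low temperature sharpens the distribution toward the top token, while a high temperature flattens it and makes the output more random. A minimal sketch of the idea (not the actual sampling code):

```c
// Illustrative sketch: temperature divides the logits before the softmax,
// so temp < 1 sharpens the distribution and temp > 1 flattens it.
#include <math.h>
#include <stdio.h>

static void softmax_with_temp(const float *logits, float *probs, int n, float temp) {
    // subtract the max logit for numerical stability
    float maxl = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > maxl) maxl = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf((logits[i] - maxl) / temp);
        sum += probs[i];
    }
    for (int i = 0; i < n; i++) probs[i] /= sum;
}

int main(void) {
    const float logits[3] = { 2.0f, 1.0f, 0.5f };
    float probs[3];

    softmax_with_temp(logits, probs, 3, 0.2f); // low temperature: nearly deterministic
    printf("temp 0.2: %.3f %.3f %.3f\n", probs[0], probs[1], probs[2]);

    softmax_with_temp(logits, probs, 3, 1.5f); // high temperature: flatter, more random
    printf("temp 1.5: %.3f %.3f %.3f\n", probs[0], probs[1], probs[2]);
    return 0;
}
```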
@glinscott I haven’t had the chance to test the new tokenizer against the FB implementation yet. Hopefully this explains the differences at FP16.
I am testing this as well. I have the following invocation. I built the Q4_1 files out of interest because they quantize the matrices to all 16 different values, whereas Q4_0 only uses 15 possible quantizations, so I figured it might work better. I think the key difference is not that Q4_1 has more values, but that Q4_1 has no representation for zero, whereas Q4_0’s 4-bit value 8 encodes an exact 0 in the weight matrix. This sort of thing obviously has massive implications for the model.
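To spell out the difference as I understand it (block size and field names below are assumptions about the ggml formats, not the exact struct layout):

```c
// Sketch of how each 4-bit format reconstructs a weight from its block data.
#include <stdint.h>

#define QK 32  // weights per quantization block (assumed)

// Q4_0: one scale d per block; the nibble value 8 decodes to exactly 0.0f
static inline float dequant_q4_0(float d, uint8_t q /* 0..15 */) {
    return d * ((int)q - 8);
}

// Q4_1: scale d plus minimum m per block; an exact 0.0f is only representable
// if the block's d and m happen to line up, so zeros are generally approximated
static inline float dequant_q4_1(float d, float m, uint8_t q /* 0..15 */) {
    return d * (int)q + m;
}
```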
As an example:
I’d say that Q4_1 always writes this answer if temperature is set low, regardless of the seed. I tried a few dozen and couldn’t get anything different to come out. At most the example at print(fib(…)) appears to vary sometimes, as does the “discussion” that follows the example print() invocation.
Interestingly, Q4_0 prefers this version, which won’t have fib(0) = 0:
As a general observation, I would say that both top_p=0.9 and high temperature tend to take the model’s output off the rails, and it usually begins to prattle something completely nonsensical.
That being said, my rather strong impression is that Q4_1 does produce higher quality output than Q4_0, though this is not proven by any kind of actual perplexity analysis. It’s just my observation from using the same arguments and asking it to do various creative exercises. Q4_0 often seems to ignore the instruction and writes something else, whereas Q4_1 can be on topic. Still, this sort of claim should be more rigorously proven.
As to the picture question, the results for seeds 1, 2, 3, and 4 all say:
Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures."My thinking is that this question is known to the model and using a low temperature allows predicting the answer correctly every time. It makes no difference whether this is Q4_0 or Q4_1 answering the question.
@gjmulder yeah, if it’s going to take too long, we could run up to 250 chunks or so; that’s a pretty good approximation of the final perplexity so far.
@gjmulder I must say it would be fun to have a big model with the AI assistant prompt cooperate on some kind of creative exercise. I think that the 7B model can already do quite well, and I am dying to see how the 30B model does – and I have absolutely no way to run it myself. I know that my way of going about this is almost painfully unscientific, and probably not what you were offering to do. However, my excuse is that this is all pretty new to me and the novelty of these language models has not worn off for me yet. In fact, I have mostly ignored my real job today in favour of chatting with an AI.
To whet your appetite – if you are anything like me – here is a little chat transcription that I just had with “Jane” using the current master branch version.
The prompt runs up to the first 2 Jane dialog turns, like this:
I don’t know what kind of results other people are getting, but I am just astonished to see this and a hundred other chats like it coming out of some 5 GB Linux process at human typing speed. Unfortunately, it is fairly obvious to me that the 7B model has to repeat a lot of what I say, and I am hoping that this is not an issue with the bigger models.
I use “Jane (emotional state or actions here):” to serve as writeable memory of the model’s state, if you will. My thinking is that this helps maintain coherency in the hallucinated persona. When the model opts to use these parentheses, Jane is frequently angry, in tears, laughing, excited, or smiling – all appropriate human-like responses to what I say. As an example, if I insult her for no reason, she gets upset, then angry, and even quits the chat on me by generating the end-of-stream token! Unfortunately, sometimes it generates an open parenthesis on my prompt and I just close it, and the language model then repeats it.
Actually, after playing around a bit with the quantized model, I now believe that the problem is only in running the FP16 model. The quantized model seems to work much better for me.
It looks like the difference between f16 and 4-bit stays exactly the same after 50 chunks.
That would mean that if we know the perplexity of f16 (after a full run), we can estimate the perplexity of 4-bit after just 50 chunks.
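In other words, the extrapolation being proposed is:

ppl_4bit(full run) ≈ ppl_f16(full run) + [ppl_4bit(50 chunks) − ppl_f16(50 chunks)]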
Just my 2 cents 😅
Just an FYI that people have experienced optimal performance by running with a number of threads equal to the number of cores, i.e. hyperthreading doesn’t seem to help, as with more than 4 cores memory performance starts to become the bottleneck.
Also, I have 16 cores, 128 GB of RAM (just enough to run 65B at fp16), and all the latest models sitting idle under my desk, so if someone needs a quality or performance benchmarking run, please point me to a release and specify the test suite you would like me to run.
@ggerganov to clarify my comments - #252 was a gigantic improvement in perplexity 😃. It went from 10.4625 to 5.9565 using f16, which is huge.
Some concrete results comparing f16 to q4_0 are in the updated PR description on #270. q4_0 seems to hurt perplexity a bit, but it’s certainly not disastrous. I’m doing a q4_1 run now to compare.
@alankila you can reduce memory by using 16-bit floats for memory with --memory_f16 and by setting a lower context (and lower batch size). Try the following parameters, which give me good quality output:
--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647
Also, repeat_penalty = 1.0 means disabled. Maybe it’s not named as it should be 😇