llama.cpp: [User] Embedding doesn't seem to work?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I’m trying to use llama.cpp to generate sentence embeddings, and then use a query to search for answers in a vector database. But my code doesn’t work. Upon further inspection, it seems that the sentence embeddings generated by llama.cpp are not trustworthy. This can be reproduced with the embedding example:
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Notice that the only difference between the two commands above is an extra space in the second prompt, yet they result in completely different embeddings. Since the meaning of the prompts is the same, I would assume the extra space shouldn’t cause the embedding to change very much.
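For reference, one way to quantify “very different” is cosine similarity between the two vectors. A minimal sketch, assuming each run’s embedding has been saved as a file of whitespace-separated floats (the file names here are hypothetical):

```python
import numpy as np

def load_vec(path):
    # one embedding per file, whitespace-separated floats (hypothetical file names)
    return np.loadtxt(path)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = load_vec("hello.txt")        # embedding printed by: ./embedding ... -p "hello"
b = load_vec("hello_space.txt")  # embedding printed by: ./embedding ... -p "hello "
print(f"cosine similarity: {cosine(a, b):.4f}")  # ~1.0 would mean practically identical
```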
Is the embedding function working?
Current Behavior
The current embedding output seems to be random?
Environment and Context
Linux + A100
- Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper PRO 3975WX 32-Cores
Stepping: 0
CPU MHz: 2195.790
CPU max MHz: 4368.1641
CPU min MHz: 2200.0000
BogoMIPS: 6987.21
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
- Operating System, e.g. for Linux:
Linux artserver1 5.19.0-32-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
Python 3.10.9
GNU Make 4.1
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Failure Information (for bugs)
The embedding output can be altered by adding a space in the prompt.
Steps to Reproduce
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Build the project, run the official embedding example as above, and compare the generated embeddings.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 56 (18 by maintainers)
Commits related to this issue
- embedding : print cosine similarity (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
- embedding : print all resulting embeddings (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
- embedding : add EOS token if not present (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
Llama is unidirectional, not bidirectional like BERT, which I think may make the embeddings better, but I’m not sure. I agree that this is a ‘least-bad’ approach; not sure how we could improve it.
I leveraged the script by @nitram147 and switched it to use cosine similarity, and output the results ranked by similarity instead of randomly.
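To illustrate the change (this is not the actual script, just a sketch assuming the embeddings have already been loaded as numpy vectors keyed by prompt text):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query, embeddings):
    """embeddings: dict mapping prompt text -> 1-D numpy embedding vector."""
    q = embeddings[query]
    scores = [(text, cosine(q, vec)) for text, vec in embeddings.items() if text != query]
    # highest cosine similarity first, instead of arbitrary order
    return sorted(scores, key=lambda item: item[1], reverse=True)
```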
I see one-word queries are similar to each other in embedding space, even if the words are not that related. This will definitely be bad for search. Maybe for one-word search it would be better to use word-embedding similarity over the document (with max pooling, or highlighting of sections with high similarity), instead of the full language model.
Then for sentences we could switch to the full llama sentence embedding.
Again, this is a least-bad approach, but it could work better than what we have now for search, if anyone has the time to do it.
Here are the results I got, plus the script (which is a modified version of Nitram’s)
And here are the results. I think especially sentence vs sentence, they make sense. The biggest problem is one-word queries (which I guess are a big portion of all search queries). Maybe a good search would be grep-first, word-embedding second, sentence embedding third? This sounds like the kind of problem where someone smarter than me has already invented solutions though.
I concur
I think the output embedding is associated with the current prediction of the next token.
https://github.com/ggerganov/llama.cpp/blob/fa84c4b3e80199a5683438f062009c031a06c4fa/llama.cpp#LL1655C6-L1655C6
I don’t see these results as particularly unexpected.
A sentence that ends in a ’ ’ is inherently incomplete (it would be missing a word, etc.), so it’s not weird that the model encodes it very differently than a complete one, though this is just my interpretation. As a recommendation, I would advise that any real application using these embeddings strip trailing whitespace off input text, especially if it’s user input.
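A trivial sketch of that normalization step (the helper name is hypothetical):

```python
def normalize_for_embedding(text: str) -> str:
    # strip leading/trailing whitespace before computing embeddings,
    # avoiding exactly the "hello" vs "hello " discrepancy above
    return text.strip()
```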
As for the “I like cats” vs “cats” similarity, I also don’t see it as particularly unexpected that they are not similar, as one is a sentence and the other a single word, and they only share part of the topic. I would be more surprised if two noun clauses (like “hairy feline” and “purring kitten”) that have similar meanings were assigned very different scores.
Basically things that are syntactically dissimilar are understandably not very close in embedding space.
If you test sentences with very similar syntax and somewhat similar semantics and they are not aligned at all, that would worry me more.
I hope this clarifies things! Anyone who knows more please chime in too.
I ran more tests using cosine similarity, so that it would be easier to compare to the initial tests.
Some results are as expected:
However some similarities are way off:
@StrikingLoo @ggerganov any intuition why the current embedding calculation logic could be behaving this way?
@akarshanbiswas, the server needs to be started with the `--embedding` option; since it adds some overhead to processing, it is disabled by default.
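For reference, querying the server once it is started that way looks roughly like this. This is only a sketch; the exact endpoint path and JSON field names may differ between versions, so check the server README:

```python
import requests

# Assumes the server was started with something like:
#   ./server -m models/7B/ggml-model-q4_0.bin --embedding
# Endpoint path and field names are assumptions and may differ by version.
resp = requests.post("http://localhost:8080/embedding", json={"content": "hello"})
resp.raise_for_status()
print(resp.json())  # expected to include an "embedding" array of floats
```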
Are you using a non-llama model for generating embeddings and doing the search, or did you find a way to do it with Llama?
With the latest `master` you can use the `embedding` tool to compute cosine similarities of different prompts:
I did some experiments on this embedding the other day and tested averaging the vectors.
How: change the embedding vector to be `[n_embd * n_ctx]` in size, and from the llama.h API return the average embedding of the contexts evaluated so far.
It seemed to do a little better in some of the document retrieval tasks. There is still the issue that it is kind of slow, even with GPU acceleration, to process a lot of text. Maybe not all the layers are necessary to process?
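To make the averaging idea concrete, here is a minimal numpy sketch. It assumes you already have one embedding per evaluated token as an `[n_tokens, n_embd]` array, which is not what the stock example exposes:

```python
import numpy as np

def mean_pooled_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """token_embeddings: [n_tokens, n_embd] per-token embeddings (assumed available).

    Returns a single [n_embd] sentence embedding by averaging over tokens,
    instead of using only the last token's embedding.
    """
    return token_embeddings.mean(axis=0)
```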
I think maybe LLaMA is not the right model for this task, some kind of encoder-decoder model could be better.
(As an aside, for indexing a large base of documents I would definitely welcome a webserver-like mode that would load the model once and then accept requests with documents, returning the embeddings - currently each run loads the model again.)
It’s a placeholder string - you can override it by passing `"model"` in the POST data.
You can try using `server --embedding`, but I think we still have some problems with the tokenization around special tokens, so the results might not be correct atm. We’ll fix this soon.
Embeddings can be of many different sizes depending on the model, but the code only allows up to 16 values to be shown. You can change this if you want to see more by altering line 174 or thereabouts: just change the `16` to however many values you want to see (but `mistral-7b-instruct-v0.2.Q8_0.gguf`, for example, uses `4096`, so it’s not a very easy set of numbers to use; they’re really just illustrative rather than directly useful).
Beautiful 👌
cosine similarity matrix:
1.00 0.99 0.74 0.75
0.99 1.00 0.74 0.75
0.74 0.74 1.00 0.99
0.75 0.75 0.99 1.00
tokenizers man, they do my head in
`llama.cpp` also bypasses the `lm_head` when embeddings are computed. Not sure - can you check how the reference implementation tokenizes the strings?
I mean to run it just for `["hello", "hello ", "jimmy", "jimmy "]` and report only the similarity numbers. The rest is irrelevant.
Very nice! Since we’re showing all the cosine similarities, I’m not sure why we are only showing the first 3 embeddings. It would seem to make sense to change line 172 to show as many embeddings as there are prompts, since in this illustrative example there are not likely to be many (or set the minimum well above 3):
More of the interesting discussions (from BERT)
https://github.com/JohnSnowLabs/spark-nlp/issues/684#issuecomment-557897665
@StrikingLoo Not sure actually - I have been using LLMs as ‘magical black boxes’ so far, and am reading up on the basics. The word embeddings are definitely problematic, as the Google researchers replied (for the BERT embeddings):
and also
https://github.com/google-research/bert/issues/164#issuecomment-441324222
I am beginning to lean toward the idea that what llama does now is actually the ‘least-bad’ option out of the easily available ones, and there still seems to be active research going on about how best to semantically embed sentences or documents…
I tried reading up on the basics of transformers at https://www.baeldung.com/cs/transformer-text-embeddings and near the end they say:
The last link says that “The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.” The authors present a different network structure, that can actually generate sentence embeddings.
So, it would seem that the keyword to google is “sentence embedding with LLM”.
Googling that, there is a Stack Exchange question noting that OpenAI embeddings don’t seem to work well for short inputs either: https://datascience.stackexchange.com/questions/120422/text-embeddings-for-words-or-very-short-sentences-with-a-llm
Instead of 7B, have you tried with a bigger llama model?
I’m not even sure what the embedding vector that llama.h gives you is supposed to be; I think it may represent the next generated token more than anything, because it’s extracted at the end.
In reality, “hello” and “hello ” are different phrases. However, these two phrases should be closer to each other than to other phrases. I’ve made two scripts for testing the embedding behaviour, namely:
get_embeddings.sh:
And compare_embeddings.py:
To my surprise, for short phrases the “phrases with similar meaning should be closer to each other” premise does not hold.
See:
Extract embeddings for a few short phrases:
Obtain results:
Results:
Unfortunately, I don’t have any more time at the moment. But if you have, try to extract embeddings for more complicated phrases and post the results here 😃
It seems embedding.cpp returns the output embeddings.