llama.cpp: [User] Embedding doesn't seem to work?
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I’m trying to use llama.cpp to generate sentence embeddings, and then use a query to search for answers in a vector database. But my code doesn’t work. Upon further inspection, it seems that the sentence embeddings generated by llama.cpp are not trustworthy. This can be reproduced with the embedding example:
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Notice that the only difference between the two commands above is an extra space in the second prompt, yet they result in completely different embeddings. Since the meaning of the prompts is the same, I would assume the extra space shouldn’t cause the embedding to change very much.
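For reference, one way to quantify “very different” is cosine similarity between the two vectors. A minimal sketch, assuming each run’s embedding has been saved as a file of whitespace-separated floats (the file names here are hypothetical):

```python
import numpy as np

def load_vec(path):
    # one embedding per file, whitespace-separated floats (hypothetical file names)
    return np.loadtxt(path)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = load_vec("hello.txt")        # embedding printed by: ./embedding ... -p "hello"
b = load_vec("hello_space.txt")  # embedding printed by: ./embedding ... -p "hello "
print(f"cosine similarity: {cosine(a, b):.4f}")  # ~1.0 would mean practically identical
```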
Is the embedding function working?
Current Behavior
The current embedding output seems to be random?
Environment and Context
Linux + A100
- Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper PRO 3975WX 32-Cores
Stepping: 0
CPU MHz: 2195.790
CPU max MHz: 4368.1641
CPU min MHz: 2200.0000
BogoMIPS: 6987.21
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
- Operating System, e.g. for Linux:
Linux artserver1 5.19.0-32-generic #33~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 30 17:03:34 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
Python 3.10.9
GNU Make 4.1
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Failure Information (for bugs)
The embedding output can be altered by adding a space in the prompt.
Steps to Reproduce
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello" -n 512
./embedding -m models/7B/ggml-model-q4_0.bin -p "hello " -n 512
Build the project, run the official embedding example as above, and compare the generated embeddings.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 2
- Comments: 56 (18 by maintainers)
Commits related to this issue
- embedding : print cosine similarity (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
- embedding : print all resulting embeddings (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
- embedding : add EOS token if not present (#899) — committed to ggerganov/llama.cpp by ggerganov 4 months ago
Llama is unidirectional, not bidirectional like BERT, which I think may make the embeddings better, but I’m not sure. I agree that this is a ‘least-bad’ approach; not sure how we could improve it.
I leveraged the script by @nitram147 and switched it to use cosine similarity, and output the results ranked by similarity instead of randomly.
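To illustrate the change (this is not the actual script, just a sketch assuming the embeddings have already been loaded as numpy vectors keyed by prompt text):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query, embeddings):
    """embeddings: dict mapping prompt text -> 1-D numpy embedding vector."""
    q = embeddings[query]
    scores = [(text, cosine(q, vec)) for text, vec in embeddings.items() if text != query]
    # highest cosine similarity first, instead of arbitrary order
    return sorted(scores, key=lambda item: item[1], reverse=True)
```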
I see one-word queries are similar to each other in embedding space, even if the words are not that related. This will definitely be bad for search. Maybe for one-word search it would be better to use word-embedding similarity over the document (with max pooling, or highlighting of sections with high similarity), instead of the full language model.
Then for sentences we could switch to the full llama sentence embedding.
Again, this is a least-bad approach, but it could work better than what we have now for search, if anyone has the time to do it.
Here are the results I got, plus the script (which is a modified version of Nitram’s)
And here are the results. I think especially sentence vs sentence, they make sense. The biggest problem is one-word queries (which I guess are a big portion of all search queries). Maybe a good search would be grep-first, word-embedding second, sentence embedding third? This sounds like the kind of problem where someone smarter than me has already invented solutions though.
I concur
I think the output embedding is associated with the current prediction of the next token.
https://github.com/ggerganov/llama.cpp/blob/fa84c4b3e80199a5683438f062009c031a06c4fa/llama.cpp#LL1655C6-L1655C6
I don’t see these results as particularly unexpected.
A sentence that ends in a ’ ’ is inherently incomplete (it would be missing a word, etc.), so it’s not weird that the model encodes it very differently than a complete one, though this is just my interpretation. As a recommendation, I would advise that any real application using these embeddings strip trailing whitespace off input text, especially if it’s user input.
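A trivial sketch of that normalization step (the helper name is hypothetical):

```python
def normalize_for_embedding(text: str) -> str:
    # strip leading/trailing whitespace before computing embeddings,
    # avoiding exactly the "hello" vs "hello " discrepancy above
    return text.strip()
```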
As for the “I like cats” vs “cats” similarity, I also don’t see it as particularly unexpected that they are not similar, as one is a sentence and the other a single word, and they only share part of the topic. I would be more surprised if two noun clauses (like “hairy feline” and “purring kitten”) that have similar meanings were assigned very different scores.
Basically things that are syntactically dissimilar are understandably not very close in embedding space.
If you test sentences with very similar syntax and somewhat similar semantics and they are not aligned at all, that would worry me more.
I hope this clarifies things! Anyone who knows more please chime in too.
I ran more tests using cosine similarity, so that it would be easier to compare to the initial tests.
Some results are as expected:
However some similarities are way off:
@StrikingLoo @ggerganov any intuition why the current embedding calculation logic could be behaving this way?
@akarshanbiswas, the server needs to be started with the `--embedding` option; since it adds some overhead to processing, it is disabled by default.
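For reference, querying the server once it is started that way looks roughly like this. This is only a sketch; the exact endpoint path and JSON field names may differ between versions, so check the server README:

```python
import requests

# Assumes the server was started with something like:
#   ./server -m models/7B/ggml-model-q4_0.bin --embedding
# Endpoint path and field names are assumptions and may differ by version.
resp = requests.post("http://localhost:8080/embedding", json={"content": "hello"})
resp.raise_for_status()
print(resp.json())  # expected to include an "embedding" array of floats
```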
Are you using a non-llama model for generating embeddings and doing the search, or did you find a way to do it with Llama?
With the latest `master` you can use the `embedding` tool to compute cosine similarities of different prompts:
I did some experiments on this embedding the other day and tested averaging the vectors.
How: change the embedding vector to be `[n_embd * n_ctx]` in size, and from the llama.h API return the average embedding of the contexts evaluated so far.
It seemed to do a little better in some of the document retrieval tasks. There is still the issue that it is kind of slow, even with GPU acceleration, to process a lot of text. Maybe not all the layers are necessary to process?
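To make the averaging idea concrete, here is a minimal numpy sketch. It assumes you already have one embedding per evaluated token as an `[n_tokens, n_embd]` array, which is not what the stock example exposes:

```python
import numpy as np

def mean_pooled_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """token_embeddings: [n_tokens, n_embd] per-token embeddings (assumed available).

    Returns a single [n_embd] sentence embedding by averaging over tokens,
    instead of using only the last token's embedding.
    """
    return token_embeddings.mean(axis=0)
```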
I think maybe LLaMA is not the right model for this task, some kind of encoder-decoder model could be better.
(As an aside, for indexing a large base of documents I would definitely welcome a webserver-like mode that would load the model once and then accept requests with documents, returning the embeddings - currently each run loads the model again.)
It’s a placeholder string - you can override it by passing `"model"` in the POST data.
You can try using `server --embedding`, but I think we still have some problems with the tokenization around special tokens, so the results might not be correct atm. We’ll fix this soon.
Embeddings can be of many different sizes depending on the model, but the code only allows up to 16 values to be shown. You can change this if you want to see more by altering line 174 or thereabouts: just change the `16` to however many values you want to see (but `mistral-7b-instruct-v0.2.Q8_0.gguf`, for example, uses `4096`, so it’s not a very easy set of numbers to use; they’re really just illustrative rather than directly useful).
Beautiful 👌
cosine similarity matrix:
1.00 0.99 0.74 0.75
0.99 1.00 0.74 0.75
0.74 0.74 1.00 0.99
0.75 0.75 0.99 1.00
tokenizers man, they do my head in
`llama.cpp` also bypasses the `lm_head` when embeddings are computed. Not sure - can you check how the reference implementation tokenizes the strings?
I mean to run it just for `["hello", "hello ", "jimmy", "jimmy "]` and report only the similarity numbers. The rest is irrelevant.
Very nice! Since we’re showing all the cosine similarities, I’m not sure why we are only showing the first 3 embeddings. It would seem to make sense to change line 172 to show as many embeddings as there are prompts, since in this illustrative example there are not likely to be many (or set the minimum well above 3):
More of the interesting discussions (from BERT)
https://github.com/JohnSnowLabs/spark-nlp/issues/684#issuecomment-557897665
@StrikingLoo Not sure actually - I have been using LLMs as ‘magical black boxes’ so far, and am reading up on the basics. The word embeddings are definitely problematic, as the Google researchers replied (for the BERT embeddings):
and also
https://github.com/google-research/bert/issues/164#issuecomment-441324222
I am beginning to lean toward the idea that what llama does now is actually the ‘least-bad’ option out of the easily available ones, and there still seems to be active research going on about how best to semantically embed sentences or documents…
I tried reading up on the basics of transformers at https://www.baeldung.com/cs/transformer-text-embeddings and near the end they say:
The last link says that “The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.” The authors present a different network structure, that can actually generate sentence embeddings.
So, it would seem that the keyword to google is “sentence embedding with LLM”.
Googling that, there is a Stack Exchange question noting that OpenAI embeddings don’t seem to work well for short inputs either: https://datascience.stackexchange.com/questions/120422/text-embeddings-for-words-or-very-short-sentences-with-a-llm
Instead of 7B, have you tried with a bigger llama model?
I’m not even sure what the embedding vector that llama.h gives you is supposed to be; I think it may represent the next generated token more than anything, because it’s extracted at the end.
In reality, “hello” and “hello ” are different phrases. However, these two phrases should be closer to each other than to other phrases. I’ve made two scripts for testing the embedding behaviour, namely:
get_embeddings.sh:
And compare_embeddings.py:
To my surprise, for short phrases the “phrases with similar meaning should be closer to each other” premise does not hold.
See:
Extract embeddings for a few short phrases:
Obtain results:
Results:
Unfortunately, I don’t have any more time at the moment. But if you have, try to extract embeddings for more complicated phrases and post the results here 😃
It seems embedding.cpp returns the output embeddings.