LocalAI: LLama backend embeddings not working

Discussed in https://github.com/mudler/LocalAI/discussions/1615

Originally posted by fdewes on January 20, 2024:

Hi everyone,

First of all, thank you for this great piece of software 😃

I am trying to create embeddings with GGUF models such as Phi or Mistral, but all my attempts to serve them via LocalAI with the llama backend fail. I am using the localai/localai:v2.5.1-cublas-cuda12 Docker image.

Here is the HTTP error response:

InternalServerError: Error code: 500 - {'error': {'code': 500, 'message': 'rpc error: code = Unimplemented desc = ', 'type': ''}}
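
For reference, this is roughly the call that triggers it (the base URL and API key are placeholders; the model name and input string match the debug output below):

# Minimal reproduction sketch, assuming LocalAI is listening on localhost:8080 and
# using placeholder credentials; model name and input are taken from the debug log below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")
resp = client.embeddings.create(
    model="phi-embeddings",
    input="Please create python script for doing basic data science on entire pandas df",
)
print(len(resp.data[0].embedding))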

This is the debug output from the LocalAI Docker image:

11:25AM DBG Request received: 
11:25AM DBG Parameter Config: &{PredictionOptions:{Model:phi-2.Q8_0.gguf Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:phi-embeddings F16:false Threads:4 Debug:true Roles:map[] Embeddings:true Backend:llama TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[Please create python script for doing basic data science on entire pandas df] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
11:25AM INF Loading model 'phi-2.Q8_0.gguf' with backend llama
11:25AM DBG llama-cpp is an alias of llama-cpp
11:25AM DBG Model already loaded in memory: phi-2.Q8_0.gguf
[10.10.1.20]:62886 500 - POST /embeddings

This is the configuration file for the model.

https://raw.githubusercontent.com/fdewes/model_gallery/main/phi_embeddings.yaml

name: "phi-embeddings"
license: "Apache 2.0"
urls:
- https://huggingface.co/TheBloke/phi-2-GGUF
description: |
    Phi model that can be used for embeddings
config_file: |
    parameters:
      model: phi-2.Q8_0.gguf
    backend: llama
    embeddings: true
files:
- filename: "phi-2.Q8_0.gguf"
  sha256: "26a44c5a2bc22f33a1271cdf1accb689028141a6cb12e97671740a9803d23c63"
  uri: "https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q8_0.gguf"

Am I doing anything wrong, or is this a bug? With the same models, I can create embeddings locally using the llama-cpp-python bindings without problems. Any help solving this would be greatly appreciated.
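
For comparison, a minimal sketch of that local llama-cpp-python path (the exact model path is a placeholder):

# Rough sketch of embedding the same GGUF file directly via llama-cpp-python
# (the model path is a placeholder).
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-2.Q8_0.gguf", embedding=True)
result = llm.create_embedding("Please create python script for doing basic data science on entire pandas df")
print(len(result["data"][0]["embedding"]))  # dimensionality of the returned embedding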

About this issue

  • State: open
  • Created 5 months ago
  • Reactions: 1
  • Comments: 15

Most upvoted comments

It doesn’t download them on startup; it downloads them the first time it’s invoked for a specific model of interest.

Also, the setup in your information there (mounting to models) doesn’t download the embedding models into that directory. I’m a bit unsure why and am still looking into that myself. Two -v mounts would be needed.

Gotcha, yeah, I’m not sure about that document. I can give you some code below showing how I handled it:

# Assumes `file`, `file_base`, and `env` (e.g. env = os.environ) are already defined.
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the file and split it into chunks
file_contents = Document(page_content=open(file, "r").read(), metadata={"file": file_base})
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
splits = text_splitter.split_documents([file_contents])

# Embed the chunks (OpenAI-compatible endpoint, here pointed at LocalAI) and save to disk
embeddings = OpenAIEmbeddings(openai_api_base=env['OPENAI_API_BASE'], openai_api_key=env['OPENAI_API_KEY'], deployment="sentence-t5-large")
Chroma.from_documents(splits, embeddings, persist_directory="no_git/chroma_chats")

The above uses LangChain to handle the embeddings. The more important part is the embedding model, “sentence-t5-large” (https://huggingface.co/sentence-transformers/sentence-t5-large), which is what does the embedding. My accuracy isn’t quite where I want it to be in that vector space, but it does call the embeddings endpoint and save to the database.
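
For completeness, a rough sketch of how that persisted store can be read back and queried, reusing the `embeddings` object and `Chroma` import from above (the query string is just a placeholder):

# Reload the persisted Chroma store and run a similarity search (query text is a placeholder).
db = Chroma(persist_directory="no_git/chroma_chats", embedding_function=embeddings)
results = db.similarity_search("How do I summarize a pandas DataFrame?", k=4)
for doc in results:
    print(doc.metadata["file"], doc.page_content[:80])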

I’ve run into similar situations using models that weren’t really good with embeddings. I’m checking https://huggingface.co/models?other=embeddings&sort=trending and https://huggingface.co/spaces/mteb/leaderboard

Are you sure that this model can do embeddings? I’m not seeing anything on either of the links above, on TheBloke’s page, on the Microsoft Phi page, or in any Google searches.