chroma: [Bug]: querying a persisted index returns an empty array

What happened?

The following example uses langchain to successfully load documents into chroma and to successfully persist the data. Querying works as expected.

However, when we restart the notebook and attempt to query again — this time reading from the persisted directory instead of ingesting — we get [] when querying both through the langchain wrapper’s method and through chromadb’s client (accessed via the langchain wrapper). I’ve concluded that either there is a deep bug in chromadb or I am doing something wrong. Please help!

Code to reproduce

In the notebook paste:

from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import json

assembly_ai_output_file = "data_auto_chapters.json"

# Read the JSON file
with open(assembly_ai_output_file, "r") as json_file:
    json_data = json.load(json_file)

utterances = json_data['utterances']
# Drop the bulky "words" key from each utterance before building documents
data = [{key: value for key, value in u.items() if key != "words"} for u in utterances]
transcription_id = json_data['id']
docs = [
    Document(
        page_content=x['text'],
        metadata={
            "transcription_id": transcription_id,
            "confidence": x['confidence'],
            "start": x['start'],
            "end": x['end'],
            "speaker": x['speaker'],
        },
    )
    for x in data
]

embedding = OpenAIEmbeddings()

persist_directory = 'db'
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory, collection_name="condense_demo")
vectordb.persist()

query = "what does the speaker say about raytheon?"

retrieved_docs = vectordb.similarity_search(query) # filter={"speaker": "B"}
retrieved_docs # data comes back

Now let’s restart the notebook and run the following code, which I believe should work.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embedding = OpenAIEmbeddings()

vectordb = Chroma(persist_directory="db", embedding_function=embedding, collection_name="condense_demo")
query = "what does the speaker say about raytheon?"

retrieved_docs = vectordb.similarity_search(query) # filter={"speaker": "B"}
retrieved_docs # returns []

To make sure it’s not a langchain issue, let’s get the chroma client from the Chroma class and query the data at a lower level.

vectordb._client.list_collections() # looks good
vectordb._client.get_collection("condense_demo").peek() # yes there is data there!
query_vector = embedding.embed_query(query) # use langchain's implementation of the OpenAI embedding
vectordb._client.get_collection("condense_demo").query(query_embeddings=[query_vector], n_results=4)
# returns {'ids': [[]],
# 'embeddings': None,
# 'documents': [[]],
# 'metadatas': [[]],
# 'distances': [[]]}

Am I doing something wrong? To make this easier to reproduce, I am attaching the JSON that feeds the script as a zip file (GitHub doesn’t allow .json attachments), but I believe you can reproduce this with any document.

data_auto_chapters.json.zip

Versions

  • Python 3.8.8
  • langchain 0.0.184
  • chroma 0.3.25

Relevant log output

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 28 (11 by maintainers)

Most upvoted comments

I suspect this bug is due to langchain implicitly creating many different chroma clients under the hood, which will not work. The chroma client should be treated as a singleton, and we should separately add some validation for this.

What is happening is that these separate clients are stomping on each other’s persisted data in their atexit hooks.
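The "client as a singleton" rule above can be illustrated with a small pure-Python sketch: cache one client per persist directory so repeated constructions reuse it, rather than creating rival clients whose exit hooks overwrite each other’s files. Here `get_client` and `make_client` are hypothetical helpers, not chromadb API — `make_client` stands in for something like `chromadb.Client(Settings(persist_directory=...))`.

```python
# Cache of one client per persist directory.
_clients = {}

def get_client(persist_directory, make_client):
    """Return the single client for this directory, creating it on first use.

    make_client is a factory callable (hypothetical), e.g. a wrapper around
    chromadb.Client(Settings(persist_directory=persist_directory)).
    """
    if persist_directory not in _clients:
        _clients[persist_directory] = make_client(persist_directory)
    return _clients[persist_directory]
```

With this in place, every caller asking for the `"db"` directory gets the same object back, so only one set of exit hooks ever touches the persisted files.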

Langchain supports passing in a client OR the settings/persist directory. Can people experiencing this issue instead pass the client into Chroma.from_documents and similar functions? This should resolve your issue. Thanks!

Apologies for the confusion.

@sbslee we are releasing an updated version of chroma next wednesday that will persist on every write and will totally eliminate this weird runtime behavior. stay tuned!

Hello team. Any updates on this? Another langchain user with the same issue. Thanks for your hard work!

I will noodle on this more and propose a solution. Happy to help contribute on this front; I am partially familiar with the Chroma implementation in langchain thanks to my work on Auto Evaluator: https://autoevaluator.langchain.com/