langchain: "NoIndexException: Index not found when initializing Chroma from a persisted directory"

I am running into a problem using the Chroma vector store with a persisted index. I have already loaded a document, created embeddings for it, and saved those embeddings in Chroma. The script ran perfectly with the LLM and created the expected files in the persistence directory (.chroma\index):

chroma-collections.parquet
chroma-embeddings.parquet
id_to_uuid_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
index_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.bin
index_metadata_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
uuid_to_id_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl

However, when I try to initialize the Chroma instance with persist_directory pointing at the previously saved embeddings, I get a NoIndexException: "Index not found, please create an instance before querying".

Here is a snippet of the code I am using in a Jupyter notebook:

# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Section 2 - Initialize Chroma without an embedding function
persist_directory = '.chroma\\index'
db = Chroma(persist_directory=persist_directory)

# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.5, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

# Section 4
# Run the chain on a sample query
query = "The Question - Can you also cite the information you give after your answer?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)

Please help me understand what might be causing this problem and suggest possible solutions. Additionally, I am curious if these pre-existing embeddings could be reused without incurring the same cost for generating Ada embeddings again, as the documents I am working with have lots of pages. Thanks in advance!

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 38 (11 by maintainers)

Most upvoted comments

Hi everyone, Jeff from Chroma here. I will be looking into this tomorrow morning and will report back.

I also encountered the NoIndexException: Index not found when initializing Chroma from a persisted directory message when using Chroma with LangChain, and in my case it came down to a silly, simple mistake. I am not sure whether it is related to the original post, but perhaps it will help others avoid the same mistake.

While persisting the Chroma DB I used the following lines:

loader = PyPDFLoader("[MY DOCUMENT].pdf")
pages = loader.load_and_split()

embedding_function = OpenAIEmbeddings()
vectordb = Chroma.from_documents(pages, embedding_function, persist_directory="[MY PERSIST DIRECTORY]")
vectordb.persist()
vectordb = None

When using the persisted DB, I loaded it with:

vectordb = Chroma(persist_directory="[MY PERSIST DIRECTORY]", embedding_function=embedding_function, collection_name="[MY COLLECTION NAME]")

My issue was passing a collection_name that I had not provided when persisting the DB. According to the documentation (https://python.langchain.com/en/latest/reference/modules/vectorstores.html), Chroma uses the default collection_name = "langchain" when none is provided. Either removing the collection_name or setting it to langchain when loading the persisted DB resolved my error.

The lesson: keep collection_names the same.

Hi @jeffchuber @murasz, I found the issue. It is in how the UUID is created and fetched: the "index not found" error only happens with a custom collection_name. The saved index files get their UUID from uuid.uuid4(), which generates a brand-new UUID every time it is called.

Reference files: chromadb/db/clickhouse.py, chromadb/db/duckdb.py


I instead changed the UUID mechanism used when creating and fetching collections to seed a random generator with the collection_name, which produces a consistent UUID for a fixed value of collection_name:

   import random
   import uuid

   # Seed a private Random instance with the collection name so the
   # resulting UUID is deterministic for a given name
   rd = random.Random()
   rd.seed(collection_name)
   collection_uuid = uuid.UUID(int=rd.getrandbits(128))

This generates a consistent UUID for a collection based on the name passed in. Do the same in the get_collection_uuid_from_name() functions as well.

I tried this in the various scenarios where it was failing for me, and this solution works in all of them.
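The seeded-UUID idea above can be checked with nothing but the standard library. This is an illustrative sketch only; collection_uuid_for is a hypothetical helper name, not part of chromadb:

```python
import random
import uuid


def collection_uuid_for(collection_name: str) -> uuid.UUID:
    """Derive a deterministic UUID from a collection name by seeding
    a private Random instance with the name (hypothetical helper)."""
    rd = random.Random()
    rd.seed(collection_name)
    return uuid.UUID(int=rd.getrandbits(128))


# The same name always maps to the same UUID, so an index written under
# a custom collection name can be located again on reload; different
# names still get distinct UUIDs.
print(collection_uuid_for("langchain") == collection_uuid_for("langchain"))   # True
print(collection_uuid_for("state-of-union") == collection_uuid_for("langchain"))  # False
```

Seeding with a string is deterministic across processes (random.Random.seed hashes str seeds itself rather than relying on the process hash seed), which is exactly the property the fix needs.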

You may have to call db.persist() after db = Chroma(...) in Section 2.

Yes, once embeddings are stored you can query against them next time without re-embedding the documents; only the query itself still has to be embedded on each search.

Hi everyone, if you are using a notebook – you need to call client.persist() manually because the garbage collection in a notebook does not call the __del__ lifecycle method on the object.
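The notebook caveat can be reproduced without Chroma at all: as long as a session global still references the object, CPython never runs its __del__, so any save-on-delete logic is skipped. A minimal stand-in (FakeStore is invented for illustration):

```python
class FakeStore:
    """Stand-in for a client that flushes its index from __del__."""

    def __init__(self) -> None:
        self.persisted = False

    def persist(self) -> None:
        self.persisted = True

    def __del__(self) -> None:
        # Relied-upon cleanup: never fires while the object is referenced.
        self.persist()


store = FakeStore()
# In a notebook, `store` stays referenced by the session globals, so
# __del__ (and the implicit persist) does not run while you work.
print(store.persisted)  # False

# The fix: call persist() explicitly instead of relying on __del__.
store.persist()
print(store.persisted)  # True
```

This is why a plain script can appear to persist "for free" at interpreter exit while the same code in a notebook leaves the directory without an index.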

I just added a PR here to improve our docs on this. https://github.com/chroma-core/docs/pull/44

Reading this thread from start to end, my understanding is that the root cause is that Chroma does not reinitialize collection_name to langchain when restoring from a persisted directory, so passing collection_name = "langchain" (or whatever manual name was originally used) should resolve this error. Feel free to correct me if I'm wrong.

@jeffchuber Sorry I forgot to update my thread. I was able to resolve this error simply by upgrading chroma! I am not facing this error anymore but I uncovered another one and created a ticket on chroma’s repo: https://github.com/chroma-core/chroma/issues/640

Please look into it if possible, I owe you and Anton a beer if it is something that I overlooked.

@jeffchuber Thanks for this. It solved the issue for me.

OK, sounds great! Please send that along via email or Discord DM.

Here is a trivial example based on the LangChain example vectorstore.ipynb:

test.py

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter

from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma.from_documents(texts, embeddings, collection_name="state-of-union", persist_directory=".chromadb/")

val = state_of_union_store.similarity_search("the", k=2)
print(val)

test2.py

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma(collection_name="state-of-union", persist_directory=".chromadb/", embedding_function=embeddings)

val = state_of_union_store.similarity_search("the", k=2)
print(val)

This works on my end. Can others try this?