chroma: Cannot submit more than 41,666 embeddings at once.

Hi,

I am using chromadb 0.4.7 and langchain 0.0.274 and trying to save some embeddings into a Chroma database.

After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:

File ".../chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.

Could you please let me know what changed and why I am getting this error?

Thank you!

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 30 (7 by maintainers)

Most upvoted comments

Hello, in the end this is what I did:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# chunk_size here controls how many texts go to the OpenAI API per request
embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)


def split_list(input_list, chunk_size):
    # Yield successive chunk_size-sized slices of input_list
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]


# Stay below Chroma's 41,666-embedding batch limit
split_docs_chunked = split_list(split_docs, 41000)

for split_docs_chunk in split_docs_chunked:
    vectordb = Chroma.from_documents(
        documents=split_docs_chunk,
        embedding=embedding,
        persist_directory=persist_directory,
    )
    vectordb.persist()

It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me). Hope this helps.

I tried this on a large folder of PDFs and am still getting the same error; I also tried a smaller chunk_size, but it still doesn't work. Any other workaround?

@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, chroma_client = chromadb.PersistentClient()). The max batch size depends on your setup:

max_batch_size = chroma_client._producer.max_batch_size

Note that the Chroma team is about to make max_batch_size a public property of the API, but this should work in the meantime.
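
To tie this back to the chunked ingestion above, here is a minimal sketch, assuming a local persistent client; the ./chroma_db path is a placeholder, and split_list/split_docs come from the earlier comment:

import chromadb

# _producer.max_batch_size is private for now; the team plans to expose it publicly
chroma_client = chromadb.PersistentClient(path="./chroma_db")
max_batch_size = chroma_client._producer.max_batch_size
print(f"This setup accepts at most {max_batch_size} embeddings per call")

# Use the real limit as the chunk size instead of hard-coding 41000
split_docs_chunked = split_list(split_docs, max_batch_size)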

My primary motivation was DWM (don't waste money), which, if you are submitting 40k+ embeddings through OpenAI, is definitely a concern 😄

I just realized that we might add a new method for batch submissions (e.g. batch_add()), which would just be syntactic sugar on top of the usual add(): it would batch things for the developer, implement some retries, and save a state file with your embeddings so you can retry them later. But then again … baby steps 😃.
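
Until something like that lands, a user-side helper along these lines gets most of the way there; add_in_batches is a hypothetical name, and it uses only the existing collection.add() API (no retries or state file):

def add_in_batches(collection, ids, documents, metadatas=None, batch_size=41_000):
    # Split one large add() into chunks that stay under Chroma's batch limit
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end] if metadatas else None,
        )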

These are two separate problems:

  1. The token limit of the embedding model
  2. The batch size Chroma can accept

@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too large docs - https://github.com/chroma-core/chroma/pull/1008

I also had a gist about splitting texts using LangChain (not an ideal solution, but it can give you an idea about chunking large texts; a simplified sketch follows below) - https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
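
For reference, a simplified sketch of that idea using LangChain's RecursiveCharacterTextSplitter; docs is a placeholder for your loaded documents, and the chunk_size/chunk_overlap values are arbitrary examples, not recommendations:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split long documents into pieces small enough for the embedding model's token limit
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = splitter.split_documents(docs)  # docs: a list of LangChain Document objects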

@HammadB yes, I'm using LangChain for the chunking. I'll try what you suggested. Thank you!