chroma: Cannot submit more than 41,666 embeddings at once.

Hi,

I am using chromadb 0.4.7 and langchain 0.0.274 and trying to save some embeddings into a Chroma database.

After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:

File ".../chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.

Could you please let me know what changed and why I am getting this error?

Thank you!

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 30 (7 by maintainers)

Most upvoted comments

Hello, in the end this is what I did:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# chunk_size here controls how many texts go to the OpenAI API per request
embedding = OpenAIEmbeddings(show_progress_bar=True, chunk_size=3)


def split_list(input_list, chunk_size):
    # Yield successive chunk_size-sized slices of input_list
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]


# Stay below Chroma's 41,666-embedding batch limit
split_docs_chunked = split_list(split_docs, 41000)

for split_docs_chunk in split_docs_chunked:
    vectordb = Chroma.from_documents(
        documents=split_docs_chunk,
        embedding=embedding,
        persist_directory=persist_directory,
    )
    vectordb.persist()

It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me). Hope this helps.

I tried this on a large folder of PDFs and am still getting the same error; I also tried a smaller chunk_size, but it still doesn't work. Any other workaround?

@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, chroma_client = chromadb.PersistentClient()). The max batch size depends on your setup:

max_batch_size = chroma_client._producer.max_batch_size

Note that the Chroma team is about to make max_batch_size a public property of the API, but this should work in the meantime.
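
To tie this back to the chunked ingestion above, here is a minimal sketch, assuming a local persistent client; the ./chroma_db path is a placeholder, and split_list/split_docs come from the earlier comment:

import chromadb

# _producer.max_batch_size is private for now; the team plans to expose it publicly
chroma_client = chromadb.PersistentClient(path="./chroma_db")
max_batch_size = chroma_client._producer.max_batch_size
print(f"This setup accepts at most {max_batch_size} embeddings per call")

# Use the real limit as the chunk size instead of hard-coding 41000
split_docs_chunked = split_list(split_docs, max_batch_size)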

My primary motivation was DWM (don't waste money), which, if you are submitting 40k+ embeddings through OpenAI, is definitely a concern 😄

I just realized that we might add a new method for batch submissions (e.g. batch_add()), which would just be syntactic sugar on top of the usual add(): it would batch things for the developer, implement some retries, and save a state file with your embeddings so you can retry them later. But then again … baby steps 😃.
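
Until something like that lands, a user-side helper along these lines gets most of the way there; add_in_batches is a hypothetical name, and it uses only the existing collection.add() API (no retries or state file):

def add_in_batches(collection, ids, documents, metadatas=None, batch_size=41_000):
    # Split one large add() into chunks that stay under Chroma's batch limit
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end] if metadatas else None,
        )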

These are two separate problems:

  1. The token limit of the embedding model
  2. The batch size Chroma can accept

@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too large docs - https://github.com/chroma-core/chroma/pull/1008

I also had a gist about splitting texts using LangChain (not an ideal solution, but it can give you an idea about chunking large texts; a simplified sketch follows below) - https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
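
For reference, a simplified sketch of that idea using LangChain's RecursiveCharacterTextSplitter; docs is a placeholder for your loaded documents, and the chunk_size/chunk_overlap values are arbitrary examples, not recommendations:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split long documents into pieces small enough for the embedding model's token limit
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = splitter.split_documents(docs)  # docs: a list of LangChain Document objects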

@HammadB yes, I'm using LangChain for the chunking. I'll try what you suggested. Thank you!