chroma: Cannot submit more than 41,666 embeddings at once.
Hi,
I am using chromadb 0.4.7 and langchain 0.0.274 while trying to save some embeddings into a chromadb.
After upgrading to the versions mentioned above, I started to get the following error when trying to ingest some offline documents:
```
/chromadb/db/mixins/embeddings_queue.py", line 127, in submit_embeddings
    raise ValueError(
ValueError: Cannot submit more than 41,666 embeddings at once. Please submit your embeddings in batches of size 41,666 or less.
```
Could you please let me know what changed and why I am getting this error?
Thank you!
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 30 (7 by maintainers)
Commits related to this issue
- feat: CIP-5: Large Batch Handling Improvements Proposal - Including only CIP for review. Refs: #1049 — committed to amikos-tech/chroma-core by tazarov 10 months ago
- feat: CIP-5: Large Batch Handling Improvements Proposal - Updated CIP - Implementation done - Added a new test in test_add Refs: #1049 — committed to amikos-tech/chroma-core by tazarov 10 months ago
- feat: CIP-5: Large Batch Handling Improvements Proposal - Refactored the code to no longer do the splitting in _add,_update,_upsert - Batch utils library remains as a utility method for users to spli... — committed to amikos-tech/chroma-core by tazarov 10 months ago
- feat: CIP-5: Large Batch Handling Improvements Proposal - Minor improvement suggested by @imartinez to pass API to create_batches utility method. Refs: #1049 — committed to amikos-tech/chroma-core by tazarov 10 months ago
- [ENH]: CIP-5: Large Batch Handling Improvements Proposal (#1077) - Including only CIP for review. Refs: #1049 ## Description of changes *Summarize the changes made by this PR.* - Improveme... — committed to chroma-core/chroma by tazarov 9 months ago
- [ENH]: CIP-5: Large Batch Handling Improvements Proposal (#1077) - Including only CIP for review. Refs: #1049 ## Description of changes *Summarize the changes made by this PR.* - Improveme... — committed to amikos-tech/chroma-core by tazarov 9 months ago
Hello, in the end this is what I did:
It worked, but it would be nice to at least send a warning to the user so that their credits aren't wasted (that's what happened to me). Hope this helps.
@MarcoV10 You can get the actual max batch size from a Chroma client (create, for example, `chroma_client = chromadb.PersistentClient()`). The max batch size depends on your setup:

```python
max_batch_size = chroma_client._producer.max_batch_size
```

Note that the Chroma team is about to make `max_batch_size` a public property of the API, but this should work in the meantime.
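A minimal sketch of using that value to batch a large ingest. The `split_into_batches` helper is hypothetical (not part of chromadb), and the client usage is commented out since `_producer.max_batch_size` is an internal attribute that may change:

```python
def split_into_batches(items, max_batch_size):
    """Split a list into consecutive chunks of at most max_batch_size items."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]

# Assumed usage against a real client (untested sketch):
# import chromadb
# chroma_client = chromadb.PersistentClient()
# collection = chroma_client.get_or_create_collection("docs")
# max_batch_size = chroma_client._producer.max_batch_size  # internal attribute
# for id_batch, emb_batch in zip(split_into_batches(ids, max_batch_size),
#                                split_into_batches(embeddings, max_batch_size)):
#     collection.add(ids=id_batch, embeddings=emb_batch)
```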
My primary motivation was DWM (don't waste money), which you will certainly be doing if you are submitting 40k+ embeddings with OpenAI 😄
I just realized that we might add a new method for batch submissions (e.g. `batch_add()`), which would just be syntactic sugar on top of the usual `add()`: batching things for the developer, implementing some retries, and saving a state file with your embeddings so you can retry them later. But then again … baby steps 😃. These are two separate problems, though.
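A rough sketch of what such a `batch_add()` wrapper could look like (hypothetical, not part of chromadb; the state-file part is omitted, and `add_fn` stands in for a collection's `add` method):

```python
import time

def batch_add(add_fn, ids, embeddings, batch_size, retries=3):
    """Hypothetical wrapper: call add_fn on fixed-size batches, retrying
    each batch with exponential backoff before giving up."""
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        batch_embeddings = embeddings[start:start + batch_size]
        for attempt in range(retries):
            try:
                add_fn(ids=batch_ids, embeddings=batch_embeddings)
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # this batch failed on every attempt
                time.sleep(2 ** attempt)  # simple backoff between retries
```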
@ilisparrow, glad you figured out a solution. There is a PR in the works for warning users about too large docs - https://github.com/chroma-core/chroma/pull/1008
I also had a gist about splitting texts using LangChain (not an ideal solution, but can give you an idea about chunking large texts) - https://gist.github.com/tazarov/e66c1d3ae298c424dc4ffc8a9a916a4a
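The idea behind that kind of chunking can be illustrated with a tiny character-window splitter (a simplified stand-in written for this thread, not the gist's code; LangChain's `RecursiveCharacterTextSplitter` does roughly this with smarter separator handling):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping character windows of at most chunk_size."""
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```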
@HammadB yes, using langchain for doing the chunking. I’ll try as you suggested. Thank you!