chroma: [Bug]: Adding large datasets to a collection using OpenAI embedding function fails
What happened?
I’ve recently hit two different failure modes that both appear to be caused by adding a large dataset to a collection in a single call. Both went away when I added the data in smaller chunks (300 at a time, which was simply the first chunk size I tried).
My script (abbreviated) looks roughly like:
# Imports assumed for chromadb 0.3.x; the original script is abbreviated
import os
from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise Exception("Please set OPENAI_API_KEY environment variable")
chroma_client = Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory=".chroma/test"))
chroma_client.persist()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=api_key, model_name="text-embedding-ada-002")
# Create a collection with the OpenAI embedding function
collection = chroma_client.get_or_create_collection(name="test", embedding_function=openai_ef)
# Some code to generate texts, metadatas, and ids (omitted)
collection.add(documents=texts, metadatas=metadatas, ids=ids)
When the length of each of these lists was 10823, I reliably got the following:
Traceback (most recent call last):
File "/Users/ibeckermayer/test/scripts/bug.py", line 290, in <module>
main()
File "/Users/ibeckermayer/test/scripts/bug.py", line 284, in main
embed_all(texts, metadatas, ids)
File "/Users/ibeckermayer/test/scripts/bug.py", line 33, in embed_all
collection.add(documents=texts, metadatas=cast_metadatas, ids=ids)
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
ids, embeddings, metadatas, documents = self._validate_embedding_set(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
embeddings = self._embedding_function(documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
response = super().create(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
raise self.handle_error_response(
openai.error.APIError: Internal error {
"error": {
"message": "Internal error",
"type": "internal_error",
"param": null,
"code": "internal_error"
}
}
500 {'error': {'message': 'Internal error', 'type': 'internal_error', 'param': None, 'code': 'internal_error'}} {'Date': 'Sat, 17 Jun 2023 23:58:31 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '152', 'Connection': 'keep-alive', 'vary': 'Origin', 'x-ratelimit-limit-requests': '3000', 'x-ratelimit-remaining-requests': '2999', 'x-ratelimit-reset-requests': '20ms', 'x-request-id': '6b27f33d31697820aebb0c1cc6a48242', 'strict-transport-security': 'max-age=15724800; includeSubDomains', 'CF-Cache-Status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '7d8f3a5d88e1811a-ORD', 'alt-svc': 'h3=":443"; ma=86400'}
When I modified my code to cut the lists down to a length of 4628, I got the following instead:
Traceback (most recent call last):
File "/Users/ibeckermayer/test/scripts/bug.py", line 290, in <module>
main()
File "/Users/ibeckermayer/test/scripts/bug.py", line 284, in main
embed_all(texts, metadatas, ids)
File "/Users/ibeckermayer/test/scripts/bug.py", line 33, in embed_all
collection.add(documents=texts, metadatas=cast_metadatas, ids=ids)
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
ids, embeddings, metadatas, documents = self._validate_embedding_set(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
embeddings = self._embedding_function(documents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
response = super().create(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.
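The workaround mentioned at the top, adding the data in smaller chunks, might look roughly like the sketch below. The helper name add_in_chunks is hypothetical and not part of chromadb. It only assumes that the collection object exposes add(documents=..., metadatas=..., ids=...) as in the script above, and the default of 300 is simply the chunk size reported to work in this issue, not a tuned value.

```python
from typing import Any, Dict, List


def add_in_chunks(
    collection: Any,
    documents: List[str],
    metadatas: List[Dict[str, Any]],
    ids: List[str],
    chunk_size: int = 300,
) -> int:
    """Call collection.add once per chunk and return the number of calls made.

    `collection` is assumed to be any object with an
    add(documents=..., metadatas=..., ids=...) method.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas, and ids must have the same length")

    calls = 0
    for start in range(0, len(documents), chunk_size):
        end = start + chunk_size
        # Slice all three lists with the same bounds so they stay aligned.
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        calls += 1
    return calls
```

With the 10823 items from the first failure and a chunk size of 300, this makes 37 calls, the last one carrying the remaining 23 items.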
Versions
chromadb==0.3.26 openai==0.27.8 Python 3.11.3
About this issue
- State: closed
- Created a year ago
- Comments: 15 (5 by maintainers)
As a heads up regarding Chroma’s use of Embedding: there is apparently a limit of 2048 elements in the input list when you use Embedding.create. I can’t find this documented anywhere, but I nailed it down with some experimentation. Not sure if you want to continue passing on the not-very-helpful error message from OpenAI or whether you want to preemptively check against this limit and hope they don’t quietly change it in the future.
I’ll raise this with our friends over at OpenAI.
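A preemptive check of the kind suggested above could be sketched as follows. This is only an illustration: the names MAX_EMBEDDING_INPUTS, EmbeddingInputTooLargeError, and check_embedding_input are hypothetical, and 2048 is the experimentally observed figure from this thread rather than a documented constant, so it is kept as an overridable parameter.

```python
from typing import Sequence

# Assumption: limit observed experimentally in this thread, not documented.
MAX_EMBEDDING_INPUTS = 2048


class EmbeddingInputTooLargeError(ValueError):
    """Hypothetical error raised before the request is sent to the API."""


def check_embedding_input(
    texts: Sequence[str], max_inputs: int = MAX_EMBEDDING_INPUTS
) -> None:
    """Raise a descriptive error if `texts` has more items than `max_inputs`."""
    count = len(texts)
    if count > max_inputs:
        raise EmbeddingInputTooLargeError(
            f"Got {count} inputs, but the embeddings endpoint appears to accept "
            f"at most {max_inputs} per request; split the input into smaller batches."
        )
```

Both list lengths from the report (10823 and 4628) would be rejected by this check with a message that names the actual problem, instead of surfacing the API's 500 or '$.input' is invalid responses.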
@emcd - I wonder if we can work with OpenAI to make that error message better. cc @atroyn
Update: I was already using tiktoken to calculate API costs, so I added some logic to double-check that I wasn’t going over 8191 tokens. Anyway, I will open an issue pointing the OpenAI team to this thread.
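A check along those lines might look like the sketch below. The function names are hypothetical, and the actual logic used by the author is not shown in this thread. The token counter is passed in as a callable so the check itself has no dependencies; make_tiktoken_counter shows one way to build it, assuming the third-party tiktoken package is installed.

```python
from typing import Callable, List, Sequence, Tuple

# Figure quoted in this thread for text-embedding-ada-002.
MAX_TOKENS_PER_INPUT = 8191


def make_tiktoken_counter(
    model_name: str = "text-embedding-ada-002",
) -> Callable[[str], int]:
    """Build a token counter backed by tiktoken (imported lazily)."""
    import tiktoken  # third-party dependency, only needed for this helper

    encoding = tiktoken.encoding_for_model(model_name)
    return lambda text: len(encoding.encode(text))


def find_oversized(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    limit: int = MAX_TOKENS_PER_INPUT,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text whose count exceeds `limit`."""
    oversized = []
    for index, text in enumerate(texts):
        token_count = count_tokens(text)
        if token_count > limit:
            oversized.append((index, token_count))
    return oversized
```

Calling find_oversized(texts, make_tiktoken_counter()) before embedding returns an empty list when every document is within the limit.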
Hi @ibeckermayer, I may be wrong, but I think that what you state above may not be fully correct.
Chroma uses the Embedding class of OpenAI to get the embeddings of the documents you pass, as follows: embeddings = self._client.create(input=texts, engine=self._model_name)["data"]. That is, there is no loop that sends the documents in batches; they are all sent to the API in a single request. I don’t know what would happen if you sent a very large number of documents to the API; it may give you an error like the one you’ve experienced.
Other possible causes of this error message: 1) the token limit per document depends on the model you use, and that limit is for text-embedding-ada-002; are you using that model? 2) the number of tokens in the documents is calculated with tiktoken, and the encoding differs per model (https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb); could it be that some of your documents are over the limit?
I don’t know if this helps 😃
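Since the embedding call described in this comment sends every document in one request, one possible mitigation is to wrap the embedding function so that it loops over batches itself. The sketch below is an assumption-laden illustration, not Chroma's implementation: the name batched is hypothetical, it treats the embedding function as a plain callable that maps a list of strings to a list of vectors, and the default batch size of 2048 is the undocumented limit reported elsewhere in this thread.

```python
from typing import Callable, List

Embedding = List[float]
EmbedFn = Callable[[List[str]], List[Embedding]]


def batched(embed: EmbedFn, batch_size: int = 2048) -> EmbedFn:
    """Wrap `embed` so each underlying call receives at most `batch_size` texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    def wrapper(texts: List[str]) -> List[Embedding]:
        results: List[Embedding] = []
        for start in range(0, len(texts), batch_size):
            batch = list(texts[start:start + batch_size])
            # Results are appended in input order, so embeddings stay aligned
            # with the documents they belong to.
            results.extend(embed(batch))
        return results

    return wrapper
```

If the collection accepts any callable as its embedding function, something like embedding_function=batched(openai_ef) could be passed in place of openai_ef. For the 4628 documents in the second failure, this would result in three underlying requests of 2048, 2048, and 532 documents.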