chroma: [Bug]: Adding large datasets to a collection using the OpenAI embedding function fails

What happened?

I’ve recently hit two different failure modes that appear to be caused by adding a large dataset to a collection all at once. Both were worked around by adding the data in smaller chunks (300 at a time, which was just the first chunk size I tried; see the batching sketch after the script below).

My script (abbreviated) looks roughly like:

import os

from chromadb import Client
from chromadb.config import Settings
from chromadb.utils import embedding_functions

api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise Exception("Please set OPENAI_API_KEY environment variable")
chroma_client = Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory=".chroma/test"))
chroma_client.persist()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=api_key, model_name="text-embedding-ada-002")
# Create a collection with the openai embedding function
collection = chroma_client.get_or_create_collection(name="test", embedding_function=openai_ef)

# Some code to generate texts, metadatas, and ids

collection.add(documents=texts, metadatas=metadatas, ids=ids)
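
For reference, the workaround looks roughly like the following (add_in_batches is just an illustrative name, not part of Chroma’s API; 300 happened to be the first batch size I tried):

def add_in_batches(collection, documents, metadatas, ids, batch_size=300):
    # Add the data in slices small enough to stay under OpenAI's per-request limits.
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )

add_in_batches(collection, texts, metadatas, ids)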

When the length of each of these lists was 10823, I reliably got the following:

Traceback (most recent call last):
  File "/Users/ibeckermayer/test/scripts/bug.py", line 290, in <module>
    main()
  File "/Users/ibeckermayer/test/scripts/bug.py", line 284, in main
    embed_all(texts, metadatas, ids)
  File "/Users/ibeckermayer/test/scripts/bug.py", line 33, in embed_all
    collection.add(documents=texts, metadatas=cast_metadatas, ids=ids)
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
    embeddings = self._embedding_function(documents)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
    embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
openai.error.APIError: Internal error {
    "error": {
        "message": "Internal error",
        "type": "internal_error",
        "param": null,
        "code": "internal_error"
    }
}
 500 {'error': {'message': 'Internal error', 'type': 'internal_error', 'param': None, 'code': 'internal_error'}} {'Date': 'Sat, 17 Jun 2023 23:58:31 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '152', 'Connection': 'keep-alive', 'vary': 'Origin', 'x-ratelimit-limit-requests': '3000', 'x-ratelimit-remaining-requests': '2999', 'x-ratelimit-reset-requests': '20ms', 'x-request-id': '6b27f33d31697820aebb0c1cc6a48242', 'strict-transport-security': 'max-age=15724800; includeSubDomains', 'CF-Cache-Status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '7d8f3a5d88e1811a-ORD', 'alt-svc': 'h3=":443"; ma=86400'}

When I modified my code to cut the lists down to a length of 4628, I got:

Traceback (most recent call last):
  File "/Users/ibeckermayer/test/scripts/bug.py", line 290, in <module>
    main()
  File "/Users/ibeckermayer/test/scripts/bug.py", line 284, in main
    embed_all(texts, metadatas, ids)
  File "/Users/ibeckermayer/test/scripts/bug.py", line 33, in embed_all
    collection.add(documents=texts, metadatas=cast_metadatas, ids=ids)
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 96, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/api/models/Collection.py", line 387, in _validate_embedding_set
    embeddings = self._embedding_function(documents)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/chromadb/utils/embedding_functions.py", line 111, in __call__
    embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "/Users/ibeckermayer/venvs/test/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.

Versions

chromadb==0.3.26, openai==0.27.8, Python 3.11.3

Relevant log output

(Same two tracebacks as shown above.)

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

As a heads-up regarding Chroma’s use of Embedding: there is apparently a limit of 2048 elements in the input list when you call Embedding.create. I can’t find this documented anywhere, but I nailed it down through some experimentation. Not sure whether you want to keep passing along the not-very-helpful error message from OpenAI, or whether you want to check against this limit preemptively and hope they don’t quietly change it in the future.
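
For illustration, here is a minimal sketch of what such a preemptive check could look like (this is not Chroma’s actual code; OPENAI_MAX_BATCH is just a name for the apparent, undocumented limit):

import openai

OPENAI_MAX_BATCH = 2048  # apparent, undocumented cap on len(input) per Embedding.create call

def embed_batch(texts, model_name="text-embedding-ada-002"):
    # Fail early with a clear message instead of surfacing OpenAI's opaque "Internal error".
    if len(texts) > OPENAI_MAX_BATCH:
        raise ValueError(
            f"OpenAI's embedding endpoint appears to reject more than {OPENAI_MAX_BATCH} "
            f"inputs per request (got {len(texts)}); split the documents into smaller batches."
        )
    return openai.Embedding.create(input=texts, engine=model_name)["data"]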

I’ll raise this with our friends over at OpenAI.

@emcd - I wonder if we can work with OpenAI to make that error message better. cc @atroyn

Update: I was already using tiktoken to calculate API costs, so I added some logic to double-check that no document was going over 8191 tokens (roughly as sketched below). Anyway, I will open an issue pointing the OpenAI team to this thread.
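
Roughly, the token check looks like this (a sketch of my own validation, not Chroma’s code; it assumes the text-embedding-ada-002 encoding):

import tiktoken

MAX_INPUT_TOKENS = 8191  # per-input token limit for text-embedding-ada-002
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

# Find any documents that would exceed the model's per-input token limit.
too_long = [i for i, text in enumerate(texts) if len(enc.encode(text)) > MAX_INPUT_TOKENS]
if too_long:
    raise ValueError(f"{len(too_long)} document(s) exceed {MAX_INPUT_TOKENS} tokens, e.g. index {too_long[0]}")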

Hi @ibeckermayer, I may be wrong, but I think what you state above may not be fully correct.

  • ChromaDB uses OpenAI’s Embedding class to get the embeddings of the documents you pass, as follows: embeddings = self._client.create(input=texts, engine=self._model_name)["data"], i.e. there is no loop over the documents; they are all sent to the API in a single request.
  • This is supported by the OpenAI API create endpoint that you mention in your comment; I don’t see the 1-by-1 limit you stated, since the input can be an array of strings:

input (string or array, Required): Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or an array of token arrays. Each input must not exceed the max input tokens for the model (8191 tokens for text-embedding-ada-002).

I don’t know what would happen if you sent a very large number of documents to the API; it may give you an error like the one you’ve experienced.

Two other possibilities for this error message: 1) the token limit per document depends on the model you use, and the 8191-token limit above is for text-embedding-ada-002; are you using that model? 2) the number of tokens in each document is counted with tiktoken, whose encoding differs per model (https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb); could it be that one of your documents is over the limit?

I don’t know if this helps 😃