langchain: ValidationError: 2 validation errors for DocArrayDoc when trying to run the RAG Search Example

System Info

LangChain Version: 0.0.335
Platform: Win11
Python Version: 3.11.5

Hi experts,

I’m trying to run the RAG Search Example from the LangChain docs: https://python.langchain.com/docs/expression_language/get_started

Here is the code:

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough

vectorstore = DocArrayInMemorySearch.from_texts(
    ["harrison worked at kensho", "bears like to eat honey"],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()
output_parser = StrOutputParser()

setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)
chain = setup_and_retrieval | prompt | model | output_parser

chain.invoke("where did harrison work?")

but the example fails with a ValidationError (2 validation errors for DocArrayDoc). Here are the error details:

C:\Project\pythonProjectAI\.venv\Lib\site-packages\pydantic\_migration.py:283: UserWarning: pydantic.error_wrappers:ValidationError has been moved to pydantic:ValidationError.
  warnings.warn(f'{import_path} has been moved to {new_location}.')
Traceback (most recent call last):
  File "C:\Project\pythonProjectAI\AI.py", line 28, in <module>
    chain.invoke("where did harrison work?")
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\runnable\base.py", line 1427, in invoke
    input = step.invoke(
            ^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\runnable\base.py", line 1938, in invoke
    output = {key: future.result() for key, future in zip(steps, futures)}
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\runnable\base.py", line 1938, in <dictcomp>
    output = {key: future.result() for key, future in zip(steps, futures)}
                   ^^^^^^^^^^^^^^^
  File "C:\Program Files\Python\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Program Files\Python\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\retriever.py", line 112, in invoke
    return self.get_relevant_documents(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\retriever.py", line 211, in get_relevant_documents
    raise e
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\retriever.py", line 204, in get_relevant_documents
    result = self._get_relevant_documents(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\schema\vectorstore.py", line 657, in _get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\vectorstores\docarray\base.py", line 127, in similarity_search
    results = self.similarity_search_with_score(query, k=k, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\langchain\vectorstores\docarray\base.py", line 106, in similarity_search_with_score
    query_doc = self.doc_cls(embedding=query_embedding)  # type: ignore
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Project\pythonProjectAI\.venv\Lib\site-packages\pydantic\main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for DocArrayDoc
text
  Field required [type=missing, input_value={'embedding': [-0.0192381…, 0.010137099064823456]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing
metadata
  Field required [type=missing, input_value={'embedding': [-0.0192381…, 0.010137099064823456]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/missing

Expected behavior

The example should run without errors.

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 11
  • Comments: 19

Most upvoted comments

I’m also experiencing this issue, but I was able to find a workaround. The issue seems to be a conflict in the version of pydantic that gets installed. You want a 1.10.x version, not 2.x. See https://github.com/langchain-ai/langchain/issues/12916

In the meantime, I was able to work around this locally with pip install pydantic==1.10.8.

It might help to install pydantic==1.10.13. It works for me with the following environment:

pydantic==1.10.13
docarray==0.32.1
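
Before pinning anything, it can help to confirm which versions are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper names `major` and `installed_version` are hypothetical and not part of LangChain, pydantic, or docarray.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major(version_string: str) -> int:
    """Return the leading (major) component of a version string such as '1.10.13'."""
    return int(version_string.split(".")[0])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages discussed in this thread.
for name in ("pydantic", "docarray", "langchain"):
    print(f"{name}: {installed_version(name) or 'not installed'}")

# Flag the combination that the comments above associate with the error.
pydantic_version = installed_version("pydantic")
if pydantic_version is not None and major(pydantic_version) >= 2:
    print("pydantic 2.x detected; the comments above pin a 1.10.x release instead")
```

Run it with the same interpreter (and virtual environment) that runs the failing script, since a different environment may report different versions.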

🤖

Hello @xidian237!

I’m Dosu, a friendly bot here to assist you with your issues and questions related to LangChain. While you’re waiting for a human maintainer, I’m here to help guide you through the process of bug fixing and contributing. Let’s get started!

It seems like the error you’re encountering is related to the DocArrayDoc class. The error message indicates that the text and metadata fields are missing when trying to create a DocArrayDoc instance.

In the LangChain codebase, the DocArrayDoc class defines the schema of the documents stored in the DocArray-based vector stores. It has three attributes: text, embedding, and metadata. The text and metadata fields are annotated as Optional but have no default value. pydantic 1.x fills such fields with None, whereas pydantic 2.x treats them as required, which is why they are reported as missing when similarity_search_with_score creates a DocArrayDoc instance from an embedding alone.

Here is the relevant code for the DocArrayDoc class:

class DocArrayDoc(BaseDoc):
    text: Optional[str]
    embedding: Optional[NdArray] = Field(**embeddings_params)
    metadata: Optional[dict]

And here is the part of the similarity_search_with_score method where the DocArrayDoc instance is created:

query_doc = self.doc_cls(embedding=query_embedding)  # type: ignore

From the traceback you provided, the query document is created with only the embedding argument, so under pydantic 2.x the text and metadata fields are reported as missing, which leads to the ValidationError.

The query_embedding itself is just a list of floats, so there is nothing to change in how the OpenAIEmbeddings class creates it. To work around the issue, you could try installing a pydantic 1.10.x release, which other comments in this thread report resolves the error.
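
One likely reading of the error, consistent with the pydantic downgrade reported elsewhere in this thread, is that an `Optional[...]` annotation only allows `None` as a value and does not by itself supply a default. pydantic 2.x treats such a field as required, while pydantic 1.x implicitly defaulted it to `None`. The sketch below is an analogy using standard-library dataclasses, not pydantic itself, and the class names `QueryDocStrict` and `QueryDocLenient` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryDocStrict:
    # Hypothetical stand-in for DocArrayDoc: Optional[...] without a default
    # means "None is allowed", but the argument must still be passed.
    text: Optional[str]
    embedding: Optional[List[float]]
    metadata: Optional[dict]


@dataclass
class QueryDocLenient:
    # Same fields, but with explicit None defaults, so they may be omitted.
    text: Optional[str] = None
    embedding: Optional[List[float]] = None
    metadata: Optional[dict] = None


try:
    # Mirrors self.doc_cls(embedding=query_embedding) from the traceback.
    QueryDocStrict(embedding=[0.1, 0.2])
except TypeError as exc:
    print("strict:", exc)

print("lenient:", QueryDocLenient(embedding=[0.1, 0.2]))
```

The strict class rejects the call because `text` and `metadata` were not passed, which parallels the two "Field required" errors; the lenient class accepts it.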

If this doesn’t solve your issue, please provide more details about how you’re using the OpenAIEmbeddings class and the DocArrayInMemorySearch class, so I can give you more specific advice.

I hope this helps! If you have any other questions, feel free to ask.

Best regards, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

🤖

Based on the error message you provided, it seems that the text and metadata fields are missing when trying to create a DocArrayDoc instance. This could be due to the fact that the document you are trying to convert into a DocArrayDoc does not contain these fields.

In LangChain, the _docarray_to_langchain_doc method is responsible for converting a DocArray document to a LangChain document format. This method expects the document to contain a content_field which is used as page_content in the LangChain document. All other fields excluding tensors and embeddings are added to the metadata of the LangChain document.

If the content_field is not found in the document, a ValueError is raised. This could be the source of your issue. Make sure that the documents you are trying to convert contain the content_field specified in the DocArrayRetriever instance.

Here is the relevant code:

def _docarray_to_langchain_doc(self, doc: Union[Dict[str, Any], Any]) -> Document:
    """
    Convert a DocArray document (which also might be a dict)
    to a langchain document format.

    DocArray document can contain arbitrary fields, so the mapping is done
    in the following way:

    page_content <-> content_field
    metadata <-> all other fields excluding
        tensors and embeddings (so float, int, string)

    Args:
        doc: DocArray document

    Returns:
        Document in langchain format

    Raises:
        ValueError: If the document doesn't contain the content field
    """

    fields = doc.keys() if isinstance(doc, dict) else doc.__fields__
    if self.content_field not in fields:
        raise ValueError(
            f"Document does not contain the content field - {self.content_field}."
        )
    lc_doc = Document(
        page_content=doc[self.content_field]
        if isinstance(doc, dict)
        else getattr(doc, self.content_field)
    )

    for name in fields:
        value = doc[name] if isinstance(doc, dict) else getattr(doc, name)
        if (
            isinstance(value, (str, int, float, bool))
            and name != self.content_field
        ):
            lc_doc.metadata[name] = value

    return lc_doc

You can find this code in the libs/langchain/langchain/retrievers/docarray.py file in the LangChain repository.
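
To make the mapping concrete, here is a simplified, standard-library sketch of the same rule applied to a plain dict. It is not the LangChain method itself: the function name `split_content_and_metadata`, the default `content_field="text"`, and the sample document are all assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def split_content_and_metadata(
    doc: Dict[str, Any], content_field: str = "text"
) -> Tuple[Any, Dict[str, Any]]:
    """Split a dict-shaped document into (page content, metadata)."""
    if content_field not in doc:
        raise ValueError(
            f"Document does not contain the content field - {content_field}."
        )
    # Keep only scalar fields as metadata; embeddings and tensors are dropped.
    metadata = {
        name: value
        for name, value in doc.items()
        if name != content_field and isinstance(value, (str, int, float, bool))
    }
    return doc[content_field], metadata


# Hypothetical sample document.
doc = {
    "text": "harrison worked at kensho",
    "embedding": [0.1, 0.2],
    "source": "notes",
    "page": 3,
}
content, metadata = split_content_and_metadata(doc)
print(content)   # harrison worked at kensho
print(metadata)  # {'source': 'notes', 'page': 3}
```

The embedding is a list, so it is filtered out of the metadata, while the string and integer fields are kept.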

Also, the DocArrayInMemorySearch.from_texts method in LangChain expects the input data in the following structure:

  • texts: A list of strings. Each string is one text to embed and store.
  • embedding: An instance of the Embeddings class. This is used as the embedding function.
  • metadatas: An optional list of dictionaries. Each dictionary in this list represents metadata for each corresponding text in the texts list. If no metadata is provided, it defaults to None.

Here is the relevant part of the code:

@classmethod
def from_texts(
    cls,
    texts: List[str],
    embedding: Embeddings,
    metadatas: Optional[List[Dict[Any, Any]]] = None,
    **kwargs: Any,
) -> DocArrayInMemorySearch:
    """Create an DocArrayInMemorySearch store and insert data.

    Args:
        texts (List[str]): Text data.
        embedding (Embeddings): Embedding function.
        metadatas (Optional[List[Dict[Any, Any]]]): Metadata for each text
            if it exists. Defaults to None.
    Returns:
        DocArrayInMemorySearch Vector Store
    """
    store = cls.from_params(embedding, **kwargs)
    store.add_texts(texts=texts, metadatas=metadatas)
    return store

You can find this code in the libs/langchain/langchain/vectorstores/docarray/in_memory.py file in the LangChain repository.

Please ensure that the documents you are trying to convert contain the content_field specified in the DocArrayRetriever instance and that all other fields that should be included in the metadata are of the correct type (str, int, float, bool).

I’m having a similar issue as well.

Tested this and this is not the problem: embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

The following does not work:

DocArrayInMemorySearch.from_documents(docs, embedding=embeddings)
DocArrayInMemorySearch.from_texts(texts_array, embedding=embeddings)

Installing pydantic==1.10.13 and docarray==0.32.1 unfortunately did not work.

Actively looking for another solution at the moment.

Edit: My fix after hours of troubleshooting was just switching to Qdrant. DocArray’s site recommends a number of other tools similar to DocArray: https://docs.docarray.org/user_guide/storing/docindex/

pydantic==1.10.8
docarray==0.40.0

It works for me when I use the above versions of pydantic and docarray.

@swordsmanluke I solved this problem by using pip install pydantic==1.10.8, thank you.

I downgraded the pydantic version, and it works now.

pydantic = ">=1,<2"
docarray = "^0.40.0"

I am experiencing this issue with the Retrieval Chain example from the quickstart section (https://python.langchain.com/docs/get_started/quickstart) and at first also with the Basic Example and the RAG Search Example (https://python.langchain.com/docs/expression_language/get_started). After a little research I switched to docarray 0.32.1 and pydantic 1.10.8, and the Basic Example and RAG Search Example worked, but the Retrieval Chain still throws the error:

   1711 try:
   1712     for i, step in enumerate(self.steps):
-> 1713         input = step.invoke(
   1714             input,
   1715             # mark each step as a child run
…
  Field required [type=missing, input_value={'embedding': [-0.0144587…7, -0.0153772095736525]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/missing
metadata
  Field required [type=missing, input_value={'embedding': [-0.0144587…7, -0.0153772095736525]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.4/v/missing

Why is this different and is there a solution for this? Answers highly appreciated!

It’s now working with pydantic 1.10.13 and docarray 0.32.1!

I had a similar issue when trying out the RAG Search Example.

from langchain.vectorstores import DocArrayInMemorySearch
vectorstore = DocArrayInMemorySearch.from_texts(
    ["harrison worked at kensho", "bears like to eat honey"],
    embedding=OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 2
      1 from langchain.vectorstores import DocArrayInMemorySearch
----> 2 vectorstore = DocArrayInMemorySearch.from_texts(
      3     ["harrison worked at kensho", "bears like to eat honey"],
      4     embedding=OpenAIEmbeddings(),
      5 )
      6 retriever = vectorstore.as_retriever()

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/langchain/vectorstores/docarray/in_memory.py:67, in DocArrayInMemorySearch.from_texts(cls, texts, embedding, metadatas, **kwargs)
     45 @classmethod
     46 def from_texts(
     47     cls,
   (…)
     51     **kwargs: Any,
     52 ) -> DocArrayInMemorySearch:
     53     """Create an DocArrayInMemorySearch store and insert data.
     54
     55     Args:
   (…)
     65         DocArrayInMemorySearch Vector Store
     66     """
---> 67     store = cls.from_params(embedding, **kwargs)
     68     store.add_texts(texts=texts, metadatas=metadatas)
     69     return store

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/langchain/vectorstores/docarray/in_memory.py:38, in DocArrayInMemorySearch.from_params(cls, embedding, metric, **kwargs)
     20 @classmethod
     21 def from_params(
     22     cls,
   (…)
     27     **kwargs: Any,
     28 ) -> DocArrayInMemorySearch:
     29     """Initialize DocArrayInMemorySearch store.
     30
     31     Args:
   (…)
     36         **kwargs: Other keyword arguments to be passed to the get_doc_cls method.
     37     """
---> 38     _check_docarray_import()
     39     from docarray.index import InMemoryExactNNIndex
     41     doc_cls = cls._get_doc_cls(space=metric, **kwargs)

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/langchain/vectorstores/docarray/base.py:19, in _check_docarray_import()
     17 def _check_docarray_import() -> None:
     18     try:
---> 19         import docarray
     21         da_version = docarray.__version__.split(".")
     22         if int(da_version[0]) == 0 and int(da_version[1]) <= 31:

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/docarray/__init__.py:5
      1 __version__ = '0.39.1'
      3 import logging
----> 5 from docarray.array import DocList, DocVec
      6 from docarray.base_doc.doc import BaseDoc
      7 from docarray.utils._internal.misc import _get_path_from_docarray_root_level

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/docarray/array/__init__.py:2
      1 from docarray.array.any_array import AnyDocArray
----> 2 from docarray.array.doc_list.doc_list import DocList
      3 from docarray.array.doc_vec.doc_vec import DocVec
      5 __all__ = ['DocList', 'DocVec', 'AnyDocArray']

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/docarray/array/doc_list/doc_list.py:51
     43 T = TypeVar('T', bound='DocList')
     44 T_doc = TypeVar('T_doc', bound=BaseDocWithoutId)
     47 class DocList(
     48     ListAdvancedIndexing[T_doc],
     49     PushPullMixin,
     50     IOMixinDocList,
---> 51     AnyDocArray[T_doc],
     52 ):
     53     """
     54     DocList is a container of Documents.
     55
   (…)
    121
    122     """
    124     doc_type: Type[BaseDocWithoutId] = AnyDoc

File ~/mambaforge/envs/langchain/lib/python3.12/site-packages/docarray/array/any_array.py:54, in AnyDocArray.__class_getitem__(cls, item)
     51 @classmethod
     52 def __class_getitem__(cls, item: Union[Type[BaseDocWithoutId], TypeVar, str]):
     53     if not isinstance(item, type):
---> 54         return Generic.__class_getitem__.__func__(cls, item)  # type: ignore
     55         # this do nothing that checking that item is valid type var or str
     56     if not safe_issubclass(item, BaseDocWithoutId):

AttributeError: 'builtin_function_or_method' object has no attribute '__func__'


Here are my installed packages:

langchain    0.0.350    pypi_0                pypi
python       3.12.0     h47c9636_0_cpython    conda-forge
pydantic     2.5.2      pyhd8ed1ab_0          conda-forge
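
The last frame of this traceback fails while importing docarray, on the expression Generic.__class_getitem__.__func__, so it appears to be a separate problem from the pydantic ValidationError discussed above. The message suggests that on this interpreter (Python 3.12) the attribute is a built-in method with no __func__. A minimal standard-library check, with the hypothetical helper name `generic_getitem_has_func`, might look like this:

```python
import sys
from typing import Generic


def generic_getitem_has_func() -> bool:
    """Report whether Generic.__class_getitem__ exposes a __func__ attribute."""
    return hasattr(Generic.__class_getitem__, "__func__")


print("Python", ".".join(map(str, sys.version_info[:3])))
print("Generic.__class_getitem__ has __func__:", generic_getitem_has_func())
```

If this prints False, code that relies on that attribute, such as the any_array.py line in the traceback, would be expected to raise the same AttributeError on that interpreter. Whether a different Python or docarray version avoids it is not established in this thread.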