langchain: Pinecone retriever throwing: KeyError: 'text'

My query code is below:

pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY'),  # app.pinecone.io
    environment=os.environ.get('PINECONE_ENV')  # next to API key in console
)
index = pinecone.Index(index_name)
embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get('OPENAI_API_KEY'))

vectordb = Pinecone(
    index=index,
    embedding_function=embeddings.embed_query,
    text_key="text",
)
llm=ChatOpenAI(
    openai_api_key=os.environ.get('OPENAI_API_KEY'),
    temperature=0,
    model_name='gpt-3.5-turbo'
)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever()
)
tools = [Tool(
    func=retriever.run,
    description=tool_desc,
    name='Product DB'
)]
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",  # important to align with agent prompt (below)
    k=5,
    return_messages=True
)
agent = initialize_agent(
    agent='chat-conversational-react-description', 
    tools=tools, 
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method="generate",
    memory=memory,
)

If I run: agent({'chat_history':[], 'input':'What is a product?'})

It throws:

File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\vectorstores\pinecone.py", line 160, in similarity_search
    text = metadata.pop(self._text_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'text'

This is the offending block in site-packages/langchain/vectorstores/pinecone.py:

for res in results["matches"]:
    metadata = res["metadata"]
    text = metadata.pop(self._text_key)  # raises KeyError when the key is missing
    docs.append(Document(page_content=text, metadata=metadata))
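The failure can be reproduced without Pinecone at all: LangChain pops `self._text_key` (by default "text") from each match's metadata, so any record that was upserted without that metadata key raises KeyError. A minimal illustration (the metadata dicts are made up):

```python
# Record upserted WITHOUT the chunk text in metadata -> similarity_search fails.
match_metadata = {"source": "catalog.pdf"}

try:
    match_metadata.pop("text")              # what similarity_search does internally
except KeyError as exc:
    print(f"KeyError: {exc}")               # KeyError: 'text'

# Record upserted WITH the chunk text under the expected key -> same call succeeds.
good_metadata = {"source": "catalog.pdf", "text": "A product is an item for sale."}
print(good_metadata.pop("text"))            # A product is an item for sale.
```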

If I remove my tool like the line below, everything executes (just not my tool):

tools = []

Can anyone help me fix this KeyError: 'text' issue? My versions of langchain, pinecone-client and python are 0.0.147, 2.2.1 and 3.11.3 respectively.

About this issue

  • Original URL: https://github.com/hwchase17/langchain/issues/3460
  • State: closed
  • Created a year ago
  • Comments: 15

Most upvoted comments

Pinecone doesn’t store documents explicitly; it stores only ids, embeddings, and metadata. So if you’d like access to the document text when querying Pinecone, you need to add that text to the metadata yourself, as illustrated here. If you add it under a text key in the metadata, the KeyError should resolve.
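A sketch of what that upsert looks like with pinecone-client 2.2.x. The chunk strings, ids, and embedding values here are illustrative; in practice the vectors would come from `embeddings.embed_query` (or `embed_documents`), and the upsert line is commented out because it needs a live index:

```python
chunks = [
    "A product is an item offered for sale.",
    "Products have a name, price, and SKU.",
]

vectors = [
    (
        f"chunk-{i}",                        # id
        [0.1, 0.2, 0.3],                     # embedding (illustrative values)
        {"text": chunk, "source": "catalog"}  # chunk text stored under "text"
    )
    for i, chunk in enumerate(chunks)
]

# index.upsert(vectors=vectors)             # pinecone-client 2.2.x upsert call

# Every match now carries the key that LangChain's default text_key expects:
for _id, _vec, meta in vectors:
    assert "text" in meta
print(vectors[0][2]["text"])
```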

Alternatively, you can specify your own key name when you initialize the Pinecone vectorstore, via the text_key parameter. Code reference is here.
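Whatever name you choose, it has to match on both sides. A sketch, assuming a hypothetical custom key named "content" (the Pinecone constructor call is shown as a comment since it needs a live index):

```python
CUSTOM_KEY = "content"

# The record as upserted: the chunk text lives under the custom key.
record = {
    "id": "doc-1",
    "values": [0.1, 0.2, 0.3],   # embedding (illustrative)
    "metadata": {CUSTOM_KEY: "Product descriptions live here.", "source": "catalog"},
}

# At query time, text_key must name the same metadata field:
# vectordb = Pinecone(index=index,
#                     embedding_function=embeddings.embed_query,
#                     text_key=CUSTOM_KEY)

# LangChain's similarity_search then does, in effect:
text = record["metadata"].pop(CUSTOM_KEY)   # succeeds because the keys match
print(text)
```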

Your alternative is to store the document text elsewhere and then store a reference / URL to it in the pinecone metadata. This is what I do with Tana notes, for example. Pinecone is the search index - Tana is the content repository.

On Thu, Jun 15, 2023 at 3:13 PM Peak Prosperity wrote:

This explains a lot - thank you. However, wouldn’t storing the entire text of a document in addition to the vector drastically increase the storage requirements? I believe Pinecone also suggests avoiding pieces of metadata that are very unique, which this certainly would be.

Is it not possible to use the vector data itself as an input to ChatGPT for answering questions on the data? From above, it sounds like you need to do as follows:

  1. Embed the text and store both the vector and original text at Pinecone.
  2. Perform a similarity search on the Pinecone index to find relevant documents.
  3. Take text from Pinecone metadata and feed that back into the LLM for QA.
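Yes, that three-step flow is exactly how retrieval-augmented QA works: the vectors are only for similarity lookup, and the LLM still needs the raw text. A sketch of the steps above (the LangChain calls are shown as comments since they need live API keys; the retrieved matches are mocked):

```python
# 1. Embed and store. Pinecone.from_texts copies each chunk into metadata
#    under text_key, which is what prevents the KeyError later.
# vectordb = Pinecone.from_texts(chunks, embeddings,
#                                index_name=index_name, text_key="text")

# 2. Similarity search returns the top-k matches with their metadata.
# matches = vectordb.similarity_search("What is a product?", k=2)

# 3. Take the text back out of the metadata and "stuff" it into the prompt.
retrieved = [  # mocked matches, shaped like Pinecone query results
    {"metadata": {"text": "A product is an item offered for sale."}},
    {"metadata": {"text": "Products have a name, price, and SKU."}},
]
context = "\n\n".join(m["metadata"]["text"] for m in retrieved)
prompt = (
    "Answer using only this context:\n"
    f"{context}\n\n"
    "Question: What is a product?"
)
print(prompt)
```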

Appreciate any additional ideas. Thanks.


+1 here. Subscribing.

Same problem here! I’m using from_documents.