langchain: ValueError: Expected metadata value to be a str, int, or float, got [{'text': 'Git', 'url': '#git'}] which is a <class 'list'> when storing into Chroma vector stores using element mode of UnstructuredMarkdownLoader

System Info

LangChain: 0.0.248
Python: 3.9.17
OS version: Linux 6.1.27-43.48.amzn2023.x86_64

Who can help?

I will submit a PR for a solution to this problem

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Code:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import CharacterTextSplitter

def testElement():
    loader = UnstructuredMarkdownLoader(
        "filepath", mode="elements")
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    split_docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings()
    docsearch = Chroma.from_documents(split_docs, embeddings)

The markdown file being loaded also needs to contain a link, for example:

- [Google Developer Documentation Style Guide](https://developers.google.com/style)

Error Message:

    138         # isinstance(True, int) evaluates to True, so we need to check for bools separately
    139         if not isinstance(value, (str, int, float)) or isinstance(value, bool):
--> 140             raise ValueError(
    141                 f"Expected metadata value to be a str, int, or float, got {value} which is a {type(value)}"
    142             )
ValueError: Expected metadata value to be a str, int, or float, got [{'text': 'Git', 'url': '#git'}] which is a <class 'list'>

Expected behavior

I expect the split documents to be loaded into Chroma. Instead, this raises an error because the metadata fails the type check.
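
To find which metadata entries would trip this check before calling Chroma, the metadata dictionaries can be scanned up front. This is a minimal sketch that mirrors the condition quoted in the error message above (str, int, or float, with bool rejected). `find_unsupported_metadata` is a hypothetical helper, not part of LangChain or Chroma, and the `links` key in the sample is an assumption about where the list value lives. It works on plain dicts such as each document's `metadata`.

```python
from typing import Any, Dict, List, Tuple

# Types accepted by the check quoted in the traceback above.
ALLOWED_TYPES = (str, int, float)


def find_unsupported_metadata(
    metadatas: List[Dict[str, Any]],
) -> List[Tuple[int, str, str]]:
    """Return (index, key, type name) for every value the check would reject."""
    problems = []
    for index, metadata in enumerate(metadatas):
        for key, value in metadata.items():
            # Same condition as the quoted code: bool is a subclass of int,
            # so it has to be tested separately.
            if not isinstance(value, ALLOWED_TYPES) or isinstance(value, bool):
                problems.append((index, key, type(value).__name__))
    return problems


# Hypothetical metadata, shaped like the value in the error message.
sample = [
    {"source": "filepath", "links": [{"text": "Git", "url": "#git"}]},
    {"source": "filepath", "page_number": 1},
]
print(find_unsupported_metadata(sample))  # [(0, 'links', 'list')]
```

With real documents, the input would be something like `[doc.metadata for doc in split_docs]`.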

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 17

Most upvoted comments

Something like this should do it:

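from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.vectorstores import utils as chromautils

filename = "filepath"  # placeholder: path to the markdown file
embeddings = OpenAIEmbeddings()

loader = UnstructuredMarkdownLoader(filename, mode="elements")
docs = loader.load()
# Drop metadata values that are not str, int, float, or bool (e.g. lists, dicts)
docs = chromautils.filter_complex_metadata(docs)
db = Chroma.from_documents(docs, embeddings, persist_directory="./db")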

Have you found an alternative to Chroma in that sense? I’m looking for ways of passing in a list of perturbed questions as metadata but as you stated Chroma doesn’t support this type. Any workarounds?

For the issue I mentioned, my current approach is to flatten or restructure the list: I changed the (str -> list) entry into several (str -> str) pairs. Here is my PR if you would like to check it: #8605

Note that this may not be a finalized solution and is only specific to the case I mentioned. I hope that provides some insight or has a connection to your problem. If you have any comments or feedback, please share your thoughts.
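
The flattening idea can be sketched on a plain dict as follows. This is only an illustration under assumptions, not necessarily what PR #8605 does: `flatten_list_metadata` and the `key_index_subkey` naming scheme are hypothetical, and the `links` key is assumed from the error message.

```python
from typing import Any, Dict


def flatten_list_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Expand list-of-dict values into indexed string keys; copy everything else."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        is_list_of_dicts = isinstance(value, list) and all(
            isinstance(item, dict) for item in value
        )
        if not is_list_of_dicts:
            # Scalars pass through unchanged. Other complex values (for
            # example a list of strings) also pass through and would still
            # need separate handling.
            flat[key] = value
            continue
        for index, item in enumerate(value):
            for sub_key, sub_value in item.items():
                # e.g. "links_0_text" -> "Git"; values are stringified.
                flat[f"{key}_{index}_{sub_key}"] = str(sub_value)
    return flat


metadata = {"source": "filepath", "links": [{"text": "Git", "url": "#git"}]}
print(flatten_list_metadata(metadata))
# {'source': 'filepath', 'links_0_text': 'Git', 'links_0_url': '#git'}
```

Flattening keeps the link information available for filtering, at the cost of a variable set of metadata keys per document.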

Seems logical, thank you for this. I actually just switched from Chroma to FAISS, which supports lists of strings. Much simpler 😄

Hello everyone, thank you for this amazing tool and community! 🚀

In case it helps with similar problems: I encountered almost the same issue with UnstructuredPDFLoader when using mode="elements" instead of the default "single". Here is my code and the error message:

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

loader = UnstructuredPDFLoader(PDF_FILE_PATH, mode="elements")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)
all_splits = text_splitter.split_documents(data)
vectorstore = Chroma.from_documents(
    documents=all_splits, embedding=HuggingFaceEmbeddings()
)

Traceback (most recent call last):
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/4__test_langchain_pdf_breakdown.py", line 69, in <module>
    vectorstore = Chroma.from_documents(
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 603, in from_documents
    return cls.from_texts(
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 567, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 208, in add_texts
    self._collection.upsert(
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 294, in upsert
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 349, in _validate_embedding_set
    validate_metadatas(maybe_cast_one_to_many(metadatas))
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/chromadb/api/types.py", line 172, in validate_metadatas
    validate_metadata(metadata)
  File "/home/hugo/Documents/Missions/POC_GEN_AI/langchain_discovery/.venv_3/lib/python3.10/site-packages/chromadb/api/types.py", line 140, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, float or bool, got {'points': ((70.86, 56.71268726919993), (70.86, 349.85999999999996), (521.45003121198, 349.85999999999996), (521.45003121198, 56.71268726919993)), 'system': 'PixelSpace', 'layout_width': 595.32, 'layout_height': 841.92} which is a <class 'dict'>
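
For dict-valued metadata like the coordinates entry in the traceback above, one option is to drop unsupported values before they reach Chroma. The sketch below works on a plain dict and uses the type list from this error message (str, int, float, or bool). `drop_unsupported_values` is a hypothetical helper, not LangChain's or Chroma's implementation, and the sample keys are assumptions.

```python
from typing import Any, Dict, Tuple


def drop_unsupported_values(
    metadata: Dict[str, Any],
    allowed_types: Tuple[type, ...] = (str, int, float, bool),
) -> Dict[str, Any]:
    """Return a copy of metadata that keeps only values of the allowed types."""
    return {
        key: value
        for key, value in metadata.items()
        if isinstance(value, allowed_types)
    }


# Hypothetical metadata with a nested dict, similar to the rejected value above.
metadata = {
    "source": "example.pdf",
    "page_number": 1,
    "coordinates": {"system": "PixelSpace", "layout_width": 595.32},
}
print(drop_unsupported_values(metadata))
# {'source': 'example.pdf', 'page_number': 1}
```

Dropping values loses the discarded information, so this suits metadata that is not needed for retrieval or filtering.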

The PDF that I worked on can be found here

Thanks for sharing that! It seems that unsupported metadata value types are a common problem when storing documents in Chroma. Besides my initial solution, I’m considering whether a filter might be a better and more general way to solve this issue, similar to PR #9015.