langchain: HuggingFace Data Loader fails when context is not str

System Info

langchain 0.0.285, Python 3.11.4

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Try to load https://huggingface.co/datasets/hotpot_qa/viewer/fullwiki/validation

from langchain.document_loaders import HuggingFaceDatasetLoader

dataset_name = "hotpot_qa"
page_content_column = "context"
name = "fullwiki"

loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)
docs = loader.load()


ValidationError                           Traceback (most recent call last)
/Users/deanchanter/Documents/GitHub/comma-chameleons/hello_doc_read.ipynb Cell 1 line 8
      4 name = "fullwiki"
      7 loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)
----> 8 docs = loader.load()

File ~/Documents/GitHub/comma-chameleons/env/lib/python3.11/site-packages/langchain/document_loaders/hugging_face_dataset.py:87, in HuggingFaceDatasetLoader.load(self)
     85 def load(self) -> List[Document]:
     86     """Load documents."""
---> 87     return list(self.lazy_load())

File ~/Documents/GitHub/comma-chameleons/env/lib/python3.11/site-packages/langchain/document_loaders/hugging_face_dataset.py:76, in HuggingFaceDatasetLoader.lazy_load(self)
     59     raise ImportError(
     60         "Could not import datasets python package. "
     61         "Please install it with pip install datasets."
     62     )
     64 dataset = load_dataset(
     65     path=self.path,
     66     name=self.name,
   (...)
     73     num_proc=self.num_proc,
     74 )
---> 76 yield from (
     77     Document(
     78         page_content=row.pop(self.page_content_column),
     79         metadata=row,
     80     )
     81     for key in dataset.keys()
     82     for row in dataset[key]
     83 )

File ~/Documents/GitHub/comma-chameleons/env/lib/python3.11/site-packages/langchain/document_loaders/hugging_face_dataset.py:77, in <genexpr>(.0)
     59     raise ImportError(
     60         "Could not import datasets python package. "
     61         "Please install it with pip install datasets."
     62     )
     64 dataset = load_dataset(
     65     path=self.path,
     66     name=self.name,
   (...)
     73     num_proc=self.num_proc,
     74 )
     76 yield from (
---> 77     Document(
     78         page_content=row.pop(self.page_content_column),
     79         metadata=row,
     80     )
     81     for key in dataset.keys()
     82     for row in dataset[key]
     83 )

File ~/Documents/GitHub/comma-chameleons/env/lib/python3.11/site-packages/langchain/load/serializable.py:75, in Serializable.__init__(self, **kwargs)
     74 def __init__(self, **kwargs: Any) -> None:
---> 75     super().__init__(**kwargs)
     76     self._lc_kwargs = kwargs

File ~/Documents/GitHub/comma-chameleons/env/lib/python3.11/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Document
page_content
  str type expected (type=type_error.str)
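The last frame shows the value popped from the page_content_column being rejected because Document expects page_content to be a str. The sketch below mimics that check without any network access. The row literal is made up for illustration (the assumption is that a hotpot_qa row stores context as a nested structure rather than a string), and check_page_content is a hypothetical stand-in, not a langchain or pydantic function.

```python
# Illustrative row only: the exact shape of a real hotpot_qa row is an
# assumption; what matters is that "context" is not a plain string.
row = {
    "question": "An example question?",
    "context": {
        "title": ["Page A", "Page B"],
        "sentences": [["First sentence."], ["Second sentence."]],
    },
}


def check_page_content(value):
    """Hypothetical stand-in for the str check applied to page_content."""
    if not isinstance(value, str):
        raise TypeError(
            f"page_content must be str, got {type(value).__name__}"
        )
    return value


try:
    check_page_content(row["context"])
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

A column that already holds strings, such as "question" in this made-up row, would pass the same check unchanged.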

Expected behavior

Either extend the class to handle more types of data or update the docs. I'm willing to do a PR to extend it if you're open to that.
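One possible shape for the "extend the class" option is to serialize non-string values before they reach Document. This is only a sketch under stated assumptions: to_page_content and rows_to_pairs are hypothetical names, JSON is just one reasonable text encoding, and nothing here is taken from the langchain code base.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def to_page_content(value: Any) -> str:
    """Return strings unchanged; serialize anything else to JSON text."""
    if isinstance(value, str):
        return value
    # default=str is a fallback for values json cannot encode natively.
    return json.dumps(value, ensure_ascii=False, default=str)


def rows_to_pairs(
    rows: Iterable[Dict[str, Any]], page_content_column: str
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (page_content, metadata) pairs from dataset-like rows."""
    for row in rows:
        metadata = dict(row)  # copy so the caller's row is not mutated
        content = to_page_content(metadata.pop(page_content_column))
        yield content, metadata


# Made-up rows: one string column value and one nested value.
rows = [
    {"id": "a", "context": "plain text"},
    {"id": "b", "context": {"title": ["T"], "sentences": [["S."]]}},
]
pairs = list(rows_to_pairs(rows, "context"))
```

Each pair could then be passed on as Document(page_content=content, metadata=metadata). Until something like this exists in the loader, picking a page_content_column whose values are already strings avoids the error.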

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

🤖

That’s great to hear that you’re planning to contribute to LangChain by opening a documentation bug and creating a PR! It’s an excellent way to contribute to the project.

To address the issue you encountered with the HuggingFace Data Loader: the error message indicates that the value in the context column, which the loader passes as page_content, must be of type str. You can start by updating the documentation to clarify this requirement.

Additionally, if you have a proposed fix for this issue, you can include it in your PR. It’s also a good idea to include any relevant tests to ensure the behavior is correct.

Regarding adding support for split and length, extending the JSON loader method sounds like a reasonable approach. You can refer to the existing code for guidance and adapt it to handle the split and length operations.

Once you have made the necessary changes and created the PR, the maintainers will review it and provide feedback. Thank you for your contribution and let me know if you need any further assistance or guidance!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

🤖

Thank you for reaching out and providing the details of the issue you encountered with the HuggingFace Data Loader in LangChain.

It seems that the error is caused when the context is not of type str. To address this issue, you have proposed two potential solutions: extending the class to handle more types of data or updating the documentation.

Both of these solutions sound reasonable and would be valuable contributions to LangChain. I encourage you to create a pull request with your proposed solution. The maintainers will review your changes and decide if they should be merged into the project.

Thank you for your willingness to contribute to LangChain! Let me know if you have any further questions or need any assistance with creating the pull request.

