llama_index: [Bug]: SimpleWebPageReader metadata cannot be None
Bug Description
SimpleWebPageReader passes None as the document metadata if metadata_fn is not provided, which leads to a pydantic validation error.
Version
0.8.15
Steps to Reproduce
Use SimpleWebPageReader without specifying metadata_fn:
from llama_index import SimpleWebPageReader
documents = SimpleWebPageReader(
    html_to_text=True
).load_data(["http://paulgraham.com/worked.html"])
Relevant Logs/Tracebacks
/opt/homebrew/Caskroom/miniconda/base/envs/py38_trulens/lib/python3.8/site-packages/llama_index/readers/web.py:65 in load_data

  62             if self._metadata_fn is not None:
  63                 metadata = self._metadata_fn(url)
  64
❱ 65             documents.append(Document(text=response, metadata=metadata))
  66
  67         return documents
  68

in pydantic.main.BaseModel.__init__:341

ValidationError: 1 validation error for Document
extra_info
  none is not an allowed value (type=type_error.none.not_allowed)
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 17 (7 by maintainers)
🤖
To create a pull request on GitHub, you’ll need to follow these steps:
Fork the repository: Go to the repository’s page on GitHub and click on the “Fork” button in the top-right corner. This will create a copy of the repository under your GitHub account.
Clone the forked repository: On your local machine, open a terminal and use the git clone command to clone the forked repository.
Create a new branch: Use the git checkout command to create and switch to a new branch for your changes.
Make the necessary changes: Use your preferred code editor to make the required changes to the codebase.
Commit your changes: Once you’ve made the necessary changes, stage them with git add and commit them with git commit.
Push your changes: Push your changes to your forked repository using the git push command.
Create a pull request: Go to the original repository’s page on GitHub and click on the “New pull request” button. Select your forked repository and the branch you just pushed. Provide a descriptive title and description for your pull request, and then click on the “Create pull request” button.
That’s it! Your pull request will be created, and the repository maintainers will be notified. They will review your changes and provide feedback or merge them into the main codebase if everything looks good.
Let me know if you have any further questions or need additional assistance!
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
🤖
Sure, I’d be happy to guide you through the process of opening a pull request for your modifications to the SimpleWebPageReader class in the LlamaIndex project. Here are the steps:
First, fork the LlamaIndex repository. You can do this by visiting the LlamaIndex GitHub page and clicking on the “Fork” button at the top right corner.
Once you’ve forked the repository, clone your fork to your local machine with git clone, using your actual GitHub username in place of <your_username> in the repository URL.
Navigate into the cloned repository directory.
Create a new branch for your modifications with git checkout -b, for example a branch named prevent-pydantic-error. You can use any name that you prefer for your branch.
Now, you can make your modifications to the SimpleWebPageReader class.
After making your modifications, stage your changes with git add.
Commit your changes with git commit.
Push your changes to your forked repository with git push.
Now, you can open a pull request. Go back to the LlamaIndex GitHub page and click on “New pull request”.
Click on “compare across forks” and select your forked repository and the branch you just pushed.
Click on “Create pull request” and fill in the necessary details about your modifications.
That’s it! You’ve now opened a pull request for your modifications to the SimpleWebPageReader class in the LlamaIndex project. The maintainers of the project will review your pull request and merge it if everything looks good.
For more detailed instructions, you can refer to the official GitHub documentation on creating a pull request from a fork.
🤖
Hello,
Thank you for bringing this issue to our attention. I understand that you’re encountering a pydantic validation error when using the SimpleWebPageReader without specifying a metadata_fn.
To prevent this issue, we could modify the SimpleWebPageReader class to include a default metadata function that returns an empty dictionary when no metadata_fn is provided. This would ensure that the metadata attribute of the Document object is always a dictionary, even if it’s empty, preventing the pydantic validation error.
Please note that this is a suggested modification and might not be the final solution. We will look into this issue further and consider implementing a fix in the next version of LlamaIndex.
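The default-function idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual llama_index source: Document and WebPageReaderSketch are hypothetical stand-ins defined here so the example runs without the library or network access, and the stand-in Document simply rejects None metadata to mimic the reported validation error.

```python
from typing import Callable, Dict, List, Optional


class Document:
    """Stand-in for llama_index's Document (assumption: it rejects None metadata)."""

    def __init__(self, text: str, metadata: Dict) -> None:
        if metadata is None:
            raise ValueError("none is not an allowed value")
        self.text = text
        self.metadata = metadata


class WebPageReaderSketch:
    """Hypothetical, simplified reader; not the real SimpleWebPageReader."""

    def __init__(self, metadata_fn: Optional[Callable[[str], Dict]] = None) -> None:
        # Fall back to a function returning an empty dict when none is supplied,
        # so load_data never hands None to Document.
        self._metadata_fn = metadata_fn if metadata_fn is not None else (lambda url: {})

    def load_data(self, urls: List[str]) -> List[Document]:
        documents = []
        for url in urls:
            response = f"<contents of {url}>"  # placeholder instead of a real HTTP request
            documents.append(Document(text=response, metadata=self._metadata_fn(url)))
        return documents


docs = WebPageReaderSketch().load_data(["http://paulgraham.com/worked.html"])
```

With this fallback in place, every document gets a dictionary as metadata even when the caller supplies no metadata_fn.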
Thank you for your patience and for using LlamaIndex.
Best, Dosu
🤖
It’s great to hear that you have made modifications to the SimpleWebPageReader class in the LlamaIndex project! Your proposed solution to prevent the pydantic validation error by setting a default metadata_fn and adjusting the lambda function in the __init__ method sounds like a good approach.
Thank you for taking the initiative to make these changes. I encourage you to go ahead and open a pull request with your modifications. The maintainers will review your changes and decide if they should be merged into the LlamaIndex project.
Keep up the good work! Let me know if you need any further assistance.
🤖
Thank you for your contribution to the LlamaIndex project. I appreciate your effort in improving the SimpleWebPageReader class.
Based on the context provided, the changes between the original and your modified SimpleWebPageReader class include setting a default metadata_fn that returns a dictionary with the URL as metadata, and adjusting the lambda function in the __init__ method to return a dictionary with the key “url” and the URL as the value.
This modification should help prevent a pydantic validation error when metadata_fn is not provided, which is a great improvement.
Please make sure to test your changes thoroughly and update any relevant documentation. Once you’ve done that, feel free to submit a pull request. We appreciate your contribution to the LlamaIndex project!
🤖
Hello,
You’re on the right track with your modification. To change the “URL” key to lowercase “url”, you can simply adjust the lambda function in your __init__ method so that, if metadata_fn is not provided, it returns a dictionary with the key “url” (in lowercase) and the URL as the value.
I hope this helps! Let me know if you have any other questions.
🤖
Hello,
Thank you for bringing this issue to our attention. Your suggestion to modify the SimpleWebPageReader class to use a default metadata_fn that returns a dictionary with the URL as metadata when metadata_fn is not provided is indeed a good solution to prevent the pydantic validation error.
The SimpleWebPageReader class can be modified in two places. In the __init__ method, add a lambda function as the default value for metadata_fn that returns a dictionary with the URL as metadata; this function is used when metadata_fn is not provided. In the load_data method, remove the check for metadata_fn being None and call metadata_fn directly to get the metadata. This prevents the pydantic validation error that occurs when creating a Document object with None as metadata.
Please note that this modification is not yet part of the official LlamaIndex repository and you would need to make these changes in your local copy of the code. We will consider incorporating this change in future versions of LlamaIndex.
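As a rough sketch of the two changes described above (a default lambda in __init__ and no None check when loading), assuming a simplified reader: UrlMetadataReaderSketch and load_metadata are hypothetical names invented for illustration, and the example only computes the metadata dictionaries rather than fetching pages or building real Document objects.

```python
from typing import Callable, Dict, List


class UrlMetadataReaderSketch:
    """Hypothetical stand-in; not the actual SimpleWebPageReader implementation."""

    def __init__(
        self,
        html_to_text: bool = False,
        # Default callable: every URL maps to {"url": <that URL>}.
        metadata_fn: Callable[[str], Dict] = lambda url: {"url": url},
    ) -> None:
        self._html_to_text = html_to_text
        self._metadata_fn = metadata_fn

    def load_metadata(self, urls: List[str]) -> List[Dict]:
        # No `is not None` check here: the default guarantees a callable.
        return [self._metadata_fn(url) for url in urls]


reader = UrlMetadataReaderSketch(html_to_text=True)
metadata = reader.load_metadata(["http://paulgraham.com/worked.html"])
```

One trade-off of dropping the None check: a caller who explicitly passes metadata_fn=None would now fail with a TypeError when the function is called, so a real fix might prefer to keep accepting None and substitute the default internally.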
I hope this helps! If you have any other questions or issues, feel free to ask.
Best regards, Dosu
🤖
Hello,
Thank you for bringing this issue to our attention. The error you’re encountering is due to the metadata_fn parameter not being specified when creating the SimpleWebPageReader instance. This parameter is an optional function that takes in a URL and returns a dictionary of metadata. When metadata_fn is not specified, it defaults to None, so the metadata passed to Document is also None, which causes the pydantic validation error you’re seeing.
To work around this issue, you can provide a metadata_fn when creating the SimpleWebPageReader instance. This function should take a URL as input and return a dictionary with the URL as metadata. For example, passing metadata_fn=lambda url: {"url": url} provides a function that takes a URL and returns it as metadata. This should prevent the pydantic validation error from occurring.
I hope this helps! If you have any other questions or run into any other issues, please don’t hesitate to ask.
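The call-site workaround might look like the sketch below. The helper name url_metadata is an assumption made for illustration; the commented usage reuses the constructor arguments and URL from the report and needs llama_index installed plus network access, so it is left as comments here.

```python
from typing import Dict


def url_metadata(url: str) -> Dict[str, str]:
    """Return the URL itself as metadata (named equivalent of the lambda above)."""
    return {"url": url}


# Usage against the version in the report (not executed here):
# from llama_index import SimpleWebPageReader
# documents = SimpleWebPageReader(
#     html_to_text=True,
#     metadata_fn=url_metadata,
# ).load_data(["http://paulgraham.com/worked.html"])
```

A named function is used instead of a lambda only so it can be reused and tested on its own; the behavior is the same.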
Best, Dosu
@piotrm0 Ah I see the initial bug anyways, I can submit a bugfix for this