llama_index: [Bug]: SimpleWebPageReader metadata cannot be None

Bug Description

SimpleWebPageReader passes None as metadata when metadata_fn is not provided, which leads to a pydantic validation error.

Version

0.8.15

Steps to Reproduce

Use SimpleWebPageReader without specifying metadata_fn:

from llama_index import SimpleWebPageReader
documents = SimpleWebPageReader(
    html_to_text=True
).load_data(["http://paulgraham.com/worked.html"])

Relevant Logs/Tracebacks

│ /opt/homebrew/Caskroom/miniconda/base/envs/py38_trulens/lib/python3.8/site-packages/llama_index/ │
│ readers/web.py:65 in load_data                                                                   │
│                                                                                                  │
│    62 │   │   │   if self._metadata_fn is not None:                                              │
│    63 │   │   │   │   metadata = self._metadata_fn(url)                                          │
│    64 │   │   │                                                                                  │
│ ❱  65 │   │   │   documents.append(Document(text=response, metadata=metadata))                   │
│    66 │   │                                                                                      │
│    67 │   │   return documents                                                                   │
│    68                                                                                            │
│                                                                                                  │
│ in pydantic.main.BaseModel.__init__:341                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationError: 1 validation error for Document
extra_info
  none is not an allowed value (type=type_error.none.not_allowed)
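
The failing path can be illustrated without llama_index installed. The sketch below is a stand-in, not the library's code: FakeDocument and load_without_guard are hypothetical names, and the TypeError only plays the role of the pydantic ValidationError shown above. It assumes, as the bug description implies, that metadata starts out as None before the check at line 62 of web.py.

```python
from typing import Callable, Dict, Optional


class FakeDocument:
    """Hypothetical stand-in for Document: rejects metadata=None."""

    def __init__(self, text: str, metadata: Optional[Dict]) -> None:
        if metadata is None:
            # Plays the role of pydantic's "none is not an allowed value".
            raise TypeError("metadata: none is not an allowed value")
        self.text = text
        self.metadata = metadata


def load_without_guard(
    url: str, metadata_fn: Optional[Callable[[str], Dict]] = None
) -> FakeDocument:
    # Assumed shape of the failing branch: metadata starts as None and is
    # only replaced when a metadata_fn was supplied.
    metadata = None
    if metadata_fn is not None:
        metadata = metadata_fn(url)
    return FakeDocument(text="<page body>", metadata=metadata)


# Without a metadata_fn the stand-in raises, mirroring the reported error.
try:
    load_without_guard("http://paulgraham.com/worked.html")
    failed = False
except TypeError:
    failed = True
```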

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

🤖

To create a pull request on GitHub, you’ll need to follow these steps:

  1. Fork the repository: Go to the repository’s page on GitHub and click on the “Fork” button in the top-right corner. This will create a copy of the repository under your GitHub account.

  2. Clone the forked repository: On your local machine, open a terminal and use the git clone command to clone the forked repository. For example:

    git clone https://github.com/your-username/repository-name.git
    
  3. Create a new branch: Use the git checkout command to create and switch to a new branch for your changes. For example:

    git checkout -b my-branch-name
    
  4. Make the necessary changes: Use your preferred code editor to make the required changes to the codebase.

  5. Commit your changes: Once you’ve made the necessary changes, use the following commands to commit your changes:

    git add .
    git commit -m "Your commit message"
    
  6. Push your changes: Push your changes to your forked repository using the git push command:

    git push origin my-branch-name
    
  7. Create a pull request: Go to the original repository’s page on GitHub and click on the “New pull request” button. Select your forked repository and the branch you just pushed. Provide a descriptive title and description for your pull request, and then click on the “Create pull request” button.

That’s it! Your pull request will be created, and the repository maintainers will be notified. They will review your changes and provide feedback or merge them into the main codebase if everything looks good.

Let me know if you have any further questions or need additional assistance!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

🤖

Sure, I’d be happy to guide you through the process of opening a pull request for your modifications to the SimpleWebPageReader class in the LlamaIndex project. Here are the steps:

  1. First, you need to fork the LlamaIndex repository. You can do this by visiting the LlamaIndex GitHub page and clicking on the “Fork” button at the top right corner.

  2. Once you’ve forked the repository, you’ll need to clone it to your local machine. Open your terminal and run the following command:

    git clone https://github.com/<your_username>/llama_index.git
    

    Make sure to replace <your_username> with your actual GitHub username.

  3. Navigate to the cloned repository by running:

    cd llama_index
    
  4. Create a new branch for your modifications. You can do this by running:

    git checkout -b prevent-pydantic-error
    

    You can replace prevent-pydantic-error with any name that you prefer for your branch.

  5. Now, you can make your modifications to the SimpleWebPageReader class.

  6. After making your modifications, you’ll need to stage your changes. You can do this by running:

    git add .
    
  7. Commit your changes by running:

    git commit -m "Prevent pydantic validation error in SimpleWebPageReader"
    
  8. Push your changes to your forked repository by running:

    git push origin prevent-pydantic-error
    
  9. Now, you can open a pull request. Go back to the LlamaIndex GitHub page and click on “New pull request”.

  10. Click on “compare across forks” and select your forked repository and the branch you just pushed.

  11. Click on “Create pull request” and fill in the necessary details about your modifications.

That’s it! You’ve now opened a pull request for your modifications to the SimpleWebPageReader class in the LlamaIndex project. The maintainers of the project will review your pull request and merge it if everything looks good.

For more detailed instructions, you can refer to the official GitHub documentation on creating a pull request from a fork.


🤖

Hello,

Thank you for bringing this issue to our attention. I understand that you’re encountering a pydantic validation error when using the SimpleWebPageReader without specifying a metadata_fn.

To prevent this issue, we could modify the SimpleWebPageReader class to include a default metadata function that returns an empty dictionary when no metadata_fn is provided. This would ensure that the metadata attribute of the Document object is always a dictionary, preventing the pydantic validation error.

Here’s a potential modification to the SimpleWebPageReader class:

class SimpleWebPageReader(BaseReader):
    def __init__(
        self,
        html_to_text: bool = False,
        metadata_fn: Optional[Callable[[str], Dict]] = None,
    ) -> None:
        try:
            import html2text  # noqa: F401
        except ImportError:
            raise ImportError(
                "`html2text` package not found, please run `pip install html2text`"
            )
        self._html_to_text = html_to_text
        self._metadata_fn = metadata_fn if metadata_fn is not None else lambda url: {}

    def load_data(self, urls: List[str]) -> List[Document]:
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        for url in urls:
            response = requests.get(url, headers=None).text
            if self._html_to_text:
                import html2text
                response = html2text.html2text(response)

            metadata = self._metadata_fn(url)

            documents.append(Document(text=response, metadata=metadata))

        return documents

This modification ensures that if no metadata_fn is provided, a default function that returns an empty dictionary is used. This way, the metadata attribute of the Document object will always be a dictionary, even if it’s empty, thus preventing the pydantic validation error.
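
One caveat: the fallback covers only the case where no metadata_fn is passed. A caller-supplied function that itself returns None would presumably hit the same validation error. A minimal sketch of a wrapper that coerces such results to an empty dictionary follows; safe_metadata_fn is a hypothetical helper, not part of LlamaIndex.

```python
from typing import Callable, Dict, Optional


def safe_metadata_fn(
    metadata_fn: Optional[Callable[[str], Optional[Dict]]] = None,
) -> Callable[[str], Dict]:
    """Return a function that always yields a dict for a given URL."""

    def wrapped(url: str) -> Dict:
        if metadata_fn is None:
            # No function supplied: fall back to empty metadata.
            return {}
        result = metadata_fn(url)
        # Coerce a None result to an empty dict so validation can pass.
        return result if result is not None else {}

    return wrapped
```

In the reader, self._metadata_fn = safe_metadata_fn(metadata_fn) would then replace the conditional expression in __init__.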

Please note that this is a suggested modification and might not be the final solution. We will look into this issue further and consider implementing a fix in the next version of LlamaIndex.

Thank you for your patience and for using LlamaIndex.

Best, Dosu


🤖

It’s great to hear that you have made modifications to the SimpleWebPageReader class in the LlamaIndex project! Your proposed solution to prevent the pydantic validation error, setting a default metadata_fn via a lambda in the __init__ method, sounds like a good approach.

Thank you for taking the initiative to make these changes. I encourage you to go ahead and open a pull request with your modifications. The maintainers will review your changes and decide if they should be merged into the LlamaIndex project.

Keep up the good work! Let me know if you need any further assistance.



🤖

Thank you for your contribution to the LlamaIndex project. I appreciate your effort in improving the SimpleWebPageReader class.

Based on the context provided, here is the diff between the original and your modified SimpleWebPageReader class:

class SimpleWebPageReader(BaseReader):
    """Simple web page reader.

    Reads pages from the web.

    Args:
        html_to_text (bool): Whether to convert HTML to text.
            Requires `html2text` package.
-        metadata_fn (Optional[Callable[[str], Dict]]): A function that takes in
-            a URL and returns a dictionary of metadata.
-            Default is None.
+        metadata_fn (Optional[Callable[[str], Dict]]): A function that takes in
+            a URL and returns a dictionary of metadata.
+            Default is a function that returns a dictionary with the URL as metadata.
    """

    def __init__(
        self,
        html_to_text: bool = False,
-        metadata_fn: Optional[Callable[[str], Dict]] = None,
+        metadata_fn: Optional[Callable[[str], Dict]] = lambda url: {"url": url},
    ) -> None:
        """Initialize with parameters."""
        try:
            import html2text  # noqa: F401
        except ImportError:
            raise ImportError(
                "`html2text` package not found, please run `pip install html2text`"
            )
        self._html_to_text = html_to_text
        self._metadata_fn = metadata_fn

The change replaces the None default for metadata_fn with a lambda that returns a dictionary with the key “url” and the URL as its value, and updates the docstring to match.

This modification should help prevent a pydantic validation error when metadata_fn is not provided, which is a great improvement.
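
One thing worth checking before submitting: a default in the signature only applies when the argument is omitted. If a caller passes metadata_fn=None explicitly, self._metadata_fn is None again, and any remaining None check in load_data would leave metadata unset. The sketch below contrasts the two approaches using hypothetical stand-in classes (SignatureDefault and BodyFallback), not the real reader.

```python
from typing import Callable, Dict, Optional

MetadataFn = Callable[[str], Dict]


class SignatureDefault:
    """Default lives in the signature, as in the diff above."""

    def __init__(
        self, metadata_fn: Optional[MetadataFn] = lambda url: {"url": url}
    ) -> None:
        self._metadata_fn = metadata_fn


class BodyFallback:
    """Default is resolved in the body, so an explicit None is also covered."""

    def __init__(self, metadata_fn: Optional[MetadataFn] = None) -> None:
        self._metadata_fn = (
            metadata_fn if metadata_fn is not None else (lambda url: {"url": url})
        )


# Argument omitted: the signature default is used.
omitted = SignatureDefault()._metadata_fn
# Explicit None bypasses the signature default and is stored as-is.
explicit_none = SignatureDefault(metadata_fn=None)._metadata_fn
# The body fallback replaces an explicit None with the default lambda.
fallback = BodyFallback(metadata_fn=None)._metadata_fn
```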

Please make sure to test your changes thoroughly and update any relevant documentation. Once you’ve done that, feel free to submit a pull request. We appreciate your contribution to the LlamaIndex project!


🤖

Hello,

You’re on the right track with your modification. To change the “URL” key to lowercase “url”, you can simply adjust the lambda function in your __init__ method. Here’s the adjusted code:

class SimpleWebPageReader(BaseReader):
    def __init__(
        self,
        html_to_text: bool = False,
        metadata_fn: Optional[Callable[[str], Dict]] = None,
    ) -> None:
        try:
            import html2text  # noqa: F401
        except ImportError:
            raise ImportError(
                "`html2text` package not found, please run `pip install html2text`"
            )
        self._html_to_text = html_to_text
        self._metadata_fn = metadata_fn if metadata_fn is not None else lambda url: {"url": url}

    def load_data(self, urls: List[str]) -> List[Document]:
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        for url in urls:
            response = requests.get(url, headers=None).text
            if self._html_to_text:
                import html2text
                response = html2text.html2text(response)

            metadata = self._metadata_fn(url)
            documents.append(Document(text=response, metadata=metadata))

        return documents

In this modification, if metadata_fn is not provided, the lambda function will return a dictionary with the key “url” (in lowercase) and the value as the URL.

I hope this helps! Let me know if you have any other questions.


🤖

Hello,

Thank you for bringing this issue to our attention. Your suggestion to modify the SimpleWebPageReader class to use a default metadata_fn that returns a dictionary with the URL as metadata when metadata_fn is not provided is indeed a good solution to prevent the pydantic validation error.

Here is how you can modify the SimpleWebPageReader class:

class SimpleWebPageReader(BaseReader):
    def __init__(
        self,
        html_to_text: bool = False,
        metadata_fn: Optional[Callable[[str], Dict]] = None,
    ) -> None:
        try:
            import html2text  # noqa: F401
        except ImportError:
            raise ImportError(
                "`html2text` package not found, please run `pip install html2text`"
            )
        self._html_to_text = html_to_text
        self._metadata_fn = metadata_fn if metadata_fn is not None else lambda url: {"URL": url}

    def load_data(self, urls: List[str]) -> List[Document]:
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        for url in urls:
            response = requests.get(url, headers=None).text
            if self._html_to_text:
                import html2text
                response = html2text.html2text(response)

            metadata = self._metadata_fn(url)
            documents.append(Document(text=response, metadata=metadata))

        return documents

In the __init__ method, metadata_fn still defaults to None, but when it is None the reader falls back to a lambda that returns a dictionary with the URL as metadata. In the load_data method, I’ve removed the check for metadata_fn being None and call self._metadata_fn directly to get the metadata. This prevents the pydantic validation error that occurs when creating a Document object with None as metadata.

Please note that this modification is not yet part of the official LlamaIndex repository and you would need to make these changes in your local copy of the code. We will consider incorporating this change in the future versions of LlamaIndex.

I hope this helps! If you have any other questions or issues, feel free to ask.

Best regards, Dosu


🤖

Hello,

Thank you for bringing this issue to our attention. The error occurs because metadata_fn was not specified when creating the SimpleWebPageReader instance. This parameter is an optional function that takes in a URL and returns a dictionary of metadata. When it is not specified, it defaults to None, so load_data passes metadata=None to Document, which triggers the pydantic validation error you’re seeing.

As a workaround, you can pass a metadata_fn explicitly when creating the SimpleWebPageReader instance. This function should take a URL as input and return a dictionary of metadata, for example one containing the URL itself. Here’s an example of how you can do this:

from llama_index import SimpleWebPageReader
documents = SimpleWebPageReader(
    html_to_text=True,
    metadata_fn=lambda url: {"url": url}
).load_data(["http://paulgraham.com/worked.html"])

In this code, metadata_fn=lambda url: {"url": url} supplies a function that takes a URL and returns it as metadata, so Document never receives None. This should prevent the pydantic validation error from occurring.
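
The callable can return more than the raw URL. As a hedged illustration, the hypothetical url_metadata function below also records the host and path using the standard library's urllib.parse; the key names are arbitrary choices, not something LlamaIndex requires.

```python
from typing import Dict
from urllib.parse import urlparse


def url_metadata(url: str) -> Dict[str, str]:
    """Build a metadata dict from a URL (key names are illustrative)."""
    parsed = urlparse(url)
    return {
        "url": url,
        "host": parsed.netloc,
        "path": parsed.path,
    }


meta = url_metadata("http://paulgraham.com/worked.html")
```

It could then be passed as SimpleWebPageReader(html_to_text=True, metadata_fn=url_metadata).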

I hope this helps! If you have any other questions or run into any other issues, please don’t hesitate to ask.

Best, Dosu


@piotrm0 Ah I see the initial bug anyways, I can submit a bugfix for this