requests-cache: Unable to modify `CachedResponse` object, raises `AttributeError`

The problem

One of the hooks I run modifies the Response object.

Since the 0.7 release, any hook that attempts to modify the Response object – now returned as a CachedResponse object – raises an AttributeError.

Expected behavior

Hooks should be able to modify the CachedResponse object.

Steps to reproduce the behavior

import ftfy
import lxml.html
import requests

def parse(r, *args, **kwargs):
    if r.status_code == requests.codes.ok:
        r.html = lxml.html.document_fromstring(ftfy.fix_encoding(r.text), base_url=r.url)
        r.html.make_links_absolute()
        return r
  1. Access an uncached resource 😃
  2. Re-access the resource 😦
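
For context, a minimal sketch of how such a hook might be attached (the cache name and URL here are placeholders, not from the original report):

from requests_cache import CachedSession

session = CachedSession("demo_cache")
session.get("https://example.com", hooks={"response": parse})  # uncached: the hook runs fine
session.get("https://example.com", hooks={"response": parse})  # cached: raises AttributeError on 0.7.x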

Workarounds

None at the moment.

Environment

  • requests-cache version: 0.7.1
  • Python version: 3.9.6
  • Platform: macOS High Sierra 10.13.6

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Ok, I found a simple way to update the stored responses, but it depends on the availability of cattrs. Essentially: install cattrs so the default pickle_serializer uses it, then load and re-store each response. This converts them to basic Python types with robust default values for newly introduced attributes. Then upgrade to a newer version of this library. Python 3.6 is EOL in ~5 months, and most people on current systems (non-LTS servers) will probably be on Python 3.8 or 3.9, so a conversion script might be nice but isn't really necessary?

Code:

from requests_cache.backends.sqlite import DbPickleDict

# assumes default names from the DbCache, with db_path being the cache filename
# serializer=pickle_serializer (is the default; uses cattrs if installed)
dpd = DbPickleDict(db_path=".cache.sqlite", table_name="responses", use_temp=False, fast_save=False)
for key in dpd:
    dpd[key] = dpd[key]  # load and re-store each response to re-serialize it

The 0.7 update also added a way to make your cache more future-compatible: if you pip install cattrs, then requests-cache will use that to ‘unstructure’ responses so they will be pickled as builtin types instead of CachedResponse objects. This is not yet the default behavior and isn’t fully documented, since it requires Python 3.7+, and requests-cache still supports Python 3.6 (which will be dropped in a future release).

This is done in serializers/cattrs.py and serializers/preconf.py, if you’re curious.
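
As a rough illustration (cache name and URL are placeholders), nothing beyond installing cattrs and creating the session as usual should be needed for this:

from requests_cache import CachedSession

session = CachedSession("demo_cache", backend="sqlite")
session.get("https://example.com")  # with cattrs installed, stored as builtin types rather than pickled objects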

Also, the fix for this issue (as well as the new CachedResponse.cache_key attribute) is now in the latest stable release (0.7.2).
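
A quick sketch of checking that after upgrading (cache name and URL are placeholders):

from requests_cache import CachedSession

session = CachedSession("demo_cache")
resp = session.get("https://example.com")
resp = session.get("https://example.com")  # second call should be served from the cache
print(resp.from_cache, resp.cache_key)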

I’m going to go ahead and close this issue, but you’re welcome to create more if you have any more problems or feature requests.

It was kind of for both. Speaking as the developer, it was mostly for debugging purposes for now, e.g. checking which URLs are cached, possibly cleaning up cache entries from previous requests, …

My situation: I had to crawl/scrape a large number of URLs. Unfortunately the cache file only includes the cache key, which I could not map back to the request URL unless I stored that mapping beforehand. I intend to later check by cache key whether I have a stored response without requesting again. It is not strictly necessary, but it allows me to use the URLs to retrieve the responses without simulating my whole scraping process. (And for me, only the URL changes between requests, so my lookup is essentially 1:1.)

Below is the code I had to use to attach the URLs to my cache keys. I simulate the crawling/scraping to retrieve the cached responses, and then just compute the cache keys. (I could have done this during my actual crawling, but this came afterwards.)

import json
from os import PathLike
from pathlib import Path
from typing import Iterator
from urllib.parse import urljoin

import requests
from parsel import Selector
from requests_cache import CachedSession
from requests_cache.backends.sqlite import DbCache, DbDict

REQ_HEADERS = {
    "User-Agent": "...",
}
REQ_COOKIES = {}


# subclass cached session to retrieve the key (response here not required)
class MyCacheKeySession(CachedSession):
    def send(self, request, **kwargs):
        return self.cache.create_key(request, **kwargs)
        # NOTE: the code below failed
        # response = super().send(request, **kwargs)
        # setattr(response, "cache_key", self.cache.create_key(request, **kwargs))
        # return response


# simulate the crawling / scraping process
def iter_serieslists_urls(sess: requests.Session) -> Iterator[str]:
    url = "https://the.start.url/series/"

    while True:
        yield url

        resp = sess.get(url)
        assert getattr(resp, "from_cache", False)

        sel = Selector(resp.text)
        next_url = sel.css("div.pagination a.next_page::attr(href)").get()
        if not next_url:
            break

        url = urljoin(url, next_url)


def get_session(
    cache_file: str = ".cache.sqlite",
    allowable_codes=(200,),
    session_cls=CachedSession,
    **kwargs
) -> requests.Session:
    backend = DbCache(cache_file)
    session = session_cls(backend=backend, allowable_codes=allowable_codes)

    if "headers" in kwargs:
        headers = kwargs["headers"]
    else:
        headers = REQ_HEADERS
    if headers:
        session.headers.update(headers)

    if "cookies" in kwargs:
        cookies = kwargs["cookies"]
    else:
        cookies = REQ_COOKIES
    if cookies:
        requests.utils.add_dict_to_cookiejar(session.cookies, REQ_COOKIES)

    return session


def main():
    cache_file = ".cache.sqlite"
    session_kwargs = dict(cache_file=cache_file, allowable_codes=(200, 302))
    sess = get_session(**session_kwargs)
    fakesess = get_session(**session_kwargs, session_cls=MyCacheKeySession)
    cache_url_table = DbDict(cache_file, table_name="key2url")

    url_iter = iter_serieslists_urls(sess)  # crawling simulator, returns the request urls
    for url in url_iter:
        key = fakesess.get(url)  # to get the cache key
        cache_url_table[key] = url  # store url for key

I could solve this differently but it works for me now.

Having the cache key as an attribute would be nice, but I would make storing the URL or other metadata in the cache optional. My last crawl had 10k or more? responses and my cache file is 4 GB; the URLs alone are probably only ~5 MB, but including more metadata that is not required/used might increase the file size again. Maybe include an empty hook that allows users to subclass CachedSession to store arbitrary JSON strings or pickled data for a given cache key? On reflection, the request or response timestamp would also be interesting for me, since I “abuse” the cache as a kind of web archive; it could help with evicting entries after some time because page content changed (manual review). But with some README to keep track it is not really a major issue, and I can just move the cache file to allow updates.
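
A rough sketch of what such an optional hook could look like (not an existing requests-cache API; the class and table names here are made up, and the cache-key computation reuses the pattern from the snippet above):

import json
from datetime import datetime, timezone

from requests_cache import CachedSession
from requests_cache.backends.sqlite import DbDict


class MetadataCachedSession(CachedSession):
    """Hypothetical subclass that stores optional per-key metadata in a side table."""

    def __init__(self, *args, metadata_table=None, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. metadata_table=DbDict(".cache.sqlite", table_name="metadata")
        self.metadata_table = metadata_table

    def send(self, request, **kwargs):
        response = super().send(request, **kwargs)
        if self.metadata_table is not None:
            key = self.cache.create_key(request, **kwargs)
            # keep the stored metadata small (a JSON string) so the cache file stays compact
            self.metadata_table[key] = json.dumps({
                "url": request.url,
                "fetched_at": datetime.now(timezone.utc).isoformat(),
            })
        return response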

I’ll give this a test today, many thanks!

The slotted class would also break caching support for the requests-HTML package, since it also assigns new attributes to the Response object.