requests-cache: Unable to modify `CachedResponse` object, raises `AttributeError`
The problem
One of the hooks I run modifies the Response object.
Since the 0.7 release, any hook that attempts to modify the Response object (now returned as a CachedResponse object) raises an AttributeError.
Expected behavior
Hooks should be able to modify the CachedResponse object.
Steps to reproduce the behavior
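import ftfy
import lxml.html
import requests

def parse(r, *args, **kwargs):
    # Response hook: attach a parsed lxml document to successful responses
    if r.status_code == requests.codes.ok:
        r.html = lxml.html.document_fromstring(ftfy.fix_encoding(r.text), base_url=r.url)
        r.html.make_links_absolute()
    return r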
- Access uncached resource 😃
- Reaccess resource 😦
Workarounds
None at the moment.
Environment
- requests-cache version: 0.7.1
- Python version: 3.9.6
- Platform: macOS High Sierra 10.13.6
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (9 by maintainers)
Ok. I found a simple way to update the stored responses, but it depends on the availability of cattrs. Essentially, install cattrs so the default pickle_serializer uses it, then load and store each response. This makes them use just the basic Python types, with robust default values for newly introduced attributes. Then upgrade to a newer version of this library. Python 3.6 is EOL in ~5 months, and most people on current systems (non-LTS servers) will probably be on Python 3.8 or 3.9, so a conversion script might be nice but is not really necessary? Code:
The 0.7 update also added a way to make your cache more future-compatible: if you pip install cattrs, then requests-cache will use that to ‘unstructure’ responses so they will be pickled as builtin types instead of CachedResponse objects. This is not yet the default behavior and isn’t fully documented, since it requires Python 3.7+, and requests-cache still supports Python 3.6 (which will be dropped in a future release). This is done in serializers/cattrs.py and serializers/preconf.py, if you’re curious.
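To illustrate why pickling builtin types is more future-compatible, here is a minimal sketch using only the standard library. It swaps in dataclasses.asdict for cattrs, and ToyResponse, dumps_unstructured, and loads_structured are hypothetical names made up for this example; none of this is the requests-cache implementation.

```python
import pickle
from dataclasses import asdict, dataclass, field


@dataclass
class ToyResponse:
    """Hypothetical stand-in for a cached response object."""
    url: str
    status_code: int
    headers: dict = field(default_factory=dict)
    cache_key: str = ""  # pretend this attribute was added in a later version


def dumps_unstructured(response):
    """Pickle a plain dict of builtin types instead of the class instance."""
    return pickle.dumps(asdict(response))


def loads_structured(blob):
    """Rebuild the object; attributes missing from old data fall back to defaults."""
    data = pickle.loads(blob)
    return ToyResponse(**data)


# An "old" entry written before cache_key existed: only builtin types
old_blob = pickle.dumps({"url": "https://example.com", "status_code": 200, "headers": {}})
restored = loads_structured(old_blob)

# Load-and-store round trip: the rewritten entry now carries the new attribute
new_blob = dumps_unstructured(restored)
round_tripped = loads_structured(new_blob)
```

Because the pickle only ever contains dicts, strings, and integers, loading it does not depend on the class layout that existed when the entry was written.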
Also, the fix for this issue (as well as the new CachedResponse.cache_key attribute) is now in the latest stable release (0.7.2). I’m going to go ahead and close this issue, but you’re welcome to create more if you have any more problems or feature requests.
It was kind of for both. As the dev myself, it was mostly for debugging purposes for now, e.g. checking what URLs are cached, possibly cleaning up cache entries from previous requests, …
My situation: I had to crawl/scrape a large number of URLs. Unfortunately the cache file only includes the cache key, which I cannot map back to the request URL unless the URL is stored beforehand. I intend to later just check by cache key whether I have a stored response, without requesting again. It is not strictly necessary, but it allows me to use the URLs to retrieve the responses without simulating my whole scraping process. (And for me, only the URL changes between each request, so my lookup is essentially 1:1.)
The code I had to use to attach the URLs to my cache keys. I simulate the crawling/scraping to retrieve the cached responses, and then just compute the cache keys. (I could do this while I do my actual crawling but this came afterwards.)
I could solve this differently but it works for me now.
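The idea of keeping URLs alongside cache keys could look roughly like the sketch below. It is an illustration only: toy_cache_key is a hypothetical hash of method and URL, not the algorithm requests-cache uses, and build_url_index is likewise a made-up helper. With the real library, the key would have to come from its own key function.

```python
import hashlib
import json


def toy_cache_key(method, url):
    """Hypothetical key function; NOT the algorithm requests-cache uses."""
    return hashlib.sha256(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


def build_url_index(urls, method="GET"):
    """Map each cache key back to the URL that produced it."""
    return {toy_cache_key(method, url): url for url in urls}


urls = ["https://example.com/a", "https://example.com/b"]
index = build_url_index(urls)

# Persist the index as plain JSON (a string here; a sidecar file in practice)
serialized = json.dumps(index, indent=2)
reloaded = json.loads(serialized)

# Later: find a URL by recomputing its key, without requesting it again
key = toy_cache_key("GET", "https://example.com/a")
found = reloaded.get(key)
```

Keeping the index in a separate small file means the URLs add only a few megabytes and the cache file itself stays unchanged.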
Having the cache key as an attribute would be nice, but I would make storing the URL or other metadata in the cache optional. My last crawl had 10k or more responses and my cache file is 4 GB. The URLs alone are probably only ~5 MB, but including more metadata that is not required or used might increase the file size further. Maybe include an empty hook that allows users to subclass CachedSession to store arbitrary JSON strings or pickled data for a given cache key? On reflection, the request or response timestamp would also be interesting for me, as I “abuse” the cache as a kind of web archive. It might be useful for eviction after some time because page content changed (manual review). But with a readme to keep track it is not really a major issue; I can just move the cache file to allow updates.
I’ll give this a test today, many thanks!
The slotted class would also break caching support for the requests-HTML package; they also assign new attributes to the Response object.
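For readers unfamiliar with why a slotted class rejects new attributes, here is a minimal sketch of the general Python behavior. SlottedResponse, OpenResponse, and try_attach are hypothetical names for this example and are not the requests-cache classes.

```python
class SlottedResponse:
    """Hypothetical stand-in for a slotted response class."""
    __slots__ = ("status_code", "text")

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


class OpenResponse:
    """Hypothetical stand-in for a regular (non-slotted) class."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def try_attach(response):
    """Mimic a hook that adds an attribute; report whether it worked."""
    try:
        response.html = "<parsed document>"
    except AttributeError:
        # Slotted instances have no __dict__, so unknown names are rejected
        return False
    return True


slotted_ok = try_attach(SlottedResponse(200, "<p>hi</p>"))
open_ok = try_attach(OpenResponse(200, "<p>hi</p>"))
print(slotted_ok, open_ok)  # prints: False True
```

A class that defines __slots__ without including "__dict__" has no per-instance dictionary, so any attribute not named in the slots raises AttributeError on assignment.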