joblib: Expiry of old results

I’ve added a small extra feature that recomputes a result if the cached value is older than a certain age.

@memory.cache(expires_after=60)
def f(x):
    ...

This will recompute if the cached result is older than 60 seconds. The code is in this branch: https://github.com/fredludlow/joblib/tree/expires-after
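
A minimal sketch of the intended behaviour, assuming the expires_after keyword from the branch above (in seconds) and a throwaway cache directory:

import time
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

@memory.cache(expires_after=60)   # proposed keyword, only available in the branch above
def f(x):
    print("recomputing", x)
    return x * 2

f(3)              # computed and cached
f(3)              # served from the cache (entry is younger than 60 seconds)
time.sleep(61)
f(3)              # entry is now older than 60 seconds, so it is recomputed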

Is this of interest? And if so, is there anything else that needs doing before sending a pull request? (I’ve added test coverage in test_memory.py)

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 33
  • Comments: 26 (9 by maintainers)

Most upvoted comments

Is this feature in the pipeline for release?

Totally agree with the philosophy of keeping it as unsurprising as possible - this wouldn’t break any existing code and the default would be to behave exactly as before.

My use case is a web app which aggregates data from a bunch of components (SQL queries and remote web APIs) and returns a bundle of data to the client. A non-cached request for this project takes maybe 10 seconds. If you use an HTTP cache (requests_cache), that reduces to 5 seconds. If I put joblib caching around each component (there are about 10), that reduces to milliseconds for the second and subsequent calls.

I’m happy with serving up results that are, at most, a week old (the quickest changing data sources I’m using update on that timescale) but not happy with caching “forever” (the current joblib model). Without this modification my options are:

  1. Set up a cron job to clear the caches every week - this breaks portability of the app (it adds another installation/uninstallation step).

  2. Build the cache maintenance into the request cycle, either in a very broad-brush way (delete the whole lot if the root cache_dir is older than max_age) or per result (what this change does); a sketch of the broad-brush variant follows this list.
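
For reference, a minimal sketch of the broad-brush variant of option 2, run at the start of each request (the cache location, marker file and one-week maximum age are illustrative choices, not anything joblib provides):

import os
import time
from joblib import Memory

CACHE_DIR = "./joblib_cache"                      # illustrative location
MAX_AGE = 7 * 24 * 3600                           # one week, in seconds
STAMP = os.path.join(CACHE_DIR, "last_cleared")   # marker file used by this sketch

memory = Memory(CACHE_DIR, verbose=0)

def clear_cache_if_stale():
    """Wipe the whole cache if the last wipe happened more than MAX_AGE ago."""
    if os.path.exists(STAMP) and time.time() - os.path.getmtime(STAMP) < MAX_AGE:
        return                        # cache is still considered fresh
    memory.clear(warn=False)          # drops every cached result at once
    os.makedirs(CACHE_DIR, exist_ok=True)
    open(STAMP, "w").close()          # restart the clock

Calling this at the top of each request keeps everything inside the app, at the cost of throwing away every entry at once.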

Joblib has been great for this during development because small changes to one component don’t cause a complete recalculation of the whole bundle, so the “Page Broken” --> “Make a change” --> “Hit F5” --> “Wait…” cycle is minimized.

Of course it’s your decision; if it’s not right for joblib then close this issue. Also, if you think I’m going about this the wrong way and there’s a better way of doing it, I’d be very happy to hear it - I couldn’t see any other similar library that provides this functionality, though maybe that’s because it’s a silly way to solve the problem!

The idea from the issue creator, @memory.cache(expires_after=60), is absolutely awesome! I need this for a web API application which caches data between server-side table loads with pagination (the count call creates the cache, the data call uses it) … but the cache should be invalidated quite soon because the data gets updated … just to make the second request faster. Can you merge this in immediately, please?

👍 as well for having a way to clean up the cache with time based parameters

I think this feature is indeed a nice one, and would prove very useful in joblib.Memory, in particular when interacting with SQL databases or changing environments.

A classical hack to implement this feature directly with the function is to pass an extra argument that only changes when the condition changes, like passing the day as a dead argument (see the sketch after this list). This has two major disadvantages:

  • The user needs to change the way they call the function, or add a custom decorator.
  • The cache will grow very large as it is not cleaned automatically.
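
For concreteness, a sketch of that hack: the current day is passed as a dummy argument so the cache key changes once per day (fetch_data and run_expensive_query are placeholder names):

import datetime
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def fetch_data(query, _day=None):
    # _day is a "dead" argument: the body ignores it, but it is hashed into
    # the cache key, so a new day means a cache miss and a fresh computation.
    return run_expensive_query(query)      # placeholder for the real work

def fetch_data_daily(query):
    return fetch_data(query, _day=datetime.date.today().isoformat())

Both drawbacks are visible here: callers must go through the wrapper, and yesterday's entries stay on disk forever.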

My main concern with the proposed implementation is that there can be many other reasons why one would want to invalidate the cache based on some change external to the function arguments. With time, this would lead to an ever-growing list of arguments for Memory.cache that might become impractical.

To me, a good compromise would be to add an argument valid_cache which provides a callable that is called when the result is in the cache. If the callable returns False, the cached result is considered invalid and the function is re-run. We might as well provide a ready-made callable to check the time, to make such use cases easier.
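
To make this concrete, a sketch of how it might look from the caller's side; the valid_cache name comes from this comment, and the metadata passed to the callable (assumed here to include the time the entry was stored) is an assumption, not a released joblib API:

import time
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

def expires_after(seconds):
    """Possible shape of the ready-made, time-based validity helper."""
    def _still_valid(metadata):
        # Assumed contract: joblib calls this with metadata about the cached
        # entry, and a False return forces the function to be re-run.
        return time.time() - metadata["time"] < seconds
    return _still_valid

@memory.cache(valid_cache=expires_after(7 * 24 * 3600))   # proposed argument
def fetch_data(query):
    ...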

WDYT @GaelVaroquaux @lesteve ? In particular, I am thinking of the APHP, this feature would have simplified the caching code a lot.

OK, since you like it, I opened a small PR to try to implement my solution. Let me know what you think of it. In particular, I am not a big fan of the name validate_cache but couldn’t come up with a better one. Feel free to propose one.

I kept the expires_after name for the time-based helper, as I think it is a good one! Thanks to @fredludlow for the implementation; it was a really good starting point for the code base.

I have the same use case (TTL-based cache expiration in a web application interacting with a database) as @fredludlow.

Here is my hack for expiring the cache every day:

import os
from joblib import Memory

# __storage_dir__ (the cache root) and pacific_today() (today's date in
# US/Pacific) are defined elsewhere in my application.


class daily_memory:
    """
    Daily memory: the cache expires every day.
    Usage:

        @daily_memory.cache
        def myfunction():
            ...
    """
    @staticmethod
    def cache(func):
        return daily_memory(func)

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        # A fresh Memory rooted in a per-day directory: a new day means a new
        # (empty) cache for this function.  (cachedir was joblib's keyword at
        # the time; newer versions call it location.)
        day_dir = os.path.join(__storage_dir__, str(pacific_today()))
        return Memory(cachedir=day_dir, verbose=0).cache(self.func)(*args, **kwargs)

+1 for the TTL use case (ignore the current cached result if it is older than X). We use it to cache slowly-changing data instead of reading it from a DB on each initialization, but a TTL mechanism is a must for this use case (so we can re-read it from the DB, say once a day). The bytes_limit argument doesn’t really address this use case.

+1 to time to live. I have the same use case as Fred, and I bet a lot of other devs do too. All the major Python caching libraries have TTL, but none of them are as good as joblib overall. A much-needed feature indeed.

• The cache will grow very large as it is not cleaned automatically.

If our cache replacement policy works, this shouldn’t be a problem (but I’m not convinced that our cache replacement policy works).

My main concern with the proposed implementation is that there can be many other reasons why one would want to invalidate the cache based on some change external to the function arguments. With time, this would lead to an ever-growing list of arguments that might become impractical.

Indeed. Is the present proposal absolutely central, more so than the others? It might be.

To me, a good compromise would be to add an argument valid_cache which provides a callable that is called when the result is in the cache.

I like that!

@chengguangnan, wouldn’t this clutter the disk with cached results from all the previous days?

You mean adding an extra function for explicitly clearing a cache? Something like:

joblib.clear_cache(mem, min_days=30, max_days=60, min_size=10, max_size=30)

I think this could certainly be useful, but the reason I didn’t go down this route is that you still have to choose where a call to this function sits. It’s either in the request cycle (and its complexity is O(size_of_cache), so potentially a large overhead per request) or in a separate scheduled maintenance process/thread/whatever (see the previous comments about portability, as well as robustness: if your maintenance process/thread dies or stops, there’s no way for the request to know whether the cache is still in date).

Adding it as a kwarg to the cache decorator makes the check O(1): it only looks at the single entry we care about for that call.

We try to keep joblib caching as unsurprising as possible, so my first reaction would be against this idea.

+1. It seems out of scope.

However, it might be good to make sure that it is possible to build such a cache removal mechanism outside of joblib, using joblib.
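
On that last point, a minimal sketch of such a mechanism built outside joblib with only public calls (the decorator name and the per-process timestamp are choices of this sketch, not a joblib feature): it empties a function's cache whenever the last wipe is older than the TTL, then falls through to the normal cached call.

import time
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

def cache_with_ttl(ttl_seconds):
    def decorator(func):
        cached = memory.cache(func)
        state = {"last_cleared": time.time()}   # per-process, not persisted

        def wrapper(*args, **kwargs):
            if time.time() - state["last_cleared"] > ttl_seconds:
                cached.clear()                  # empty this function's cache
                state["last_cleared"] = time.time()
            return cached(*args, **kwargs)

        return wrapper
    return decorator

@cache_with_ttl(24 * 3600)        # wipe this function's cache once a day
def fetch_data(query):
    ...

This is coarser than the per-entry expires_after proposal (it drops every argument combination at once) and the timestamp lives only in the current process, but it shows the mechanism can be layered on top of joblib.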