Publish-Docker-Github-Action: Cache Expiry Proposal
Hi,
I noticed as per #14 that your cache option has no expiry. I was trying to figure out a way to add automatic cache expiry without needing the cron solution, and I found something that may do the trick.
The docker history command can give you a log of all the docker actions that were taken to create an image. Particularly with the -H=false option, it is possible to get timestamps for when each image/layer was created.
Unfortunately, it also lists all the layers from any image your image is based on via the FROM command (which makes sense). This means I can’t figure out a way to tell which layers we are able to rebuild and which we are not, and hence we are at risk of always triggering a full build anyway if we are based on an old image.
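To make this a bit more concrete, here is a rough Python sketch of reading those timestamps. It assumes the history has been captured as text with one ISO 8601 timestamp per line, topmost layer first, which is what I would expect from something like `docker history --format '{{.CreatedAt}}' IMAGE` (the exact timestamp format may differ between Docker versions). The function name `parse_history_timestamps` and the sample lines are made up for illustration.

```python
from datetime import datetime, timezone


def parse_history_timestamps(text):
    """Parse one ISO 8601 timestamp per non-empty line into UTC datetimes."""
    stamps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Older Python versions reject a trailing "Z", so normalise it first.
        if line.endswith("Z"):
            line = line[:-1] + "+00:00"
        stamp = datetime.fromisoformat(line)
        if stamp.tzinfo is None:
            # Assumption: treat timestamps without an offset as UTC.
            stamp = stamp.replace(tzinfo=timezone.utc)
        stamps.append(stamp.astimezone(timezone.utc))
    return stamps


# Made-up sample, in the order docker history prints it (topmost layer first).
sample = """\
2019-10-26T09:15:00Z
2019-10-26T09:14:30Z
2019-10-16T00:00:00Z
"""
stamps = parse_history_timestamps(sample)
print(stamps[0].strftime("%Y%m%d"))    # date of the topmost layer
print(min(stamps).strftime("%Y%m%d"))  # date of the oldest layer
```

The topmost and oldest dates can differ by a lot, and nothing in the timestamps alone says whether the old entries came from the base image.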
I don’t know if you will find this information helpful, but just in case you hadn’t seen this before I thought I would drop a mention here. Hopefully you can figure out a way to use this for elegant automatic cache expiry.
About this issue
- State: closed
- Created 5 years ago
- Comments: 26 (26 by maintainers)
Unfortunately I don’t think this quite works. This is the bit I wasn’t able to figure out. For an example, here is the output of `docker history -H=false` for an image I maintain:

Which gives a `HISTORY_TIMESTAMP` of 20191026. This all seems to be working fine so far, but the issue arises when we consider the case where I have run a previous build that did use the cache. `HISTORY_TIMESTAMP` is based off the topmost (and hence most recent) history entry. If my cached build updated only the last couple of layers, we would have no way of knowing based off `HISTORY_TIMESTAMP`, and wouldn’t be able to notice that the lower layers were very stale.

At first it seems the simple solution to this is to use the last history entry instead, but as in my above example, the history logs include layers from the base image used in a Dockerfile FROM command. In this case, my image is based off debian:stretch. You can see that debian:stretch was updated on 2019-10-16, a full ten days before I did my build. Doing a full build will not, of course, refresh these layers, as these are built and maintained by someone else. Therefore we are at risk of always invalidating the cache, since the base image is too old.

`docker history -H=false debian:stretch` gives the following:

As you can see, these are the oldest two layers from my image above.
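Here is a small Python sketch of why neither end of the history works as the reference point. The dates, the 30-day limit, and the helper name `is_expired` are all hypothetical; the only thing taken from my example is that the base image layers date from 2019-10-16.

```python
from datetime import datetime, timedelta


def is_expired(reference, now, max_age):
    """True when the reference timestamp is older than max_age."""
    return now - reference > max_age


# Hypothetical layer dates, topmost first.
layers = [
    datetime(2019, 12, 1),   # top layer, rebuilt by a recent cached build
    datetime(2019, 10, 26),  # lower layer of ours, reused from the cache
    datetime(2019, 10, 16),  # inherited from the base image
]
now = datetime(2019, 12, 2)
max_age = timedelta(days=30)

# The topmost entry looks fresh, so the stale lower layer goes unnoticed.
print(is_expired(layers[0], now, max_age))

# The oldest entry looks expired, which would trigger a full build.
print(is_expired(layers[-1], now, max_age))

# After that full build our own layers are new, but the base image layer
# keeps its old date, so the oldest entry still looks expired.
rebuilt = [now, now, datetime(2019, 10, 16)]
print(is_expired(rebuilt[-1], now, max_age))
```

The entry we actually care about is the middle one, the oldest layer we built ourselves, and the history alone does not tell us which one that is.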
I am not sure how to solve this, as we effectively need to be able to tell the age of the oldest layer that we are able to rebuild. The `docker history` command itself provides no hint as to which layers were built by us and which were inherited.

The only two solutions I have thought of are as follows, both of which have downsides:
Firstly, we could parse the Dockerfile to find if our image is based off any external image, and if so, pull it, get its history, and exclude these history lines from anywhere they appear in our history. This way we (in theory) see only the history lines for the layers we are able to build. We can then take the oldest of these to use for cache invalidation.
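A rough Python sketch of this first idea, under several assumptions: history entries are represented as `(created_at, created_by)` pairs, an inherited layer is recognised by an identical pair appearing in the base image's history, and FROM lines are simple enough to split on whitespace (no `ARG` substitution, for instance). The function names and sample data are made up; fetching the real histories is left out.

```python
from datetime import datetime


def external_base_images(dockerfile_text):
    """Return images named in FROM lines, skipping references to earlier stages."""
    stages, images = set(), []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if not parts or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=... that may precede the image name.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        if image not in stages and image != "scratch":
            images.append(image)
        # "FROM image AS name" registers a stage name for later FROM lines.
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2])
    return images


def oldest_own_layer(image_history, base_history):
    """Oldest timestamp among entries that do not also appear in the base history."""
    inherited = set(base_history)
    own = [created for created, created_by in image_history
           if (created, created_by) not in inherited]
    return min(own) if own else None


# Hypothetical data: (created_at, created_by) pairs, topmost layer first.
base = [
    (datetime(2019, 10, 16), "base step 2"),
    (datetime(2019, 10, 16), "base step 1"),
]
image = [
    (datetime(2019, 10, 26), "our step 2"),
    (datetime(2019, 10, 26), "our step 1"),
] + base

print(external_base_images("FROM debian:stretch\nRUN echo hello\n"))
print(oldest_own_layer(image, base))
```

Matching on the pair rather than on a layer ID is a guess on my part about what stays stable between the two histories, and it is exactly the kind of assumption that makes this approach feel fragile.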
Secondly, we could do something similar to https://github.com/einaregilsson/build-number and apply a tag to the commit that the most recent full-build was made from. However, as per the GitHub API documentation, you don’t get data on when a reference was edited or created, so we would have to tag the commit with something like `full-build-2019-10-26`, and then parse that date to find when the last full build was performed.

I am not super happy with either of these solutions. The first one is probably my favourite, but seems like it would be very fragile, and relies on assumptions about how people use FROM statements. I will continue to think about it, but I still don’t think either of these is quite ready for production use.
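For the tag-based idea, the date parsing could look roughly like the Python sketch below. It assumes the tag names have already been fetched as plain strings and that full builds are tagged `full-build-YYYY-MM-DD`; the function name `last_full_build` is made up for illustration.

```python
import re
from datetime import date

# Assumed tag naming scheme: full-build-YYYY-MM-DD
TAG_PATTERN = re.compile(r"^full-build-(\d{4})-(\d{2})-(\d{2})$")


def last_full_build(tag_names):
    """Return the most recent date encoded in full-build tags, or None."""
    dates = []
    for name in tag_names:
        match = TAG_PATTERN.match(name)
        if not match:
            continue
        year, month, day = (int(part) for part in match.groups())
        try:
            dates.append(date(year, month, day))
        except ValueError:
            # Ignore tags that match the pattern but are not real dates.
            continue
    return max(dates) if dates else None


tags = ["v1.0", "full-build-2019-09-01", "full-build-2019-10-26"]
print(last_full_build(tags))
```

The cache would then be skipped whenever the returned date is older than the chosen limit, or when there is no such tag at all.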
Hope this helps. Sorry for the wall of text.