restic: Parent-snapshot detection fails with changing --files-from
Output of restic version
restic 0.9.4 compiled with go1.11.5 on linux/amd64
(debian testing, from the debian repository)
What should restic do differently? Which functionality do you think we should add?
In short:  With an invocation like restic backup --files-from my-files-and-dirs.lst, restic could be more efficient about choosing a parent snapshot.
In long:
Whenever my-files-and-dirs.lst changes, no matter how slightly, restic apparently sees a different set of paths.  Only slightly different, but still different.  When searching for a parent snapshot during backup, this leads to unexpected behavior:
On the one hand, restic sees that the path set is different from anything seen before, and assumes a completely new backup. All files are scanned again, even though only changed data is uploaded:
repository 7b6b235d opened successfully, password is correct
Files:        1905 new,     0 changed,     0 unmodified
Dirs:            4 new,     0 changed,     0 unmodified
Added to the repo: 44.372 KiB
processed 1905 files, 346.776 MiB in 0:02
snapshot e50ef85d saved
On the other hand, I made only a small change in my-files-and-dirs.lst, so I expected that only the new files need to be uploaded.
I’m new to restic, so maybe I’m using it wrong.  However, using tags does not seem to change automatic parent detection, and --parent latest does not seem to be supported.  And I don’t want to specify --parent 12345678 manually all the time, and would like to avoid fiddling with restic snapshots on my own.
I’m not sure which feature to propose. There are multiple things that might help:
- Allow --parent latest to use the latest snapshot, no matter what. This would be helpful for people like me, who only have one guest per repository anyway, but might cause problems in other scenarios.
- Instead of the first option, allow --parent latest-sametags to use the latest snapshot with the same tag set. This would avoid potential problems, and still cover most use cases.
- Automatic parent detection could try to find a close match in the previous few snapshots, and if it finds one, use that. As far as I can see, a false positive cannot have a bad impact, can it?
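As a stopgap for the first option, a small wrapper can look up the newest snapshot and pass its ID via --parent. The following is a minimal Python sketch, not something restic ships. It assumes that the output of restic snapshots --json is an array of objects with time, hostname and id fields, and the helper names (parse_time, latest_snapshot_id, backup_command) are hypothetical.

```python
import json
import re
from datetime import datetime

# Assumption: timestamps are RFC 3339, possibly with nanoseconds and a zone offset.
_TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})$"
)


def parse_time(ts):
    """Parse a timestamp string into an aware datetime (microsecond precision)."""
    match = _TIMESTAMP.match(ts)
    if match is None:
        raise ValueError(f"unexpected timestamp: {ts!r}")
    base, frac, zone = match.groups()
    frac = (frac or "0")[:6].ljust(6, "0")  # strptime accepts at most 6 digits
    zone = "+0000" if zone == "Z" else zone.replace(":", "")
    return datetime.strptime(f"{base}.{frac}{zone}", "%Y-%m-%dT%H:%M:%S.%f%z")


def latest_snapshot_id(snapshots_json, host=None):
    """Return the ID of the newest snapshot, optionally limited to one host."""
    snapshots = json.loads(snapshots_json)
    if host is not None:
        snapshots = [s for s in snapshots if s.get("hostname") == host]
    if not snapshots:
        return None
    # Compare parsed times, not strings, so differing zone offsets sort correctly.
    return max(snapshots, key=lambda s: parse_time(s["time"]))["id"]


def backup_command(files_from, parent=None):
    """Build the restic argv; without a parent, restic uses its own detection."""
    cmd = ["restic", "backup", "--files-from", files_from]
    if parent is not None:
        cmd += ["--parent", parent]
    return cmd
```

In a real wrapper, the JSON would come from something like subprocess.run(["restic", "snapshots", "--json"], capture_output=True, text=True, check=True).stdout, and the resulting argv would be passed to subprocess.run as well. The sketch only builds the command, so it can be tried without a repository.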
What are you trying to do?
I’m making snapshots of parts of my home directory, and have restrictions on the target repository size.  So I only want to include specific things, like ~/workspace/, ~/bin/, ~/.bashrc, and so on.  But not other things, like the gigantic folder of virtual machines, as the restic host has not enough space for that.  Obviously, this list is subject to change.
With the current behavior, a simple backup without --parent 5eab0a7 makes the backup run a bit longer than necessary.  Deduplication does its job perfectly, and no excess data is stored.
With the proposed behavior, no such delay would happen, or only when detection fails.
Did restic help you or made you happy in any way?
I’m currently using rsnapshot, and it works great for its use case. However, with so many small files, and some large files constantly changing (thunderbird is a strong offender), my home directory falls outside rsnapshot’s use case. It seems restic is fast and space-efficient enough to cope with that much better. Hooray!
Just for the record: I no longer have this particular issue, since I personally can just avoid it. I don’t close the issue because I recognize that other people do run into trouble due to this issue.
About this issue
- State: open
- Created 5 years ago
- Reactions: 13
- Comments: 41 (16 by maintainers)
It is possible to work around this with the --parent flag on the backup command; however, it does require the kind of fiddling the OP wanted to avoid. For anyone who’s interested, though, the following bash uses the most recent snapshot as the parent:
It requires installing jq to work on the JSON output.
Mirroring the --group-by semantics to --parent should be fairly simple to implement. It’s probably just a matter of calling FindLatestSnapshot in cmd/restic/cmd_backup.go:findParentSnapshot the right way.
Overloading the --parent option might be a bit unexpected for users, but as long as the overloaded parameters cannot be mistaken for a snapshot ID, it should be fine.
It’s mostly macOS clients, very few Windows clients. I wouldn’t call it rare considering how many Mac users there are out there, both generally and using restic, though…
I haven’t seen the hostname change for Windows clients, so you’re probably right there.
It’s indeed of less value when it changes, yes. Which is part of my point that it’s not the best thing to fall back to for parent snapshot detection.
What’s your use case for that? That your hosts have a lot of similar data so you want to make use of deduplication?
I wouldn’t bet either, and no, there are no hard numbers. I just feel that most home users back up their main computer and that’s it. Obviously there are a lot who back up several hosts or use multiple backup sets, but given the questions and discussions over the years, it doesn’t seem to me that this is more common than just one host per repository, rather the opposite. Even when you have multiple hosts to back up, I’d guess it’s more common that each backs up to its own repository, except when there are specific use cases for sharing one repository. But it’s just what I think; no one in the entire world will ever know for sure.
Overall I think that making a guess about the parent snapshot when the regular way of deducing it isn’t working has a high chance of not being successful in a lot of cases. But taking a step back, what you’re saying here is pretty much the same type of thing that the --group-by option to the forget command provides. Such option values could perhaps even be incorporated into the --parent option as it is, not needing a new option name.
Another option that could be of interest, although it’s hard to know how valuable it would be, would be some kind of similarity comparison between the current paths and previous snapshots. If we find a snapshot that has a high similarity in its paths, textually speaking, to the current (to be backed up) paths, that might be something to go by.
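One possible reading of such a similarity comparison is a Jaccard index over the two path sets. The sketch below is only an illustration of that idea, not a proposal for how restic should implement it. It assumes each snapshot is a dict with a paths list, as in the JSON output of the snapshots command; the function names and the 0.5 threshold are made up for the example.

```python
def jaccard(a, b):
    """Share of paths common to both collections, between 0.0 and 1.0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # nothing to compare, so treat it as no match
    return len(a & b) / len(union)


def closest_snapshot(current_paths, snapshots, threshold=0.5):
    """Return the snapshot whose paths overlap most with current_paths.

    Snapshots scoring below the threshold are ignored. On a tie, the later
    entry in the list wins, so an oldest-to-newest list prefers the newest.
    """
    best, best_score = None, threshold
    for snap in snapshots:
        score = jaccard(current_paths, snap.get("paths", []))
        if score >= best_score:
            best, best_score = snap, score
    return best
```

With previous paths /home/u/bin and /home/u/workspace and a new list that adds /home/u/.bashrc, the score is 2/3, so that snapshot would still be picked as the parent. Comparing whole path strings is the simplest choice; comparing path prefixes or components would be a different design with its own trade-offs.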
Then again, even if a system moves around a bit and changes hostname now and then, in most cases it would have one out of maybe two or three common hostnames (e.g. work, home and whatever). Also, if one is using --files-from then one might already be able to make sure the hostname is always the same using the --host option to backup. So it might not be too bad to use that as fallback (at least by default).
@MichaelEischer What do you think about making --parent accept the same type of values as --group-by (besides the current latest and <snapshot ID>)? Would it be overly complicated or messy?
Perhaps the multi-parent feature suggested by @aawsome earlier might be useful too.
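To make the --group-by idea concrete, here is a hedged Python sketch of parent selection keyed on a configurable set of fields. It is a model for discussion, not restic’s code. It assumes snapshots are dicts with hostname, paths and optional tags, listed oldest to newest, and that the accepted field names are host, paths and tags as with forget --group-by. The default of host plus paths is meant to mimic the exact-match behavior described in this issue; group_key and find_parent are hypothetical names.

```python
def group_key(snapshot, group_by):
    """Build a comparable key from the chosen fields of a snapshot."""
    fields = {
        "host": snapshot.get("hostname"),
        # Sort so that ordering differences do not affect the comparison.
        "paths": tuple(sorted(snapshot.get("paths") or [])),
        "tags": tuple(sorted(snapshot.get("tags") or [])),
    }
    return tuple(fields[name] for name in group_by)


def find_parent(snapshots, current, group_by=("host", "paths")):
    """Return the newest snapshot in the same group as the current backup.

    Assumes `snapshots` is ordered oldest to newest.
    """
    wanted = group_key(current, group_by)
    for snap in reversed(snapshots):
        if group_key(snap, group_by) == wanted:
            return snap
    return None
```

With the default grouping, adding a single line to the --files-from list changes the paths key and no parent is found. Grouping by host and tags instead would still find the previous snapshot, which is roughly what the latest-sametags proposal asks for.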
@BenWiederhake
Ha, I’ve written almost the exact same thing, for the same purpose of “Wrappers that make restic more cronable”. Mine of course is organized a little differently, and doesn’t yet exist as a repo on my profile. I have one main restic wrapper (that batches multiple optional commands and provides logging and other universal services), that in turn is sourced from any number of specific configuration wrapper scripts that provide the configuration (exactly backwards from “read this config file”, but often useful in bash - I think of it like an interface in .NET), and then a cron wrapper that takes the configuration wrapper as an argument. The cron wrapper provides the appropriate user context and environment. (And implementing that cron wrapper, possibly, may have been what somehow caused restic to see all files as “new” (#3004).)
Thanks for the link to your project. The readme is a good read. It’s validating to see other users with the exact same challenges I have, and the varied ways in which we tackle them.
For example, I’m writing this (and already use the exact logic described, as individual commands for other purposes - e.g. local rsync --files-from backup).
The idea - and method of tackling impossibly large initial backups - is to eventually use a more robust, binary version of x9incexc (being written - or at least started - in Go) to generate a list of files that might be various combinations of: […] mtimes.
Much of which you’ve solved in a different way. The point is, it’s interesting how differently people tackle the same problems.
I would still argue that a rarely-changing --files-from is an “inappropriate” use of such a feature for your use case, or at least that exact flag would be for, say, rsync and countless other utilities. But considering there isn’t actually anything like a --patterns-from flag in restic, my “position” is more or less defanged and irrelevant.
But the way I solve that same problem (of defining which directories to back up with a concept of “backup sets” or “profiles”) is in my config wrapper scripts. All these wrapper scripts do is provide specific cloud storage details, and a rarely-changing list of directories to scan. I simply load each directory to scan into an array, one directory per “add to array” line. Then the sourced restic script itself unpacks that array and turns the entries into positional command-line arguments. So, essentially the same thing you are doing but with a very different approach. However, a big difference is that in my script, they have to be explicit paths, obviously no wildcards. (Which I’ve argued at length is the “proper” way most users - e.g. familiar with rsync - expect a --files-from feature to behave.) Then my --exclude-from file contains only universally excluded files and folders, with no absolute paths.
I should also point out that using --files-from in the way I’m intending - as is fairly industry standard (with an exact list of literal filename paths as generated by a different utility) - may not actually work with restic, because as I’m learning:
- […] --files-from, and instead just treat it as any other filter specification on its own file list. (I’ll test that concern and cross that bridge when I get there.)
- […] --files-from, which - as intended by design for the same flag in most other programs (e.g. rsync, rclone) - is different every time, or at least potentially so, and that is part of the point of the feature.
Edits: do more best grammar, and added last two paragraphs.