restic: Parent-snapshot detection fails with changing --files-from

Output of restic version

restic 0.9.4 compiled with go1.11.5 on linux/amd64

(debian testing, from the debian repository)

What should restic do differently? Which functionality do you think we should add?

In short: With an invocation like restic backup --files-from my-files-and-dirs.lst, restic could be more efficient about choosing a parent snapshot.

In long:

Whenever my-files-and-dirs.lst changes, no matter how slightly, restic apparently sees a different set of paths. Only slightly different, but still different. When searching for a parent snapshot during backup, this leads to unexpected behavior:

On the one hand, restic sees that the path set is different from anything seen ever before, and assumes a completely new backup. All files are scanned again, even though only changed data is uploaded:

repository 7b6b235d opened successfully, password is correct
Files:        1905 new,     0 changed,     0 unmodified
Dirs:            4 new,     0 changed,     0 unmodified
Added to the repo: 44.372 KiB
processed 1905 files, 346.776 MiB in 0:02
snapshot e50ef85d saved

On the other hand, I made only a small change in my-files-and-dirs.lst, so I expected that only the new files need to be uploaded.

I’m new to restic, so maybe I’m using it wrong. However, using tags does not seem to change automatic parent detection, and --parent latest does not seem to be supported. And I don’t want to specify --parent 12345678 manually all the time, and would like to avoid fiddling with restic snapshots on my own.

I’m not sure which feature to propose. There are multiple things that might help:

  1. Allow --parent latest to use the latest snapshot, no matter what. This would be helpful for people like me, who only have one host per repository anyway, but might cause problems in other scenarios.
  2. Instead of 1., allow --parent latest-sametags to use the latest snapshot of the same tag set. This would avoid potential problems, and still cover most use cases.
  3. Automatic parent detection could try to find a close match in the previous few snapshots, and if it finds one, use that. As far as I can see, a false positive cannot have a bad impact, can it?
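
To make proposals 1 and 2 concrete, here is a minimal Python sketch of the selection logic. It is not restic's implementation. It assumes the snapshot list has already been loaded from `restic snapshots --json` and only relies on the `time`, `tags` and `short_id` fields; the function names `parse_time` and `latest_parent`, and the sample data, are made up for illustration.

```python
import re
from datetime import datetime


def parse_time(ts):
    """Parse an RFC 3339 timestamp, tolerating 0-9 fractional digits and 'Z'."""
    base, frac, tz = re.fullmatch(
        r"(.*?)(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})", ts
    ).groups()
    frac = (frac or "0")[:6].ljust(6, "0")  # datetime keeps microseconds only
    tz = "+00:00" if tz == "Z" else tz
    return datetime.fromisoformat(f"{base}.{frac}{tz}")


def latest_parent(snapshots, tags=None):
    """Return the short_id of the newest matching snapshot, or None.

    tags=None  -> proposal 1: newest snapshot, no matter what.
    tags=[...] -> proposal 2: newest snapshot with exactly this tag set.
    """
    wanted = None if tags is None else set(tags)
    candidates = [
        s for s in snapshots
        if wanted is None or set(s.get("tags") or []) == wanted
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: parse_time(s["time"]))["short_id"]


# Hypothetical sample data in the shape of `restic snapshots --json`.
SNAPSHOTS = [
    {"short_id": "5eab0a7c", "time": "2019-03-01T10:00:00.123456789+01:00",
     "tags": ["home"]},
    {"short_id": "e50ef85d", "time": "2019-03-02T10:00:00Z",
     "tags": ["work"]},
    {"short_id": "0a1b2c3d", "time": "2019-02-27T08:30:00.5+01:00"},
]

print(latest_parent(SNAPSHOTS))                 # newest overall
print(latest_parent(SNAPSHOTS, tags=["home"]))  # newest with tag set {home}
```

The result could then be passed to restic backup --parent. Comparing parsed timestamps rather than raw strings avoids surprises when snapshots were written with different UTC offsets.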

What are you trying to do?

I’m making snapshots of parts of my home directory, and have restrictions on the target repository size. So I only want to include specific things, like ~/workspace/, ~/bin/, ~/.bashrc, and so on, but not other things, like the gigantic folder of virtual machines, as the restic host does not have enough space for that. Obviously, this list is subject to change.

With the current behavior, a simple backup without --parent 5eab0a7 makes the backup run a bit longer than necessary. Deduplication does its job perfectly, and no excess data is stored.

With the proposed behavior, no such delay would happen, or only when detection fails.

Did restic help you or make you happy in any way?

I’m currently using rsnapshot, and it works great for its use case. However, with so many small files, and some large files constantly changing (thunderbird is a strong offender), my home directory falls outside rsnapshot’s use case. It seems restic is fast and space-efficient enough to cope with that much better. Hooray!


Just for the record: I no longer have this particular issue, since I personally can just avoid it. I don’t close the issue because I recognize that other people do run into trouble due to this issue.

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 13
  • Comments: 41 (16 by maintainers)

Most upvoted comments

It is possible to work around this with the --parent flag on the backup command; however, it does require the kind of fiddling OP wanted to avoid. For anyone who’s interested, though, the following bash uses the most recent snapshot as the parent:

restic backup --parent "$(restic --json snapshots | jq -r 'max_by(.time) | .short_id')" --files-from my-files-and-dirs.lst

It requires installing jq to work on the json output.

Mirroring the --group-by semantic to --parent should be fairly simple to implement. It’s probably just a matter of calling FindLatestSnapshot in cmd/restic/cmd_backup.go:findParentSnapshot the right way.

Overloading the --parent option might be a bit unexpected for users, but as long as the overloaded parameters cannot be mistaken for a snapshot id, it should be fine.
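
A rough Python sketch of how such an overloaded value could be told apart from a snapshot ID. This is not restic code: `parse_parent` is a hypothetical helper, and the keyword set assumes the same keys that --group-by accepts for forget (host, paths, tags), plus latest as discussed in this thread.

```python
import re

# Keys accepted by `restic forget --group-by`.
GROUP_KEYS = {"host", "paths", "tags"}
# Snapshot IDs (and their prefixes) are lowercase hexadecimal.
HEX_ID = re.compile(r"[0-9a-f]+")


def parse_parent(value):
    """Classify a --parent value.

    Returns ("latest", None), ("group", [keys...]) or ("id", value);
    raises ValueError for anything else.
    """
    if value == "latest":
        return ("latest", None)
    keys = [k.strip() for k in value.split(",")]
    if all(k in GROUP_KEYS for k in keys):
        return ("group", keys)
    if HEX_ID.fullmatch(value):
        return ("id", value)
    raise ValueError(f"not a snapshot ID or grouping: {value!r}")


print(parse_parent("host,tags"))
print(parse_parent("5eab0a7"))
```

None of the keywords consists only of hex digits, so in this sketch a keyword can never be mistaken for a snapshot ID prefix.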

Which OS do you use having such a rare behaviour?

It’s mostly macOS clients, very few Windows clients. I wouldn’t call it rare considering how many Mac users there are out there both generally and using restic though…

In windows and normally in linux you have a constant hostname.

I haven’t seen the hostname change for Windows clients, so you’re probably right there.

If the hostname changes with a different network connection the host field is not a very helpful indicator for the backup origin host anymore.

It’s indeed of less value when it changes, yes. Which is part of my point that it’s not the best thing to fall back to for parent snapshot detection.

My repo for example contains backups of multiple hosts.

What’s your use case for that? That your hosts have a lot of similar data so you want to make use of deduplication?

I wouldn’t bet for it. Do you have any numbers?

I wouldn’t bet either, and no, there are no hard numbers. I just feel that most home users back up their main computer and that’s it. Obviously there are a lot of users who back up several hosts or use multiple backup sets, but given the questions and discussions over the years, it doesn’t seem to me that this is more common than one host per repository; rather the opposite. Even when you have multiple hosts to back up, I’d guess it’s more common that each backs up to its own repository, except when there are specific use cases for sharing one repository. But it’s just what I think; no one in the entire world will ever know for sure.

Maybe it would be more helpful if a user can decide which heuristic to use. That could be anything: latest snapshot, host, tag, paths or host & tag, host & paths, tag & paths and so on.

Overall I think that making a guess about the parent snapshot when the regular way of deducing it isn’t working has a high chance of not being successful in a lot of cases. But taking a step back what you’re saying here is pretty much the same type of thing as the --group-by option to the forget command provides. Such option values could perhaps even be incorporated into the --parent option as it is, not needing a new option name.

Another option that could be of interest, although it’s hard to know how valuable it would be, would be some kind of similarity comparison between the current paths and previous snapshots. If we find a snapshot has a high similarity in its paths, textually speaking, to the current (to be backed up) paths, that might be something to go by.
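
As a sketch of what such a comparison could look like, the Python below scores each recent snapshot by the Jaccard similarity of its path set against the paths about to be backed up. Note that this swaps in a set-based measure for the textual one mentioned above, and that `path_similarity`, `closest_parent`, the `lookback` window and the `threshold` default are all assumptions for illustration, not anything restic provides. Snapshots are assumed to be dicts with a `paths` list, ordered oldest first.

```python
def path_similarity(a, b):
    """Jaccard similarity of two path collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def closest_parent(current_paths, snapshots, lookback=5, threshold=0.5):
    """Pick the most path-similar snapshot among the last `lookback` ones.

    `snapshots` is assumed to be ordered oldest first; ties go to the
    newer snapshot. Returns None if no candidate reaches `threshold`.
    """
    best, best_score = None, threshold
    for snap in snapshots[-lookback:]:
        score = path_similarity(current_paths, snap["paths"])
        if score >= best_score:
            best, best_score = snap, score
    return best


# Hypothetical example: one path was added to the list since the last run.
previous = [
    {"short_id": "bbbb2222", "paths": ["/srv/www"]},
    {"short_id": "aaaa1111", "paths": ["/home/u/bin", "/home/u/workspace"]},
]
current = ["/home/u/bin", "/home/u/workspace", "/home/u/.bashrc"]
match = closest_parent(current, previous)
print(match["short_id"] if match else None)
```

Since a parent snapshot only serves as a hint for change detection, a poor match should cost scanning time rather than correctness, which is why a fairly low threshold seems tolerable here.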

Then again, even if a system moves around a bit and changes hostname now and then, in most cases it would have one of maybe two or three common hostnames (e.g. work, home and whatever). Also, if one is using --files-from, then one might already be able to make sure the hostname is always the same using the --host option to backup. So it might not be too bad to use that as a fallback (at least by default).

@MichaelEischer What do you think about making --parent accept the same type of values as --group-by (besides the current latest and <snapshot ID>)? Would it be overly complicated or messy?

Perhaps the multi-parent feature suggested by @aawsome earlier might be useful too.

@BenWiederhake

Because the list is still not very short, and I use my own wrappers around restic: https://github.com/BenWiederhake/spread

Ha, I’ve written almost the exact same thing, for the same purpose of “Wrappers that make restic more cronable”. Mine of course is organized a little differently. And doesn’t yet exist as a repo on my profile. I have one main restic wrapper (that batches multiple optional commands and provides logging and other universal services), that in turn is sourced from any number of specific configuration wrapper scripts that provide the configuration (exactly backwards from “read this config file”, but often useful in bash - I think of it like an Iinterface in .NET), and then a cron wrapper that takes the configuration wrapper as an argument. The cron wrapper provides the appropriate user context and environment. (And implementing that cron wrapper, possibly, may have been what somehow caused restic to see all files as “new” (#3004).)

Thanks for the link to your project. The readme is a good read. It’s validating to see other users with the exact same challenges I have, and the varied ways in which we tackle them.

For example, I’m writing this (and already use the exact logic described, as individual commands for other purposes - e.g. local rsync --files-from backup).

The idea - and method of tackling impossibly large initial backups - is to eventually use a more robust, binary version of x9incexc (being written - or at least started - in Go) to generate a list of files that might be various combinations of:

  • Of a specified total size (e.g. 1GB); and then the next 1GB would necessarily also include any new files that would have been part of the first 1GB had they existed then, etc.
  • Only files older and/or newer than specified mtimes.
  • Only files larger and/or smaller than specified filesize.
  • Ordered by newest, or smallest, or a weighted ranking of newest percentile and smallest percentile.
    • The reasoning behind backing up smallest first, is the (debatable) premise that the size of a file, if no other information is considered or available, has no bearing on importance. IOW, all else being equal, a 1 KB user data file is exactly as important as a 1 GB file. Therefore if you back up smallest first, you are literally providing orders of magnitude more value, if measured by “units of importance backed up first”. If that’s important…and is to me.
  • Filtered by multiple include/exclude regexes (optionally using “macros” to stand in for common complex regex expressions - which can make definitions more readable and reliable).
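
To illustrate the kind of selection described in this list, here is a small Python sketch. It is not x9incexc; the `Entry` type, `select_files` and all parameter names are invented for the example, and it works on an in-memory list rather than scanning the filesystem. The weighted ranking, the regex macros and the carry-over of new files into later batches are left out.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    path: str
    size: int     # bytes
    mtime: float  # seconds since the epoch


# Sort keys for the two simple orderings; path breaks ties deterministically.
ORDERS = {
    "smallest": lambda e: (e.size, e.path),
    "newest": lambda e: (-e.mtime, e.path),
}


def select_files(entries, budget, min_size=0, max_size=None,
                 newer_than=None, older_than=None,
                 include=(), exclude=(), order="smallest"):
    """Return paths for one batch, in priority order, within `budget` bytes."""
    def keep(e):
        if e.size < min_size or (max_size is not None and e.size > max_size):
            return False
        if newer_than is not None and e.mtime <= newer_than:
            return False
        if older_than is not None and e.mtime >= older_than:
            return False
        if include and not any(re.search(p, e.path) for p in include):
            return False
        return not any(re.search(p, e.path) for p in exclude)

    batch, used = [], 0
    for e in sorted(filter(keep, entries), key=ORDERS[order]):
        if used + e.size > budget:
            break  # keep strict priority order: stop at the first misfit
        batch.append(e.path)
        used += e.size
    return batch


files = [
    Entry("/h/a.txt", 100, 10.0),
    Entry("/h/b.iso", 5000, 20.0),
    Entry("/h/c.txt", 300, 30.0),
    Entry("/h/cache/d.tmp", 50, 40.0),
]
print(select_files(files, budget=1000, exclude=[r"/cache/"]))
```

Writing the returned paths one per line would give a list suitable for a --files-from style option, with the caveats about restic's handling of that file discussed further down.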

Much of which you’ve solved in a different way. The point is, it’s interesting how differently people tackle the same problems.

I would still argue that a rarely-changing --files-from is an “inappropriate” use of such a feature for your use case, or at least that exact flag would be for, say, rsync and countless other utilities. But considering there isn’t actually anything like a --patterns-from flag in restic, then my “position” is more or less defanged and irrelevant.

But the way I solve that same problem (of defining which directories to back up with a concept of “backup sets” or “profiles”), is in my config wrapper scripts. All these wrapper scripts do is provide specific cloud storage details, and a rarely-changing list of directories to scan. I simply load each directory to scan into an array, one directory per “add to array” line. Then the sourced restic script itself unpacks that array and turns them into positional command-line arguments. So, essentially the same thing you are doing but with a very different approach. However, a big difference is that in my script, they have to be explicit paths, obviously no wildcards. (Which I’ve argued at length is the “proper” way most users - e.g. familiar with rsync - expect a --files-from feature to behave.) Then my --exclude-from file contains only universally excluded files and folders, with no absolute paths.
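
The array-to-arguments idea can be sketched in Python roughly as follows. This is not the wrapper described above (which is bash); `build_backup_argv` and the profile values are hypothetical. The only restic options used are --host, --parent and --exclude-file, and the sketch only builds the argument list without running anything.

```python
import shlex


def build_backup_argv(directories, exclude_file=None, host=None, parent=None):
    """Turn a profile's directory list into a `restic backup` argument list."""
    if not directories:
        raise ValueError("profile has no directories to back up")
    argv = ["restic", "backup"]
    if host:
        argv += ["--host", host]  # pin the hostname recorded in the snapshot
    if parent:
        argv += ["--parent", parent]
    if exclude_file:
        argv += ["--exclude-file", exclude_file]
    argv += list(directories)  # explicit paths become positional arguments
    return argv


# Hypothetical profile: one directory per entry, no wildcards.
profile_dirs = ["/home/u/workspace", "/home/u/bin", "/home/u/My Documents"]
argv = build_backup_argv(profile_dirs, exclude_file="/etc/restic/excludes",
                         host="laptop")
print(shlex.join(argv))  # shell-quoted, for logging
```

Passing `argv` as a list to something like `subprocess.run` avoids shell quoting problems with paths that contain spaces.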

I should also point out that using --files-from in the way I’m intending, which is fairly industry standard (an exact list of literal file paths generated by a different utility), may not actually work with restic, because as I’m learning:

  • Since restic treats each line as a globbing definition, that may indicate that it won’t honor the order of files as present in --files-from, and instead just treat it as any other filter specification on its own file list. (I’ll test that concern and cross that bridge when I get there.)
  • This very issue I’m commenting on - not having a parent snapshot to compare the current backup to - might prevent me from using --files-from, which - as intended by design for the same flag in most other programs (e.g. rsync, rclone) - is different every time, or at least potentially so, and that is part of the point of the feature.
  • I forgot what the third problem is, but I seem to remember it being something of a show-stopper for its use.

Edits: do more best grammar, and added last two paragraphs.