restic: Hash does not match id during prune
Output of restic version
I typically use:
restic 0.8.3
compiled with go1.10 on linux/amd64
but I also tried with:
restic 0.9.2 compiled with go1.10.3 on linux/amd64
How did you run restic exactly?
restic -r <repo_path> forget --keep-last 5 --prune
What backend/server/service did you use to store the repository?
This may probably be very relevant: the source of the data is on a USB hard disk, which is mounted always in the same mount point. The repository is instead a SMB share mounted on a specific constant path via cifs. SMB share is stored on a ext4 filesystem. I’ve seen this failure reported other times, but I created a new issue anyway to explain this particular situation.
Expected behavior
Prune should succeed.
Actual behavior
Backing up succeeds. Then, I run the above command, which should remove exactly one snapshot. Prune fails returning exit code 1. Running prune once again instead seems to work and report 5 available snapshots. The error reported by the first prune command is:
enter password for repository:
repository 2b913e7d opened successfully, password is correct
Applying Policy: keep the last 5 snapshots snapshots
snapshots for ([...]):
keep 5 snapshots:
ID Date Host Tags Directory
----------------------------------------------------------------------
[...]
----------------------------------------------------------------------
5 snapshots
remove 1 snapshots:
ID Date Host Tags Directory
----------------------------------------------------------------------
[...]
----------------------------------------------------------------------
1 snapshots
1 snapshots have been removed, running prune
counting files in repo
building new index for repo
[8:32] 100.00% 30451 / 30451 packs
incomplete pack file (will be removed): 0419682275c07ffa260e5369a3a1d14c750a8e58ef4e1a88d69339717233874b
incomplete pack file (will be removed): 052ae43adbf9501b7acee6ff25ace7a5f57ebc254428c5c4d5e350390f0161e4
incomplete pack file (will be removed): 06ffe477f19fc9f0857897ce98170951400694d155f4efcc0d93cca29dcf0153
incomplete pack file (will be removed): 08af490a1f8debbca69808314b24bc61f45be5e12c084ecd304518b8875a2bde
incomplete pack file (will be removed): 098c7c1c13f5070c7265aeb16aa2f2e3c10fd53147114e2d5dc5434ea29dd9ba
incomplete pack file (will be removed): 0d0db3810b0be0fbc3582521c34b4cf5729418c23853e47f08021717a30967a7
incomplete pack file (will be removed): 11c7e8b95ed3eff46b2819bc3608a0bbe2afa658df1e5a5a94745f76cfba0304
incomplete pack file (will be removed): 11ef82a4fe5f62fc37ad432fb9ffc1611fa1fbc060c28b06160b8f7b0e05d095
incomplete pack file (will be removed): 1281ca112f53a1a1944b10cf258c3fabd36209b394f837a6afea5922c4c41697
incomplete pack file (will be removed): 1c5afa76417bcbc932499b000ab4fbc5c1fabf604d006883c454e6fdad9b3836
incomplete pack file (will be removed): 1c6606767eb76b658e554d8189a468a1db418df1dc448905428b2b087703932c
incomplete pack file (will be removed): 28a3b8b4c8d466fa4998d34412bee15ba6e05e4623a0fee0f39823411a1eca12
incomplete pack file (will be removed): 2bae1122447e8de3001327596b3af9983414307856d71da12db1ede6b871166f
incomplete pack file (will be removed): 3182fea28f3095ed9db5c3e41ff5cd34856c881f0f797004ff7ce3c66705525a
incomplete pack file (will be removed): 31f069689a117e800522e5568ff3df49883a85d9d1a29da3fb66c758a7caf10d
incomplete pack file (will be removed): 3279f70d1f19b400df47acc858c9967ba1845fc84f85e6a0e880ea19dea4c3d4
incomplete pack file (will be removed): 362358d67c70d287b35f149245f7d17faa274c86286779397b6c0a50d958d3e2
incomplete pack file (will be removed): 3aa916f478c9b53c59d2e9a4b23f81acf3dc2612c43612f14e2c6c7edee6f5d5
incomplete pack file (will be removed): 3ac13a9630f020b88a00d43211141126a0f83aa11bcd9cfa74acd2c30c5eb7d0
incomplete pack file (will be removed): 3f67a59f7fb7f86ebb3baea412fef3690dc3422d743430f411c2d89275d7436c
incomplete pack file (will be removed): 3fc8f32d65c08e9d94954bd25faa886eeffb782a53ac66f4e606509539fd5786
incomplete pack file (will be removed): 4db38e486f34070801bf13e71d468c8927c66bff4fb55f53fa1c48ce605b6cf9
incomplete pack file (will be removed): 4f1380d6ce36a11eb8158ebfa331f9fbb657fc57d3e683b738cf31bc55ef87da
incomplete pack file (will be removed): 4ff210a088d4a93ef8dadfbd5548ad243988b084cfbe4f320d1eb5f7826890cb
incomplete pack file (will be removed): 5853d3e8d5bebb3ffd4157ca6c9bdd62302486d14b09df2f60027422484fe483
incomplete pack file (will be removed): 64cbd81bcb3c0a3839400bf51c256f5ef48cf55ebdc947c9542a2157e6dbcb45
incomplete pack file (will be removed): 67080490cf05f4f4c2b2a79a329f7ae12d42803199a1a218a06329d96eeb8f48
incomplete pack file (will be removed): 701b3d47ed15afd38b4a54f4dfef8dac39adfd7fa5c220b4a27fc08f663ffba6
incomplete pack file (will be removed): 744f4a5f609cc8e0d32d6db97d41e442f64a3b4f375385f07dd9d28ebcb2733a
incomplete pack file (will be removed): 78fc31910055722c4f30d4dc98a8a102b62717169b4017463af808a64e5a271b
incomplete pack file (will be removed): 7ce924782de235bd51417e2f4b7cda9ba769d4c059eea12a016482ed8a18a7ee
incomplete pack file (will be removed): 7fefd7b9bdf8b271087996b781a1555aab0b7f87e4776e15657c2cff8c8abf20
incomplete pack file (will be removed): 80f7cae4759579cc2d24404bdc4d1ed1734c13957b825b228080d436b1f6d799
incomplete pack file (will be removed): 84d6087c5930c270f52f1f328aae66b9a1f5c37d8e950dbf6552957fb7ecce5b
incomplete pack file (will be removed): 869c67459a7d996706896a1d1a4c2ae31be1a40a4daa9d422f1e0befa3d1dcea
incomplete pack file (will be removed): 8aa1cb77f0e799835ff635b1fbb15c67c6ced46da705df7e13d89577d4f5f441
incomplete pack file (will be removed): 97b80c967d5649794deb32cc6915c1b14e537e0d1b8004a198860bf7fd5dd941
incomplete pack file (will be removed): 9a83cd05dfdcd5cfdeb32c52aa629b53e0f66d1cacbb9c9a36c89c75a7b0367a
incomplete pack file (will be removed): 9b34ae7836789b440715ee337721d56a0e4f874319b4a9f9d7bef76553cdc4b5
incomplete pack file (will be removed): 9f2a40348559b9d8e9b8323aeb39752e92beb3df06b243c7ab7ef9cdc1d27c69
incomplete pack file (will be removed): 9f715e1df6d580914abcd1fb565bca83ec862128c71f1a0b5a0ca8e1497a86df
incomplete pack file (will be removed): ac312cb7d3163de46f4a1a17ee18d3584426f08d2cc77b763baa08dcba27eea6
incomplete pack file (will be removed): ada80341e6fac08e2b4cba6c5d3ccdf4b6d55337598ae327f67e3a08a1dc167d
incomplete pack file (will be removed): b473443d7609fc9680d298ce9b498e0bf8155d96cc36b0a3d1cdfda007629856
incomplete pack file (will be removed): b4939cbed564a8e32c25d94eb83d3b95cbcfd1d7b3ff70efe991f6fdf4c9e33b
incomplete pack file (will be removed): b604e614ed8a4cd19aba44feb1f9608151a81c02eb1beef8484a2f56db3d2b6b
incomplete pack file (will be removed): b64921a9827653f311c38f77622188fe8aa59b28f3083fd12b1884a62ff235a7
incomplete pack file (will be removed): ba6c08b1bb7c9f717ce6b4dcd27d45e22fd6b8dd7cd9ba324af1ee051e72f2bd
incomplete pack file (will be removed): be1e7a1489b837859b23f6b28ff4de93f894d4cb01651c633d05487836e5d8eb
incomplete pack file (will be removed): be7d36159084ba965e2773ea331496a453df42d4cf5db940d925e13ea8ac5809
incomplete pack file (will be removed): c1290a18a7cb37b071749629d3780153c5e750e461d8f9c823dcf975b9817a45
incomplete pack file (will be removed): c65e2b45b9d5e2e7bf4413f67ac9b4ba2fab16ce799656ccd025ebc013dcae77
incomplete pack file (will be removed): c6d6eb2b00e8fad066abaa9f3495ad772e3beb74b5152775d524bf49a861fc58
incomplete pack file (will be removed): c73c3cb9a2794ef6a8960c530d515c1a250f11b4bc5d90a8230a24ce67a5e551
incomplete pack file (will be removed): c988a36ddd335b5b8e4b76ba63235632f0015ce8c8f70fcc85f8db1e7f1b185d
incomplete pack file (will be removed): daaed644c6cfb078f6211df1c5ef121fd62eb1b0f81a6d542b0388d5bfa928cf
incomplete pack file (will be removed): dd3d09a8062d80aa2c97759005f9729a88391af54f214c0895caadb34aeb571e
incomplete pack file (will be removed): e5952a97b5da1234beb15bcec2e8b31231834328b44ab2ee7cbcfa90c98f90d4
incomplete pack file (will be removed): e8c6d9c2794945503574853b5d36291d92503d6a259b51823b38f714fc178a4e
incomplete pack file (will be removed): e8da886b00b552c6bb95e9656b69366c568f778136f3d4ce49cbc0978ca13156
incomplete pack file (will be removed): e96b280db2ded87eeddfde558a72d1756e01dc430e1706a1a540586b71606623
incomplete pack file (will be removed): f00ceebe50cd5062abb52b4b73f19f95cbeed29e674b04ec3dce24251888635e
incomplete pack file (will be removed): fa1649e4c66455325e9e320bf06fbfd2e038a70d49346ab28c87b21cf36bc14e
incomplete pack file (will be removed): ff96a7f6ed64b19412c5bd69503eec2ad76cd2a82bb1d16dc7ec1ab1ffee45c3
repository contains 30387 packs (117851 blobs) with 147.154 GiB
processed 117851 blobs: 24540 duplicate blobs, 29.393 GiB duplicate
load all snapshots
find data that is still in use for 5 snapshots
[0:00] 100.00% 5 / 5 snapshots
found 93121 of 117851 data blobs still in use, removing 24730 blobs
will remove 64 invalid files
will delete 67 packs and rewrite 8017 packs, this frees 29.468 GiB
hash does not match id: want ad4b41a807fef9716a60f92f63334d3b28fd5a8a744a43069455fcd42f1db352, got 2ac3712c12e1b363c94fd6637fa8c09e8cffbc6d0bfc13699d83b8121e45c085
github.com/restic/restic/internal/repository.Repack
src/github.com/restic/restic/internal/repository/repack.go:35
main.pruneRepository
src/github.com/restic/restic/cmd/restic/cmd_prune.go:278
main.runForget
src/github.com/restic/restic/cmd/restic/cmd_forget.go:240
main.glob..func10
src/github.com/restic/restic/cmd/restic/cmd_forget.go:24
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).execute
src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:762
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).ExecuteC
src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:852
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).Execute
src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:800
main.main
src/github.com/restic/restic/cmd/restic/main.go:86
runtime.main
/usr/local/go/src/runtime/proc.go:198
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
Steps to reproduce the behavior
Backup once. Prune with the above command. Prune fails. Second prune command succeeds. This seems to be always reproducible.
Do you have any idea what may have caused this?
Unfortunately no idea.
Do you have an idea how to solve the issue?
I tried to rebuild the index. It succeeds, but the the same behaviour can be reproduced: backup, prune, failure. I also removed restic cache from the home directory, nothing changed. I created this repo for development purposes, I’m not much interested in the data it contains. I can freely experiment on it to fix this issue if someone is interested. I cannot provide the repo itself as it contains a private data set.
Did restic help you or made you happy in any way?
Of course. Seems to be a good backup solution, but still needs a few critical features.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 93 (56 by maintainers)
Commits related to this issue
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to restic/restic by fd0 6 years ago
- Add 'debug examine' command to debug #1999 — committed to mfrischknecht/restic by fd0 6 years ago
Ah, yeah, it depends a lot on how large the broken blob is (and on your CPU, obviously). You don’t need to run it, it was just a quick test.
So, here comes my analysis, the prerequisites are:
From the test code’s output we know:
In summary, for both repos, a part of a file was modified. The following observations assume that the modification has happened accidentally (i.e. not on purpose by an attacker).
The modification may have happened within restic or outside of it. What I’ve seen so far makes makes me think it’s very unlikely that the modification of the data has happened within restic. The reasoning is as follows:
I’m at a loss what happened here, and I’m sorry, but what I see from the result of the analysis makes me think it’s a fault outside of restic, i.e. harddisk, memory, or a network error. If it were a bug in restic (which is entirely possible!), then I would have expected it to appear much more often for the huge sets of data restic processes for various users (~1TB of data, with at most 600KB modified for @askielboe)…
For @carlonluca (repo mounted via CIFS from a server, stored on ext4 there), the modification may have been caused by the following issues:
For @askielboe (repo accessed via sftp, stored on a NAS):
We can rule out a network error here because sftp is used through SSH, which authenticates data.
I’m not sure what to do next here. Ideally, I’d love to get the results of
memtest86for all machines involved… but that’ll be time-consuming and (in case of @askielboe’s NAS) maybe not even possible…We can also try to examine the damage a bit more closely, by decrypting the broken blob while disregarding the signature. If it’s something from a file format that is readable without the context (or we can even find the source file on your machines), we would be able to see how many bytes have been modified and in which way exactly. I’m not sure if that helps.
So, what do you think? Is there anything I missed, or any conclusions I got wrong?
Oh yes, absolutely! restic needs to handle such cases a lot better. When the hash does not match for a file, this is just an indication that something is wrong, it should be clearly printed as such. Restic can then try to load and decrypt the individual blobs. For the blobs which could be successfully verified, restic can just use them. The others (which do not verify) should be discarded, that’s what we have the signature for. After all, they could have been modified by an attacker, and should be treated as such. This allows at least to salvage the unmodified parts of the file.
Then,
restic pruneshould proceed, not including any broken blobs in the repository. This makesrestic checkcomplain that data is missing, but most of the time the original data is still there (like in your case) and the repository will just repair itself. Or users discard and forget the snapshots which reference the missing data.We should definitely do 1) and 2), 3) is hard to do because I don’t have any idea what’s going on. It’d be great to have some feedback should you ever find out what’s going on with your hardware!
Hm, odd: The differences start at offset 0x646fa, the blob is unmodified again starting at offset 0x6c6fa, which means 32768 bytes are modified. That’s exactly 32KiB. The start offset is not a multiple of 32KiB though. XORing the bytes does not show any obvious pattern.
When we consider the position of the blob in the file in the repo, the faulty bytes start at offset 2701558+0x646fa = 3112944, which is also not a multiple of 32KiB. Oddly enough, that’s a multiple of 16, so the start offset in the file is 0x2f7ff0.
I’m thinking aloud here: For a memory (RAM) error that’s a rather surprising result, it could hint at storage problems. Maybe the file system on the NAS, or the underlying storage medium, uses 32KiB blocks? Even then, I would have expected the start offset to be a multiple of 32KiB…
Another thought: The usual transfer chunk size for data over sftp is also 32KiB…
It’d be great to have the same data from @carlonluca so we can compare it.
@askielboe can you check what the block size for the FS on the NAS is (for ext4 it’s either 1024 or 4096 for my systems)? You can either use
tune2fs -l /dev/sda1(or whatever the partition for the data is) or usestat -fc %s /path/to/mountpointI wanted to follow up on this thread to point others to a potential cause that affected me - a kernel bug affecting some builds before versions 5.3.15, 5.4.2, and 5.5 that can cause CPU register corruption. More details are available in my thread on the forum.
Please see https://github.com/restic/restic/issues/828#issuecomment-706186047 for updated instructions.
@n8henrie The debug-1999 branch been merged into master. It has gained the functionality to extract all intact parts of the damaged pack file, in the meantime. With that it’s actually possible to recover (well, actually replace) a damaged pack file. As I haven’t posted usage instructions here, here’s a complete walk-through:
masterbranch usinggo build -tags debug ./cmd/resticrestic debug examine --extract-pack <id-of-broken-pack-file>to extract all blobs in the damaged pack file into the current folder (-> it’s best to use a temporary folder to reduce the chaos). The packfile must still be located in the proper place inside the repository, but it is not necessary for it to be contained in the repository index (i.e., there’s no need to runrebuild-index, please do not run that command right now) The option--try-repaircan be used to try to repair a single bit flip, and--repair-bytetries to repair a full byte at once (but is also much slower). Theexaminecommand won’t modify the repository, so it’s safe to run even though it’s experimental code.restic rebuild-index(which should work fairly when using the current master branch).restic backup .in the folder which contains the extracted blobs. This let’s restic recover the blobs which were previously contained in the damaged pack file.restic check --read-datashould no longer report errors or only a few ones.run restic backup <original-file>, which will recover the missing blobs. (With restic before 0.10.0 you have to add the--forceflag).Apologies for the late reply, I’ve been away from the Internet for a while. The fixes for #2066 and #2068 have fixed the problems I’ve been seeing. I’ve pulled both those fixes into a local branch which also has the debug-1999 branch pulled into it.
Regarding the earlier comment, we don’t have a “malware protection gateway” sort of device, but we were indeed having some serious network issues, though we didn’t know that when I posted into this issue ticket. (It was a switching loop, where the “loop” was caused by several wifi APs.) That may have contributed to the problems we saw.
Here is my now working output:
Thanks a bunch for the help!
The discussion about how to handle these cases flows into a forum thread I made a while ago with an idea about restic offering a suggestion about what to do next, in case of an error: https://forum.restic.net/t/deterministic-output-of-restic-check-to-know-what-to-do-next/865?u=matt
The example there is for check, but I could see it happening for errors on any other command, too. I would be interested in implementing this if someone wanted to give me a pointer as to the best, centralized place to handle errors in this way.
To me these tests, along with your reasoning @fd0, seem to suggest that this is most likely a hardware issue, or at least an issue outside of restic.
What are your thoughts on handling these cases in restic? Are there any meaningful paths that we could take to improve the user experience when repo data gets corrupted?
A few thoughts/ideas:
Documentation: We could extend the error message somewhat to inform the user that they should check the consistency of their repository. Judging from the number of issues on this, there is a bit confusion about what a wrong object hash means. I’m not entirely sure if we’ve covered all cases of this, and can say anything definitive about it yet, but it seems that at least some of it could be hardware related - and warrant some action by the user to protect their backups.
Active recovery: Given that I managed to repair my repository by removing the broken pack, and let restic take care of the rest, would it be a viable option to let restic try to repair the repo by re-uploading bad blobs to the repo if the data is available? At least it would feel safer to let restic manipulate the repository, rather than having to modify the objects manually.
More tests: Run more tests to figure out what’s going on before we take action. If we had some way of reliable reproduction that would be helpful, but since it’s likely a hardware failure I guess that’s difficult.
In any case it would still be nice to have a way to recover from corrupted blobs using restic. I probably don’t currently know enough about the internals of restic to decide what the best approach is, but I’d like to help out if possible.
Ah, I’ve got something we can try: run the following on all Linux-based machines involved, then try again:
This will clear all file system caches from memory, so the data needs to be re-read from disk. Then run
sha256sumon the broken file in question, best would be on the machine which stores the file, or even better: run the test code again (without any--try-*options).@carlonluca @askielboe I’ve pushed a new branch
debug-1999with extra code to find out what’s going on here.Please pull the repo, then run
git checkout debug-1999and build restic withgo run build.go -tags debug. Then you should have therestic debug examinecommand, it can be used like this (I’ve manually inserted an error into the file):Please paste the output here! 😃
Thanks for the report, I’m sorry for the late reply.
Incomplete pack files are usually the result of restic being aborted (e.g. by pressing
^C) during a run ofrestic backup. It will then abort and not finalize the file, so it is too short to be valid, which is then detected early on.This case (in contrast to the others) is a bit odd because CIFS is involved, which may or may not be the cause.
I have several ideas how we can try to drill down here. The core issue is that restic wrote data with the SHA256 hash of
ad4b41a807[...], but when reading the data back it got data with the hash2ac3712c12e[...]:So, we need to find out what’s going on and where a (potential) bug in restic may be located.
I’d like you to do the following
sha256sumon the file, report back the hash that is returnedls -alof the fileThanks!