restic: Fatal: number of used blobs is larger than number of available blobs!
Restic Version
restic 0.9.6 compiled with go1.13.4 on linux/amd64
Command(s)
restic prune
restic check --read-data-subset=05/53
(the commands are automated as part of a weekly backup maintenance script)
What did you expect?
An error-free prune followed by an error-free check.
What happened instead?
Both the prune and the check failed.
Output of prune:
counting files in repo
building new index for repo
incomplete pack file (will be removed): 056c60842c58dd17d452aacbd8a3a2a28b7729a55df17862c0a9715205178ded
[5:30:10] 100.00% 148048 / 148048 packs
incomplete pack file (will be removed): 173d2b9e645bd13eb794149b3f6f7b938940272f5a3b10cc0d48b13b61f0f73d
incomplete pack file (will be removed): 25be7e6fc725d5c5354e56133eda461d4fc2189ef90a458db837f1580eedffe6
incomplete pack file (will be removed): 429b728243cd9bf1777fe87eba0717ec3d7e04708692972db539ec168bdf35a2
incomplete pack file (will be removed): 54dcb5d83bd0ddeecc49a73f1e48e817b4cc0d5999a0663103455a0335a6116f
incomplete pack file (will be removed): 581b0af89c5ea7ba0bf47d8cfe36463001c08c46a534263c44c779185eb8ad80
incomplete pack file (will be removed): 5908682c326af00157838c8335e15414c7a7e79b0450148edbfa9d32b6eb2989
incomplete pack file (will be removed): 745354f809468d971a5488567f73461348fdc92468e8cc6140c9a50b53f8a265
incomplete pack file (will be removed): 79f599e46864df4a3c94a7b515939998045d7c4984d12f5733a91324e22774ed
incomplete pack file (will be removed): 7ee35edaa9c22be3855a54f67fff32ab88e1b90c0f65a3c7c27852fd55236292
incomplete pack file (will be removed): 8da7e3200bb309debf0ad338387c26dafc97de1afc7ba3011c0ba1241c6d0d08
incomplete pack file (will be removed): 8f74dab11a8de782759efd25db340326c3524851a7c46a4cb2df23d175546328
incomplete pack file (will be removed): a044002799ae0a4c8b0caeed64a3c1c2314f3bd5652f52e4c92af1f47a1ecc9f
incomplete pack file (will be removed): a60b3a2a76f225256377731107b3889c924413a1db8ac212db59c5f2b6efb87d
incomplete pack file (will be removed): b0404632536e3fa1851066a783ba08d9a871fa45b6037e1471ac8b0cfc3fa34a
incomplete pack file (will be removed): b492fd1c892f2b5e9ed63df629aa084d220730686de252b3030484989254f8c1
incomplete pack file (will be removed): b5bbee347e45875217ad33288e03899cbee0c2fcc0fcc7edb2892eb50ef89990
incomplete pack file (will be removed): c05b64748d945517114f6fede76c95f9336ddf03e13994c5c04f34b7bcd43d95
incomplete pack file (will be removed): ce95f9d896c44eb9da8717276d3959242edff24fcf7f8c79efc2908d34f26645
incomplete pack file (will be removed): d1a46a0406973a86eceed8314470ed6c3a25642ba810c1af3bb2f0e1e275a469
incomplete pack file (will be removed): d2fa90492bac4743f184dc24137247de74ee56afd122e66318c511e39f43f013
incomplete pack file (will be removed): d4c832432f9addf84646adc3b340853b65dcfc97d49d0d74ed9e3bba2a2c7cc9
incomplete pack file (will be removed): d542d0c8a92b5ad1c37169c8fcfdff9ce810df04a2804c9e0e4e8d4db512246b
incomplete pack file (will be removed): e57d166ccd9b367380139b8e563add2e6069af33f7a6f26d68969623e7b493a1
incomplete pack file (will be removed): f4507052a2d20fbea4db14c8c4ce8d1e68af912f3efa77ae42b6c60a015db480
incomplete pack file (will be removed): f51a455d0dbde2fd452bd607f5d1e5b2cf79cca0bca2ae3d6b996ac7bbaeeb19
incomplete pack file (will be removed): f5abb13ab88935727bf60e3032cf28176897bd2f807e53a19c50a2fc6817dd8e
incomplete pack file (will be removed): fad51e77a344f1cf9e38fc7073b9c638d1d407c2d505e6847620d3577084a865
incomplete pack file (will be removed): faf4d42c802b96b95017b492793d174dc14da7a177c40deb049aa16b24c1aae0
incomplete pack file (will be removed): fe735339b138268beb4aadd0ae1e9fa1bcfc79e1b741fd1fb152613e594f66ed
repository contains 148018 packs (826578 blobs) with 671.196 GiB
processed 826578 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 213 snapshots
[0:32] 100.00% 213 / 213 snapshots
Fatal: number of used blobs is larger than number of available blobs!
Please report this error (along with the output of the 'prune' run) at
https://github.com/restic/restic/issues/new
Output of Check
using temporary cache in /tmp/restic-check-cache-061487687
created new cache in /tmp/restic-check-cache-061487687
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read group #5 of 2874 data packs (out of total 148048 packs in 53 groups)
Pack ID does not match, want 39186bde, got 1da9e122
Pack ID does not match, want 04efce75, got a76690ba
[46:52] 100.00% 2874 / 2874 items
Fatal: repository contains errors
Other details
- Backend: B2
- This machine spent at least a week with its clock set incorrectly after a power failure/reboot. Several backups, prunes, and checks were performed while the local machine's clock was set roughly 30 days in the past. Backups, prunes, and checks since the clock was fixed have been successful.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 30 (10 by maintainers)
After a multi-day investigation involving a number of our engineers, we believe we have determined the root cause of this issue.
When data is uploaded to our infrastructure, each file is erasure coded across 20 storage servers, which we call a Backblaze Vault. The files identified by @kmwoley and other Restic customers were all stored in vault-1129.
This vault happened to have disk errors on a number of hard drives. Therefore, it was possible for a download to intermittently be corrupt, as reported. However, because of the way files are stored across multiple servers, the file is still correct and available. No data has been lost.
Backblaze has an integrity checker which slowly runs over the disks looking for issues. In fact, by the time engineering started looking into the problem last week, some of the corrupt parts were already detected, moved aside, and queued to be rebuilt. This made the problem take longer to understand. Going forward, our tools for determining the state of files will be improved. Also, there is a project going live in the next month or so to better detect this issue while reading a file for download.
At this point - we believe the issue has been resolved; vault-1129 is healthy again. If this is not the case, please do let us know by opening a support ticket and referencing this post.
It is a mystery why only restic customers reported this issue to Backblaze, and further why the affected data was all stored on vault-1129. (For context, each vault has 1,200 drives and, as of Dec. 31, 2020, we have 165,530 drives in production, or ~137 vaults. This was probably just a coincidence.)
@kmwoley is correct. We deployed a fix that should finally resolve this issue. Thanks to @kmwoley and everyone who reported the issue. If you continue to see a problem, please file a ticket and reference this issue.
Thanks to everyone who helped diagnose this issue and your patience.
Welp. Bad news. I’m still seeing corruption when I run restic check --read-data… I got an error within 10 min of running --read-data. I’ve sent details to Backblaze support to continue the investigation.
Thanks to @kmwoley for reproducing this issue without restic. Our engineering team has the detail he provided and we are currently investigating.
Hey folks - I can confirm that Backblaze deployed a fix on Feb 26, 2021 (around or before 12:29 PST) that resolved the data reading errors I was experiencing. Many thanks to all of you who helped diagnose the issue and get it in front of Backblaze for attention.
@kmwoley
As you pipe the output of check into another command, that is considered non-interactive usage, and in that case restic doesn’t output any live progress. With restic 0.12.0 it’s now possible to configure this using the RESTIC_PROGRESS_FPS environment variable or by sending restic a SIGUSR1 signal.

Summary
I believe this to be an issue independent of Restic. I have downloaded one pack directly from B2’s web UI that resulted in a checksum mismatch. I have run 2 full data checks from different machines which appear to give random checksum results. Every single manual pack checksum check I have done passes, except for the one documented below.
Manual Checksum Failure - 49791035

In the full --read-data checks, there were 5 packs that showed errors across the two checks. I manually verified each of those packs by downloading them from B2’s Web UI. One resulted in a corrupt file the first time I downloaded it (on Windows, Chrome). On the second download, the file was correct. Here’s the SHA-1 results (to compare against the Backblaze UI):
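The manual verification step described here can be sketched as follows; the pack filename and expected hash are placeholders for the values shown on the B2 file details page:

```shell
# Placeholders: substitute the real pack file and the SHA-1 from the B2 UI.
pack="downloaded.pack"
expected="<sha1-from-b2-file-details>"

actual=$(sha1sum "$pack" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
else
    echo "MISMATCH: got $actual"
fi
```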
Binary Diff
I was curious whether this was a single bit flip, or something else, so I put the corrupt file and the correct file into a binary comparison tool. The result was that about 12% of the file differed… in 5 different locations… not just at the start or end.
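One way to perform that kind of byte-level comparison from the shell, assuming the two downloads are saved as `corrupt.pack` and `good.pack` (placeholder names):

```shell
# cmp -l prints one line per differing byte: offset, old value, new value (octal).
cmp -l corrupt.pack good.pack | wc -l                    # total differing bytes
cmp -l corrupt.pack good.pack | awk '{print $1}' | head  # offsets of the first diffs
```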
I also downloaded the file through the UI 5 more times… each of those subsequent downloads through the UI was correct.
Checking the single file in a loop
I’m currently running the download / shasum of the single file in a loop - I’ll report back on what I find there.
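That loop might look like the following sketch; the download URL is a placeholder for the pack’s actual B2 link (the `b2` CLI would work equally well):

```shell
# Hypothetical: repeatedly download the same pack and record its checksum.
url="https://f002.backblazeb2.com/file/<bucket>/<pack-path>"   # placeholder URL
: > checksums.log
for i in $(seq 1 100); do
    curl -s "$url" -o pack.tmp
    sha1sum pack.tmp >> checksums.log
done
# More than one distinct hash here means downloads are intermittently corrupt.
awk '{print $1}' checksums.log | sort | uniq -c
```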
Summary of --read-data testing

Full Check Results
Windows Machine
Windows 10, Restic 0.11, Wireless Network
Linux Machine
Unraid Linux (backup host), Restic 0.11, wired network.
Packs with errors from both runs
I downloaded each of these packs from B2 using the GUI to validate their checksums. All of them passed, except 49791035 as noted above.

Other notes
--read-data check to see if it made a difference. It did not - I still got errors. 😕