bitcoin: Do not crash if peers.dat is corrupted

When peers.dat is corrupted, an error message is shown: Invalid or corrupt peers.dat (Checksum mismatch, data corrupted). The node then restarts.

Most of our users aren’t technical enough to manually delete the peers.dat file, nor can we detect the problem for them. As a result, this error creates a lot of work for our support team whenever somebody is affected.

peers.dat isn’t an essential file, so Bitcoin Core should be able to restart without crashing.

A bash workaround to detect the checksum mismatch would also considerably help us.
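A check of this kind could be sketched as follows, in Python rather than bash. This is a minimal sketch, not Bitcoin Core's own code: it assumes the file ends with a 32-byte double-SHA256 digest of all the bytes before it, which is what the serialization code discussed in this thread suggests. The function names and the ".corrupt" suffix are hypothetical.

```python
import hashlib
from pathlib import Path

CHECKSUM_LEN = 32  # assumption: the file ends with a 32-byte double-SHA256 digest


def peers_checksum_ok(raw: bytes) -> bool:
    """Return True if the trailing 32 bytes equal SHA256(SHA256(payload))."""
    if len(raw) < CHECKSUM_LEN:
        return False
    payload, stored = raw[:-CHECKSUM_LEN], raw[-CHECKSUM_LEN:]
    expected = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return stored == expected


def quarantine_if_corrupt(path: Path) -> bool:
    """Rename a corrupt file to <name>.corrupt; return True if it was moved."""
    if not path.exists() or peers_checksum_ok(path.read_bytes()):
        return False
    # Keep the bad file for later analysis instead of deleting it.
    path.rename(path.with_name(path.name + ".corrupt"))
    return True
```

Such a check would have to run while the node is stopped, for example in a container entrypoint before bitcoind starts, so that the file is not being rewritten while it is read.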

About this issue

State: closed
Created 2 years ago
Reactions: 1
Comments: 37 (27 by maintainers)

Commits related to this issue

If peers data is corrupted, move it. (https://github.com/bitcoin/bitcoin/issues/26599) — committed to btcpayserver/dockerfile-deps by NicolasDorier a year ago
Merge bitcoin/bitcoin#26909: net: prevent peers.dat corruptions by only serializing once 5eabb61b2386d00e93e6bbb2f493a56d1b326ad9 addrdb: Only call Serialize() once (Martin Zumsande) da6c7aeca38e1d0a... — committed to bitcoin-core/gui by deleted user a year ago
Merge bitcoin/bitcoin#26909: net: prevent peers.dat corruptions by only serializing once 5eabb61b2386d00e93e6bbb2f493a56d1b326ad9 addrdb: Only call Serialize() once (Martin Zumsande) da6c7aeca38e1d0a... — committed to syscoin/syscoin by deleted user a year ago

Most upvoted comments

I believe I’ve found the bug that caused this with the help of the provided peers.dat (which was completely ok as far as I can see, just that the checksum was wrong, and when overwriting the bad checksum with the correct one it would load correctly):

Every 15 minutes, the scheduler thread will dump peers.dat to disk - for this it calls https://github.com/bitcoin/bitcoin/blob/f4ef856375c5b295d78169b136c6aee928c19bc9/src/addrdb.cpp#L38-L40

which first writes the data (i.e. AddrMan) into the stream, and then writes the same data into a hasher - which then provides the hash that is added to the stream in the third line. The problem is that AddrMan can change in between the first two calls (e.g. if we receive a new address), and then the data and hash won’t match anymore and the written file is corrupt.
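The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyAddrMan and both dump functions are hypothetical stand-ins, not Bitcoin Core code, and the `between` callback simulates another thread changing the address manager between the two steps.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


class ToyAddrMan:
    """Hypothetical stand-in for AddrMan: just a list of address strings."""

    def __init__(self):
        self.addrs = []

    def serialize(self) -> bytes:
        return "\n".join(self.addrs).encode()


def dump_serializing_twice(addrman, between=lambda: None) -> bytes:
    """Buggy pattern: serialize once for the file and again for the hash."""
    body = addrman.serialize()
    between()  # another thread could modify addrman here
    checksum = double_sha256(addrman.serialize())
    return body + checksum


def dump_serializing_once(addrman, between=lambda: None) -> bytes:
    """Safer pattern: hash exactly the bytes that will be written."""
    body = addrman.serialize()
    between()  # a concurrent change no longer affects this dump
    return body + double_sha256(body)


def checksum_ok(blob: bytes) -> bool:
    """Compare the trailing 32 bytes with the hash of everything before them."""
    return double_sha256(blob[:-32]) == blob[-32:]


addrman = ToyAddrMan()
addrman.addrs.append("1.2.3.4:8333")
new_address_arrives = lambda: addrman.addrs.append("5.6.7.8:8333")

print(checksum_ok(dump_serializing_twice(addrman, new_address_arrives)))  # False
print(checksum_ok(dump_serializing_once(addrman, new_address_arrives)))   # True
```

Serializing once and hashing the resulting bytes means the checksum always matches what is written, whatever happens to the in-memory state afterwards.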

I could reproduce this by adding a sleep for the scheduler thread in between the two writes of data, manually adding artificial addresses with addpeeraddress during this sleep, and then killing bitcoind (so that it can’t correct the peers.dat at a clean shutdown). That way, I would corrupt my own peers.dat.

I will work on a fix!

mzumsande on Jan 12, 2023

The peers.dat file is designed to avoid having to reach out to the DNS or hardcoded seeds more than once, as this is the moment your node is most susceptible to being poisoned with attacker IP addresses and, perhaps in the future, blocks and transactions.

If the file becomes corrupted then anchors.dat should help protect the node from a successful future eclipse attack, but new addresses will have to either be added manually or fetched from DNS or hardcoded seeds again.

I agree that the best course of action here is to find out what’s corrupting peers.dat and fix that, rather than have Core silently ignore errors on something that could be used as a first step towards eclipse attacking you…

Side note: it does make me wonder whether it could be worth having certain runtime “profiles”. For example I have seen software with “paranoia level” settings, and we could perhaps have something like

Paranoid: Fail on detecting any corrupt file, data, etc.; debugging enabled on many categories by default; notification of re-orgs >= n blocks, etc.
Normal (default): Something similar to today’s defaults.
Resilient: Try to recover from more non-critical errors to stay operational. Automatically restart and rebuild broken indexes, etc. if needed.

willcl-ark on Dec 1, 2022

I will ask for the next time it happens to save the peers.dat so we can analyze it.

NicolasDorier on Dec 1, 2022

It means that this error gives us lots of work

Do you happen to know why it corrupts? If it is due to hardware error, it might be scary to just continue, because it might also corrupt wallet.dat.

maflcko on Nov 29, 2022

So if your node crashes every few days, it sounds like you have another, unrelated problem.

@beeduul do you want to follow up with a new issue, providing more info if possible? Assuming this isn’t a hardware related problem.

fanquake on May 22, 2023

Although this issue has been marked as fixed for the next release, I’ll leave this additional note here for posterity.

This issue happens to me every few days on my 4 GB Pi Umbrel. It appears that immediately before each crash, the log contains Socks5() connect to xxx.xxx.xxx.xxx:8333 failed: InterruptibleRecv() timeout or other failure.

To be clear: the fix doesn’t prevent any crashes from happening - what it fixes is that if the node crashes for some unrelated reason, peers.dat shouldn’t get corrupted anymore (which would only be visible at the next startup). So if your node crashes every few days, it sounds like you have another, unrelated problem.

mzumsande on May 20, 2023

I opened #26909 to fix this.

mzumsande on Jan 17, 2023

To me it sounds like a bug that should be fixed and not silently ignored

maflcko on Dec 1, 2022