bitcoin: IBD stalls permanently with Bitcoin core v23 after "ignoring getheaders from peer=0 because node is in initial block download"

With Bitcoin core v23 (Qt versin on Windows, or with the daemon from commandline), the IBD process can stall indefinitely after receiving a large block:

2022-08-07T13:24:50Z [net.cpp:3043] [CNode] Added connection to 104.248.143.83:8333 peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending version (102 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:1146] [PushNodeVersion] send version message: version 70016, blocks=239242, them=104.248.143.83:8333, txrelay=1, peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: version (102 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending verack (0 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending getaddr (0 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2754] [ProcessMessage] receive version message: /Satoshi:0.20.1/: version 70015, blocks=748403, us=104.248.143.83:8333, txrelay=1, peer=0, peeraddr=104.248.143.83:8333
2022-08-07T13:24:50Z [timedata.cpp:57] [AddTimeData] added time data, samples 2, offset -1 (+0 minutes)
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: verack (0 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2797] [ProcessMessage] New outbound peer connected: version: 70015, blocks=748403, peer=0, peeraddr=104.248.143.83:8333 (manual)
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending sendheaders (0 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending sendcmpct (9 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending sendcmpct (9 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending ping (8 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:4650] [SendMessages] initial getheaders (319216) to peer=0 (startheight:748403)
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending getheaders (997 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending feefilter (8 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: sendheaders (0 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: sendcmpct (9 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: sendcmpct (9 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: ping (8 bytes) peer=0
2022-08-07T13:24:50Z [net.cpp:3060] [PushMessage] sending pong (8 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: getheaders (1029 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:3223] [ProcessMessage] Ignoring getheaders from peer=0 because node is in initial block download
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: feefilter (8 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:4051] [ProcessMessage] received: feefilter of 0.00001000 BTC/kvB from peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: pong (8 bytes) peer=0
2022-08-07T13:24:50Z [net_processing.cpp:2558] [ProcessMessage] received: headers (162003 bytes) peer=0

Then any other attemps to connect to other peers (or accept incoming connections) stalls permanently (we can waitfior hours: no network activity, no disk activity, but some thread uses 100% of a CPU core, probably in a tight loop generated by a deadlock and infinite attemps to recover from it).

If we try to shutdown, al most all databases thread are closed, the RPC services are closed, But the conenction manager remains locked for about ~15 minutes, not doing anything.

In fact the received headers are not even starting to be parse until AFTER these ~15 minutes. At which time we see that in logs: (note: I’ve enabled logtimestamps=1, logsourcelocations=1 and logthreadnames=1 in bitcoin.conf to have more traceable traces; and run it with debug=all).

It seems that a lock is kept somewhere in “net_processing.cpp” after the ProcessMessage() exists with “Ignoring getheaders from peer=0 because node is in initial block download”

Only when we shut down the node we see instantly:

2022-08-07T13:26:17Z [qt/bitcoin.cpp:207] [DebugMessageHandler] GUI: requestShutdown : Requesting shutdown
2022-08-07T13:26:17Z [rpc/server.cpp:302] [operator()] Interrupting RPC
2022-08-07T13:26:17Z [rpc/server.cpp:314] [operator()] Stopping RPC
2022-08-07T13:26:17Z [init.cpp:371] [OnRPCStopped] RPC stopped.
2022-08-07T13:26:17Z [qt/bitcoin.cpp:207] [DebugMessageHandler] GUI: Running Shutdown in thread
2022-08-07T13:26:17Z [httpserver.cpp:436] [InterruptHTTPServer] Interrupting HTTP server
2022-08-07T13:26:17Z [httprpc.cpp:312] [InterruptHTTPRPC] Interrupting HTTP RPC server
2022-08-07T13:26:17Z [init.cpp:205] [Shutdown] Shutdown: In progress...
2022-08-07T13:26:17Z [httprpc.cpp:317] [StopHTTPRPC] Stopping HTTP RPC server
2022-08-07T13:26:17Z [httpserver.cpp:657] [UnregisterHTTPHandler] Unregistering HTTP handler for / (exactmatch 1)
2022-08-07T13:26:17Z [httpserver.cpp:657] [UnregisterHTTPHandler] Unregistering HTTP handler for /wallet/ (exactmatch 0)
2022-08-07T13:26:17Z [httpserver.cpp:448] [StopHTTPServer] Stopping HTTP server
2022-08-07T13:26:17Z [httpserver.cpp:450] [StopHTTPServer] Waiting for HTTP worker threads to exit
2022-08-07T13:26:17Z [util/thread.cpp:19] [TraceThread] opencon thread exit
2022-08-07T13:26:17Z [util/thread.cpp:19] [TraceThread] dnsseed thread exit
2022-08-07T13:26:17Z [util/thread.cpp:19] [TraceThread] addcon thread exit
2022-08-07T13:26:17Z [httpserver.cpp:291] [ThreadHTTP] Exited http event loop
2022-08-07T13:26:17Z [httpserver.cpp:463] [StopHTTPServer] Waiting for HTTP event thread to exit
2022-08-07T13:26:17Z [httpserver.cpp:475] [StopHTTPServer] Stopped HTTP server
2022-08-07T13:26:17Z [mapport.cpp:210] [ProcessUpnp] UPNP_DeletePortMapping() returned: 0
2022-08-07T13:26:17Z [util/thread.cpp:19] [TraceThread] mapport thread exit

And then we wait for about 15-20 minutes to finally see this: the received block headers are processed, an attempt for sending another requests for headers is emitted, but all peers are finally disconnected (because we were shutting down), and all remaining data in moemory is flushed to disk and the node finally exits correctly.

2022-08-07T13:43:47Z [validation.cpp:3606] [ProcessNewBlockHeaders] Synchronizing blockheaders, height: 321216 (~43.64%)
2022-08-07T13:43:47Z [net_processing.cpp:2198] [ProcessHeadersMessage] more getheaders (321216) to end to peer=0 (startheight:748403)
2022-08-07T13:43:47Z [net.cpp:1262] [CreateNodeFromAcceptedSocket] connection from [2604:d500:4:1::2]:63949 accepted
2022-08-07T13:43:47Z [util/thread.cpp:19] [TraceThread] net thread exit
2022-08-07T13:43:47Z [net.cpp:3060] [PushMessage] sending getheaders (997 bytes) peer=0
2022-08-07T13:43:47Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=txindex, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:47Z [util/thread.cpp:19] [TraceThread] msghand thread exit
2022-08-07T13:43:47Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=txindex, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:47Z [util/thread.cpp:19] [TraceThread] txindex thread exit
2022-08-07T13:43:47Z [random.cpp:520] [SeedPeriodic] Feeding 1009016 bytes of dynamic environment data into RNG
2022-08-07T13:43:47Z [net.cpp:1846] [DumpAddresses] Flushed 6626 addresses to peers.dat  701ms
2022-08-07T13:43:47Z [net.cpp:567] [CloseSocketDisconnect] disconnecting peer=0
2022-08-07T13:43:47Z [net_processing.cpp:1282] [FinalizeNode] Cleared nodestate for peer=0
2022-08-07T13:43:47Z [net.cpp:567] [CloseSocketDisconnect] disconnecting peer=1
2022-08-07T13:43:47Z [net_processing.cpp:1282] [FinalizeNode] Cleared nodestate for peer=1
2022-08-07T13:43:47Z [net.cpp:1846] [DumpAddresses] Flushed 6626 addresses to peers.dat  15ms
2022-08-07T13:43:47Z [util/thread.cpp:19] [TraceThread] scheduler thread exit
2022-08-07T13:43:47Z [validation.cpp:4644] [DumpMempool] Writing 0 unbroadcast transactions to disk.
2022-08-07T13:43:47Z [validation.cpp:4654] [DumpMempool] Dumped mempool: 0s to copy, 0.036892s to dump
2022-08-07T13:43:47Z [policy/fees.cpp:998] [FlushUnconfirmed] Recorded 0 unconfirmed txs from mempool in 0s
2022-08-07T13:43:47Z [logging/timer.h:57] [Log] FlushStateToDisk: write block and undo data to disk started
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block and undo data to disk completed (38.02ms)
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block index to disk started
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=index, before=0.0MiB, after=0.3MiB
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block index to disk completed (36.98ms)
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write coins cache to disk (0 coins, 0kB) started
2022-08-07T13:43:48Z [txdb.cpp:161] [BatchWrite] Writing final batch of 0.00 MiB
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=chainstate, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:48Z [txdb.cpp:163] [BatchWrite] Committed 0 changed transaction outputs (out of 0) to coin database...
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write coins cache to disk (0 coins, 0kB) completed (0.00ms)
2022-08-07T13:43:48Z [validationinterface.cpp:243] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=000000000000012a873661c59cd6301ace5a9108786284693eb0e35038713f4d
2022-08-07T13:43:48Z [validationinterface.cpp:243] [operator()] ChainStateFlushed: block hash=000000000000012a873661c59cd6301ace5a9108786284693eb0e35038713f4d
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=db, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=db, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block and undo data to disk started
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block and undo data to disk completed (31.27ms)
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block index to disk started
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=index, before=0.3MiB, after=0.3MiB
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write block index to disk completed (15.63ms)
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write coins cache to disk (0 coins, 0kB) started
2022-08-07T13:43:48Z [txdb.cpp:161] [BatchWrite] Writing final batch of 0.00 MiB
2022-08-07T13:43:48Z [dbwrapper.cpp:198] [WriteBatch] WriteBatch memory usage: db=chainstate, before=0.0MiB, after=0.0MiB
2022-08-07T13:43:48Z [txdb.cpp:163] [BatchWrite] Committed 0 changed transaction outputs (out of 0) to coin database...
2022-08-07T13:43:48Z [logging/timer.h:57] [Log] FlushStateToDisk: write coins cache to disk (0 coins, 0kB) completed (0.00ms)
2022-08-07T13:43:48Z [validationinterface.cpp:243] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=000000000000012a873661c59cd6301ace5a9108786284693eb0e35038713f4d
2022-08-07T13:43:48Z [init.cpp:322] [Shutdown] Shutdown: done
2022-08-07T13:43:48Z [qt/bitcoin.cpp:207] [DebugMessageHandler] GUI: Shutdown finished
2022-08-07T13:43:48Z [qt/bitcoin.cpp:207] [DebugMessageHandler] GUI: ~InitExecutor : Stopping thread
2022-08-07T13:43:48Z [qt/bitcoin.cpp:207] [DebugMessageHandler] GUI: ~InitExecutor : Stopped thread

If we relaunch the node, it loads all data successfully, connects to a new peer=0 and requests another set of headers, then stalls again exactly as above, and we have to shutdown and wait 15-20 minutes exactly the same way as above. A single connection is made instantly to download a single nex set of headers, and then stalls forever. Other connections are added to the list (the UPnP discovery occurs, other peers are also discovered), but will never start. These extra connections are locked again until…

We shutdown the UI and then wait for 15-20 minutes when the RPC services are closed. These pending connections are then open and closed immediately and the UI closes.

Result: we have been able to download a single set of headers at each run (less than 1 minute to start and then seeing the lock after the 1st download of headers, just about 180KB, then no network or disk activity at all dor an infinite time, then shutdown and wait 15-20 minutes).

Expected behavior

IBD should not stall. It should also connect to other peers, and not just to the initial one (the QT interface dispays constantly: “connecting to peers…”. System information

Bitcoin v23 for Windows (from https://bitcoincore.org/en/download/, extracted from the ZIP file, because the current EXE installer does not run on a Windows system enforcing ASLT)

Windows 11 x64 (Intel Core i7), 32 GB RAM

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 54 (18 by maintainers)

Most upvoted comments

So, for whatever reason, your node has marked block 277117 (0000000000000002cdf7bce4c09c0849f789a9508d812d3ca430429f6213688b) incorrectly as invalid. It’s not synchronizing further, because it is treating every peer that hands it block 277177 as malicious. We’ll need to figure out why that is happening. Your comment above about running out of file descriptors seems like a useful lead.

First, we need to know if it is the block on disk itself that is corrupted, or the validation database. It’s likely the latter, but to find out, run reconsiderblock 0000000000000002cdf7bce4c09c0849f789a9508d812d3ca430429f6213688b, and report what you see in debug.log at that time.

Do I need to manually delete some recent blk0*.dat and rev0*.dat and restart indexing again and again?

If the blocks themselves are corrupted, you’ll need to use -reindex. If it’s just the database that is corrupted, -reindex-chainstate should suffice. Given that it sounds like you’ve tried the latter several times and that doesn’t help, it’s either a problem with the blocks themselves, or somehow your Bitcoin Core instance is corrupting itself time after time. Looking at the reconsiderblock output will help figure this out.

Are they caused by misbehaving nodes (or malwares) in the P2P network distributing wrong blocks and causing normal nodes in IBD to fail or stall?

Misbehaving nodes should never be able to cause Bitcoin Core to incorrectly mark a block as invalid, so no.

May be this is the sign that Bitcoin Core does not fully apply the “consensus” rules for validating new blocks, and that it accepts them too early

No.

But may be there are other competing softwares that could work such as https://www.bitcoinabc.org/ (if they’ve detected such attack and are able to isolate and fix it)?

That software is for a different currency, with different consensus rules. It will not synchronize with the Bitcoin network/chain/currency.

sipa on Aug 11, 2022