btcd: seems btcd is way slower than bitcoin core in syncing?
I remember using bitcoin core to sync a btc full node on EC2 (200+ GB) in one day. But when I used btcd to sync a full node, two days later it had only synced 108 GB?
The EC2 config is the same.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 3
- Comments: 59 (16 by maintainers)
So the ballast (10 GB) does appear to substantially increase the garbage collection rate:

[graph: with ballast]
[graph: no ballast]

It looks like the greatest decrease here comes from debug.SetGCPercent(10):

[graph: ballast, without debug.SetGCPercent(10)]
[graph: no ballast, without debug.SetGCPercent(10)]

This appears to result in a 3x increase in heap size:

[graph: 10% GC limit w/ ballast]
[graph: 10% GC limit w/ no ballast]
[graph: no limit, no ballast]
[graph: no limit]

I remember reading this one before. I am going to try it out with my fork at https://github.com/p9c/pod and will report back if it gets that kind of result. (It is a different, much smaller chain, but even so it appears to have at least one big GC cleanup every 50-100,000 blocks, and that is at least part of the problem for sure, as one or two blocks end up taking about a minute to process.)
I will be focusing on solving this problem in my project (github.com/p9c/pod), but thanks to Rjected I have also looked at some of the treap code, as well as the script engine, and I am pretty sure both have serious garbage-accumulation problems. Optimising them and aiming for zero runtime allocation will probably go a long way towards a solution.
Incidentally, a part of btcd that I have worked with INTENSELY is the CPU miner. It uses two mutexes, and it constantly stops and creates new goroutines. I built a library that lets me attach an RPC to the standard output, with a small, dedicated worker (two goroutines, but one does the primary work). I use two channels, stop and start. I thought I would need an atomic or a compare-and-swap, but it turns out two channels and a for loop with a 'work mode' inner loop are enough: each part of the loop drains the channel that is not relevant to it (the runner ignores the run signal channel and the pauser ignores the pause signal channel). The lock contention in the original is evidently so bad that a little over 10% of potential performance is chewed up by synchronisation.
I know well enough from what I saw of the script engine and the database drivers/indexers that the programmers who wrote them are clearly former C++/Java programmers: they rely almost entirely on mutexes, the slowest sync primitives, for synchronisation, and where channels are used, it is in places where I would expect to see async calls in those older, less network-focused languages.
For bitcoin forks, especially small, neglected ones like parallelcoin, the sync rate is fine: 8 minutes on my Ryzen 5 1600/SSD/32 GB machine, at a height of about 210,000. The chain barely averages 2 transactions per block. But even so, at around block 99,000 and again around 160,000 it bogs down badly and appears to be mainly garbage collecting. With the typical block payload of bitcoin, I imagine the complexity of the chain of provenance of tokens explodes exponentially, and that graph shows exactly this pattern.
I’m not sure where I will start with it, but I strongly suspect write amplification is also hiding in there, a performance problem well known to exist in LevelDB, and even RocksDB and BoltDB, and resolved in Badger, so the first step will be building a Badger driver. I’d guess that, especially as the number of transactions grows, write amplification causes an issue: every time the database updates a value it has to write the key again as well, combined with the geometric rise in the complexity of the validations that confirm tokens correctly chain back to the coinbase.
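A Badger-backed driver could start from something like the sketch below. This is only an illustration of the Badger API surface such a driver would wrap, not btcd's actual database interface; the path and keys are hypothetical. Badger separates keys from values (WiscKey-style), so rewriting a value during compaction does not rewrite the key alongside it the way LevelDB does, which is the write-amplification concern described above.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open (or create) a Badger database in a local directory.
	db, err := badger.Open(badger.DefaultOptions("blocks_badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A driver would wrap read-write transactions like this one behind
	// btcd's database interface. Key and value here are placeholders.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("blockhash"), []byte("serialized block"))
	})
	if err != nil {
		log.Fatal(err)
	}
}
```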
The second thing I expect to look at is the treaps. There are some parts of btcd that attempt to eliminate runtime allocations, at least one buffer freelist, but there is a lot of creation of small byte slices that are discarded later. As tends to be the case with Go, the naive implementation (I did mention the mutexes; they are a naive use of Go) does not take GC or thread scheduling into account, and when the bottlenecks are really bad, it usually means you have to take over both memory management and scheduling from the Go runtime to get a better result. I already saw one clear example of this: just in the use of isolated processes connected via IPC pipes, using two channels and one ticker instead of 2 or 3 mutexes and 3 different tickers made it produce more than 10% more hashes.
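The freelist pattern mentioned above can be sketched with the standard library's sync.Pool, which recycles the short-lived small byte slices that would otherwise be allocated and immediately become garbage. The `serializeKey` helper and the 32-byte size (a typical hash-sized key) are my own illustrative assumptions, not btcd code.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool recycles small scratch slices instead of allocating a fresh one
// per node/script, which is the garbage pattern described above.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 32) },
}

// serializeKey fills a pooled buffer instead of allocating a new slice on
// every call. Callers must Put the buffer back when they are done with it.
func serializeKey(key byte) []byte {
	buf := bufPool.Get().([]byte)
	for i := range buf {
		buf[i] = key
	}
	return buf
}

func main() {
	buf := serializeKey(0xAB)
	fmt.Printf("key[0]=%#x len=%d\n", buf[0], len(buf))
	bufPool.Put(buf) // recycle rather than leave it for the collector
}
```

The trade-off is lifetime discipline: a pooled buffer must not escape past its Put, which is exactly the kind of "state keeping" the profiling comment below argues for.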
If anyone following is interested, keep an eye on the repo I mentioned above, as over the next 6 months I will be focusing on optimizing everything. I am nearly finished implementing the beta spec for my fork, and I have aimed to keep it accessible and to avoid stomping on too much of what is already there that differs for the chain I am working on. I am still a bit lost as to how to enable full segwit support. You will see that I have merged the btcwallet, btcd, btcutil and btcjson repositories into one, created a unified configuration system, and mostly finished robust handling of running the wallet and node concurrently, for a more conventional integrated node/wallet mode of operation. I understand some of the reasons for separating them so vigorously, but in my opinion, in the absence of a really good SPV implementation, doing so is a step backwards.
Based on watching CPU utilization and a little profiling, I can see large empty stretches with the CPU doing literally nothing for about 60-70% of the time during the sync process. So I am leaning strongly towards the idea that synchronization is the biggest issue, garbage generation second, and write amplification third, due to the database's log structure and the updating of block-node metadata in the database.
In my profiling, garbage collection and runtime operations were taking up a lot of CPU time during sync, which was worrying, so I profiled some more (for allocations), and immutable treap operations were by far the biggest allocators, so that may be the issue. Here's what I have for allocations when syncing 2012 blocks:

[allocation profile]

For cloneTreapNode:

[profile excerpt]

And parseScriptTemplate:

[profile excerpt]

So these combined account for about 70% of allocations, but less than 10% of in-use space at runtime. In both cases, keeping some state so we reduce the number of allocations (and deallocations) would be beneficial. I bet replacing the immutable treaps in the dbcache would really help sync speed. The LevelDB calls are fine, they are all fairly lightweight - IMO the issue is the cache. More profiling probably needs to be done, but my guess is the dbcache.
For me it was a dealbreaker and I switched to bitcoin core 😦 I synced the whole blockchain in a little more than a day. With btcd, after a few days I was still at about 60% of the height and about 30% of the size. 😦 It makes me sad, I would like to see more diversity, but with this I have no other option 😦

dcrd has an open issue related to multi-peer downloads, https://github.com/decred/dcrd/issues/1145, which appears to be partially done.
I haven't been following it closely enough to know whether it is likely to be backportable to btcd, but it might be interesting to take a look at.
Ok, same issue here, btcd is ridiculously slow.
This seems to be mostly goleveldb related (as others have stated). The following PR seems highly relevant: https://github.com/syndtr/goleveldb/pull/338

bitcoind configures leveldb based on a memory flag. I am using WriteBuffer: defaultCacheSize, BlockCacheCapacity: 2*defaultCacheSize for similar results. It would be nice to have a flag for cache/memory usage to fine-tune this. I am also using NoSync: true for the initial sync, and I am syncing from bitcoind on localhost. That said, the initial sync is still days off, but at least it is making decent progress.
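The goleveldb settings mentioned above would look something like the sketch below. The `defaultCacheSize` constant and its 64 MiB value are placeholders (the comment does not say what value was used); WriteBuffer, BlockCacheCapacity, and NoSync are real opt.Options fields. NoSync skips fsync on writes, which is fast but can lose data on a crash, so it only makes sense during an initial sync that can be restarted.

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

// Hypothetical cache size; bitcoind exposes the equivalent as -dbcache.
const defaultCacheSize = 64 << 20 // 64 MiB

func main() {
	opts := &opt.Options{
		WriteBuffer:        defaultCacheSize,     // memtable before flush to disk
		BlockCacheCapacity: 2 * defaultCacheSize, // cache for decompressed blocks
		NoSync:             true,                 // skip fsync: unsafe on crash, fast during IBD
	}
	db, err := leveldb.OpenFile("blocks_leveldb", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

Exposing these through a btcd config flag, as the comment suggests, would just mean deriving these numbers from a user-supplied memory budget.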
I want to confirm that my memory-pressure measurements today show that cloneTreapNode is the chief culprit. The second worst is ecdsa.Verify; there are a surprising number of make calls deep in big.nat. The third worst is the leveldb find function.