exonum: Performance degradation on a long-running network
Description:
We have a demo network of 4 nodes running in AWS. At the start, the network performs well and blocks are committed according to the configured timeouts. After a couple of weeks, performance starts degrading: intervals between blocks grow to about 30 seconds, and the height of one of the nodes falls far behind the height of the others.
There is still plenty of free RAM (>3 GB) and SSD space (>10 GB). The size of the DB folder is about 4 GB.
The most interesting part is that restarting all node processes fixes the issue. I’m not sure whether the laggard node manages to catch up with the others, but the other nodes start working as they did at the beginning.
It would be great to know how to fix this issue, or at least where to look and how to efficiently debug such cases.
Thanks!
Configuration:
Network size: 4 nodes
Node OS: Ubuntu 18.04
Exonum version: 1.0.0
Consensus configuration:
[consensus]
first_round_timeout = 3000
max_message_len = 1048576
max_propose_timeout = 2000
min_propose_timeout = 1000
peers_timeout = 10000
propose_timeout_threshold = 50
status_timeout = 5000
txs_block_limit = 1000
Steps to reproduce:
- Start a network of 4 nodes in the same AWS region
- Wait for a couple of weeks
Result:
Network performance is degraded. Intervals between blocks become about 30 seconds.
Expected result:
Network performance is not degraded.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 41 (17 by maintainers)
UPD. By now I’ve found out that the reason for this behaviour is the continuously growing size of the RocksDB WAL journal. Manually decreasing the size of the WAL journal here seems to fix the issue. I’m trying to find out how to configure the WAL journal correctly and will provide more details and a patch soon.
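A minimal sketch of how the WAL size could be bounded, assuming the rust-rocksdb crate that merkledb is built on; the function name, the path argument and the 20 MB value (taken from the limit mentioned further down in this thread) are only for illustration, not the actual Exonum code:

use rocksdb::{Options, DB};

// Open a database with a cap on the total size of the write-ahead log.
fn open_with_bounded_wal(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Once all WAL files together exceed this many bytes, RocksDB flushes the
    // column families backed by the oldest WAL file so old log files can be recycled.
    opts.set_max_total_wal_size(20 * 1024 * 1024);
    DB::open(&opts, path)
}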
We have investigated the issue using callgrind. Attached are three callgrind outputs obtained from running a plain node without executing transactions. The callgrind dump was always triggered after handle_majority_precommits, i.e. after each block.
Exonum was built using RUSTFLAGS="-Cforce-frame-pointers=yes" cargo build --release, with the following added to Cargo.toml for proper debugging symbols:
Afterwards, the callgrind command line was:
- callgrind.out.80761.10-01: snapshot after block 10
- callgrind.out.80761.601-01: snapshot after block 601
- …now the node was restarted…
- callgrind.out.80915.603-01: snapshot after block 603 on the restarted node
The time spent in handle_majority_precommits was ~14 ms for block 10, and already ~40 ms for block 601. After restarting the node, it was down to ~15 ms again for block 603.
You can investigate the files with KCacheGrind or QCacheGrind. It might be worth comparing especially the dumps for block 601 (before restart) and 603 (after restart), where it appears that the issue could be with the iterators on RocksDb. Those seem to take longer and longer with each block. [1]
These iterators appear to be called through exonum::merkledb::backends::rocksdb::RocksDbSnapshot::iter(). I have issues following the control flow in cachegrind, but naively it seems to be called somewhere like this:
According to the link [1] below, this could be relieved by replacing self.snapshot.iterator_cf with self.snapshot.iterator_cf_opt and passing additional options such as iterate_upper_bound (see the sketch below), but I don’t know enough about Exonum to propose a fix yet. I hope this can help shed some light on the issue, and I can run more of these tests if need be.
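To illustrate the suggested change, here is a hedged sketch at the rust-rocksdb level: pass a ReadOptions with iterate_upper_bound to iterator_cf_opt so the iterator stops at the end of the requested key range instead of stepping over tombstones beyond it. This is not merkledb’s actual iter() implementation; the helper function, its parameters and the prefix handling are made up for the example.

use rocksdb::{ColumnFamily, Direction, IteratorMode, ReadOptions, Snapshot};

// Count the entries whose keys start with `prefix`, bounding the scan from above.
fn count_prefix(snapshot: &Snapshot<'_>, cf: &ColumnFamily, prefix: &[u8]) -> usize {
    // Compute the smallest key that is strictly greater than every key with this prefix.
    let mut upper = prefix.to_vec();
    let mut bounded = false;
    for i in (0..upper.len()).rev() {
        if upper[i] != 0xff {
            upper[i] += 1;
            upper.truncate(i + 1);
            bounded = true;
            break;
        }
    }
    let mut opts = ReadOptions::default();
    if bounded {
        // Keys at or beyond this bound are never visited, so the iterator does
        // not have to skip tombstones that lie past the prefix range.
        opts.set_iterate_upper_bound(upper);
    }
    snapshot
        .iterator_cf_opt(cf, opts, IteratorMode::From(prefix, Direction::Forward))
        .count()
}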
[1] Consider this blog post about the performance of RocksDb iterators in the presence of tombstones and how to improve it.
callgrind.out.zip
Sorry, I haven’t gotten around to wrapping up our investigation on this, yet…
We have run the example with a limit on max_total_wal_size for several days, and that does fix this issue. However, it causes the number of open files to increase permanently, until the OS limit (run ulimit -n) is reached. But the good news is that the combination of limiting max_total_wal_size and defaults.set_max_open_files seems to resolve both issues! So yes, exposing max_total_wal_size (max_open_files is already configurable) would definitely help.

I asked about this issue on https://github.com/facebook/rocksdb but I haven’t received any answer yet. What I can propose right now is to add a property to the config by which you could set max_total_wal_size. E.g., I’ve checked that a limit of 20 MB fixes this issue.

Thank you for this report. I tried to find the reason for the performance regression by profiling and got the same result. Now I’m trying to understand what has changed and how to fix it, because I have a service on Exonum 0.12.2 which was started 9 months ago and works fine; there is no such problem there. Maybe tuning the RocksDB settings could help fix the issue.
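Building on the sketch above, roughly how the combination that resolved both issues could look, again assuming the rust-rocksdb crate; the 20 MB WAL limit and the 512-file cap are illustrative values, not the actual Exonum patch:

use rocksdb::{Options, DB};

fn open_with_bounded_wal_and_fds(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Bound the total WAL size so obsolete log files get flushed and recycled.
    opts.set_max_total_wal_size(20 * 1024 * 1024);
    // Cap the number of files RocksDB keeps open at once, so the process
    // does not creep up to the descriptor limit reported by `ulimit -n`.
    opts.set_max_open_files(512);
    DB::open(&opts, path)
}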
Here is another two-day test.
Deployment: local
Binary: exonum_cryptocurrency_advanced
Number of nodes: 2
Configuration: default