exonum: Performance degradation on a long running network

Description:

We have a demo network of 4 nodes running in AWS. At the start, the network performs well and blocks are committed according to the configured timeouts. After a couple of weeks, performance starts degrading: intervals between blocks grow to about 30 seconds, and the height of one of the nodes falls far behind the height of the others.

There is still plenty of free RAM (>3 GB) and SSD space (>10 GB). The size of the DB folder is about 4 GB.

The most interesting part is that restarting all node processes fixes the issue. I'm not sure whether the lagging node is able to catch up with the others, but the other nodes start working as they did at the beginning.

It would be great to know how to fix this issue, or at least where to look and how to debug such cases efficiently.

Thanks!

Configuration:

Network size: 4 nodes
Node OS: Ubuntu 18.04
Exonum version: 1.0.0
Consensus configuration:

[consensus]
first_round_timeout = 3000
max_message_len = 1048576
max_propose_timeout = 2000
min_propose_timeout = 1000
peers_timeout = 10000
propose_timeout_threshold = 50
status_timeout = 5000
txs_block_limit = 1000

Steps to reproduce:

  1. Start a network of 4 nodes in the same region in AWS
  2. Wait for a couple of weeks

Result:

Network performance is degraded. Intervals between blocks become about 30 seconds.

Expected result:

Network performance is not degraded.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 41 (17 by maintainers)

Most upvoted comments

UPD. By now I've found out that the reason for this behaviour is the continuously growing size of the RocksDB WAL journal. Manually limiting the WAL size here seems to fix the issue:

defaults.set_max_total_wal_size(1 << 20);

I’m trying to find out how to configure the WAL journal correctly and will provide more details and a patch soon.

We have investigated the issue using callgrind. Attached are three callgrind outputs obtained from running a plain node without executing transactions. The callgrind dump was always triggered after handle_majority_precommits, i.e. after each block.

Exonum was built using RUSTFLAGS="-Cforce-frame-pointers=yes" cargo build --release with the following added to Cargo.toml for proper debugging symbols:

[profile.release]
debug = true

Afterwards, the callgrind command line was:

valgrind --tool=callgrind --collect-jumps=yes --separate-threads=yes --dump-after='*handle_majority_precommits*' ./target/release/ledger run --node-config node.toml --db-path db --master-key-pass pass:""

callgrind.out.80761.10-01: snapshot after block 10
callgrind.out.80761.601-01: snapshot after block 601
…now the node was restarted…
callgrind.out.80915.603-01: snapshot after block 603 on the restarted node

The time spent in handle_majority_precommits was ~14 ms for block 10, and already ~40 ms for block 601. After restarting the node, it was down to ~15 ms again for block 603.

You can investigate the files with KCacheGrind or QCacheGrind. It is especially worth comparing the dumps for block 601 (before restart) and 603 (after restart), where it appears that the issue could be with the iterators on RocksDb. Those seem to take longer and longer with each block. [1]

These iterators appear to be called through exonum::merkledb::backends::rocksdb::RocksDbSnapshot::iter(). I have trouble following the control flow in cachegrind, but naively it seems to be called somewhere like this:

exonum::runtime::dispatcher::Dispatcher::commit_block()
\___
   +- exonum::runtime::dispatcher::schema::Schema<exonum::merkledb::db::Fork>::activate_pending()
   +- exonum::runtime::dispatcher::schema::Schema<exonum::merkledb::db::Fork>::take_modified_instances()
   +- exonum::runtime::dispatcher::schema::Schema<exonum::merkledb::db::Fork>::take_pending_artifacts()
   \_____ exonum::merkledb::views::View::iter_bytes()
         \_____ exonum::merkledb::backends::rocksdb::RocksDbSnapshot::rocksdb_iter()
exonum::runtime::dispatcher::Dispatcher::before_transactions()
\___ exonum::merkledb::Patch::iter()
    \_____ exonum::merkledb::backends::rocksdb::RocksDbSnapshot::rocksdb_iter()

According to the link [1] below, this could be alleviated by replacing self.snapshot.iterator_cf with self.snapshot.iterator_cf_opt and passing additional options such as iterate_upper_bound, but I don’t know enough about Exonum to propose a fix yet.
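For illustration only, here is a minimal sketch of that idea using the rust-rocksdb crate directly (iter_index_bounded and next_prefix are hypothetical helpers; the actual exonum-merkledb code would look different):

use rocksdb::{DB, Direction, IteratorMode, ReadOptions};

// Iterate only the keys of one index by bounding the scan with the index
// prefix, so RocksDB can stop at the bound instead of stepping over the
// tombstones that lie behind it.
fn iter_index_bounded(db: &DB, cf_name: &str, prefix: &[u8]) {
    let cf = db.cf_handle(cf_name).expect("column family exists");

    let mut opts = ReadOptions::default();
    // The upper bound is exclusive; an empty result means no usable bound.
    let upper = next_prefix(prefix);
    if !upper.is_empty() {
        opts.set_iterate_upper_bound(upper);
    }

    for item in db.iterator_cf_opt(cf, opts, IteratorMode::From(prefix, Direction::Forward)) {
        let _ = item; // process (key, value) pairs of this index only
    }
}

// Returns a key greater than every key starting with `prefix`: drop trailing
// 0xff bytes, then increment the last byte (empty result for an all-0xff prefix).
fn next_prefix(prefix: &[u8]) -> Vec<u8> {
    let mut upper = prefix.to_vec();
    while let Some(&last) = upper.last() {
        if last == 0xff {
            upper.pop();
        } else {
            *upper.last_mut().unwrap() += 1;
            break;
        }
    }
    upper
}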

I hope this helps shed some light on the issue, and I can run more of these tests if need be.

[1] Consider this blog post about the performance of RocksDb iterators in the presence of tombstones and how to improve it.

callgrind.out.zip

Sorry, I haven’t gotten around to wrapping up our investigation on this, yet…

We have run the example with a limit on max_total_wal_size for several days and that does fix this issue. However, it causes the number of open files to increase steadily until the OS limit (see ulimit -n) is reached.

But, the good news is that the combination of limiting max_total_wal_size and defaults.set_max_open_files seems to resolve both issues!

So yes, exposing max_total_wal_size (max_open_files is already configurable) would definitely help.
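For concreteness, a minimal sketch of combining the two limits, using the rust-rocksdb crate directly with illustrative values (in Exonum these options would have to be plumbed through the node/merkledb configuration rather than set by hand):

use rocksdb::{DB, Options};

fn open_with_limits(path: &str) -> Result<DB, rocksdb::Error> {
    let mut defaults = Options::default();
    defaults.create_if_missing(true);

    // Cap the total size of WAL files (20 MB here, the value reported below)
    // so RocksDB recycles old journal files instead of letting them grow unbounded.
    defaults.set_max_total_wal_size(20 << 20);

    // Bound the number of files RocksDB keeps open, so the WAL limit above
    // does not gradually exhaust the process file-descriptor limit.
    defaults.set_max_open_files(512);

    DB::open(&defaults, path)
}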

I asked about this issue on https://github.com/facebook/rocksdb but I haven’t received any answer yet. What I can propose right now is to add a property to the config by which you could set max_total_wal_size. E.g., I’ve checked that a limit of 20 MB fixes this issue.

Thank you for this report. I tried to find the reason for the performance regression by profiling and got the same result. [screenshot of profiling results, 2021-02-22]

Now I’m trying to understand what has changed and how to fix it, because I have a service on exonum 0.12.2 that started 9 months ago and works fine; there is no such problem there. Maybe tuning the RocksDB settings could help fix the issue.

Here is another two-day test.

Deployment: local
Binary: exonum_cryptocurrency_advanced
Number of nodes: 2
Configuration: default

Metrics