cosmos-sdk: Excessive per-block write IO activity

Summary of Bug

Excessing write IO activity for each block.

Appears to be upwards of 40 MB written per block, making it unnecessarily hard to spin up a full node and catch up with the network.

The activity seems to be centered around data/gaia.db, excerpt from the LOG file in same directory:

3:17:10.741444 table@remove removed @67320
13:17:10.741531 table@remove removed @67321
13:17:10.741609 table@remove removed @67322
13:17:10.741687 table@remove removed @67323
13:17:10.741765 table@remove removed @67324
13:17:10.741837 table@remove removed @67325
13:17:19.045515 table@compaction L1·1 -> L2·11 S·20MiB Q·78844571
13:17:19.076806 table@build created L2@67400 N·21455 S·2MiB "s/k..08D,v37526980":"s/k..335,v68374100"
13:17:19.109740 table@build created L2@67401 N·21422 S·2MiB "s/k..062,v36987389":"s/k..53A,v36279697"
13:17:19.145881 table@build created L2@67402 N·21389 S·2MiB "s/k..7F9,v61224864":"s/k..2B3,v32492752"
13:17:19.167930 table@build created L2@67403 N·21432 S·2MiB "s/k..B73,v37833306":"s/k..57E,v78692588"
13:17:19.189592 table@build created L2@67404 N·21431 S·2MiB "s/k..C64,v22931364":"s/k..4BC,v35049316"
13:17:19.213970 table@build created L2@67405 N·21449 S·2MiB "s/k..D82,v39794490":"s/k..49E,v8828758"
13:17:19.236666 table@build created L2@67406 N·21442 S·2MiB "s/k..01A,v60100601":"s/k..602,v941815"
13:17:19.259804 table@build created L2@67407 N·21471 S·2MiB "s/k..351,v20888038":"s/k..9B0,v67797389"
13:17:19.281040 table@build created L2@67408 N·21371 S·2MiB "s/k..7D8,v21824578":"s/k..F08,v68107408"
13:17:19.302025 table@build created L2@67409 N·21377 S·2MiB "s/k..734,v51262793":"s/k..7F2,v41321596"
13:17:19.302220 table@build created L2@67410 N·5 S·613B "s/k..FCC,v6161581":"s/k..5C2,v22347125"
13:17:19.302342 version@stat F·[0 8 118] S·220MiB[0B 2MiB 218MiB] Sc·[0.00 0.02 0.22]
13:17:19.302495 table@compaction committed F-1 S-156KiB Ke·0 D·2012 T·256.835847ms
13:17:19.302592 table@remove removed @67375
13:17:19.302711 table@remove removed @67272
13:17:19.302795 table@remove removed @67251
13:17:19.302876 table@remove removed @67252
13:17:19.302956 table@remove removed @67253
13:17:19.303030 table@remove removed @67254
13:17:19.303106 table@remove removed @67255
13:17:19.303193 table@remove removed @67256
13:17:19.303277 table@remove removed @67257
13:17:19.303349 table@remove removed @67258
13:17:19.303419 table@remove removed @67259
13:17:19.303488 table@remove removed @67260
13:17:21.111785 table@compaction L1·1 -> L2·10 S·20MiB Q·78846279
13:17:21.143318 table@build created L2@67411 N·21439 S·2MiB "s/k..1F7,v10462741":"s/k..7A8,v76676418"
13:17:21.178527 table@build created L2@67412 N·21438 S·2MiB "s/k..631,v68298584":"s/k..644,v14305475"
13:17:21.229220 table@build created L2@67413 N·21410 S·2MiB "s/k..A49,v36257895":"s/k..641,v66389798"
13:17:21.253648 table@build created L2@67414 N·21381 S·2MiB "s/k..AED,v53323503":"s/k..6B9,v8168541"
13:17:21.277524 table@build created L2@67415 N·21389 S·2MiB "s/k..C69,v49211289":"s/k..846,v23921805"
13:17:21.299747 table@build created L2@67416 N·21420 S·2MiB "s/k..793,v42554802":"s/k..307,v28286350"
13:17:21.320809 table@build created L2@67417 N·21439 S·2MiB "s/k..8D9,v33792045":"s/k..434,v24947960"
13:17:21.342135 table@build created L2@67418 N·21393 S·2MiB "s/k..DC8,v66679886":"s/k..7F0,v56705919"
13:17:21.366599 table@build created L2@67419 N·21433 S·2MiB "s/k..844,v19039774":"s/k..A50,v7869655"
13:17:21.394589 table@build created L2@67420 N·21441 S·2MiB "s/k..B0A,v45510309":"s/k..D6A,v66006804"
13:17:21.394992 table@build created L2@67421 N·57 S·5KiB "s/k..0E2,v36658606":"s/k..B4E,v27887217"
13:17:21.395196 version@stat F·[0 7 119] S·220MiB[0B 2MiB 218MiB] Sc·[0.00 0.02 0.22]
13:17:21.395404 table@compaction committed F~ S-158KiB Ke·0 D·2108 T·283.487393ms
13:17:21.395535 table@remove removed @67372
13:17:21.395743 table@remove removed @67283
13:17:21.395913 table@remove removed @67284
13:17:21.396043 table@remove removed @67285
13:17:21.396136 table@remove removed @67286
13:17:21.396230 table@remove removed @67287

The compaction above seems to coincide with a new block being created, I imagine after a lot of updates to the table.

Steps to Reproduce

Participate in gaia-8001 testnet.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 28 (20 by maintainers)

Most upvoted comments

Would appreciate to keep it open until there is a proposed fix that is confirmed working.

To @mdyring’s point, the current behavior of SaveVersion is that it will recursively flush new nodes to the db, but any nodes that haven’t been changed since the last version won’t be saved. This does create a lot of non-uniform usage on block boundaries though. #1952 doesn’t change that, it’s biggest performance related change is not loading all the roots from disk when the tree is loaded, but otherwise most operations are essentially unchanged. The problem with updating the nodes as you go, is two fold. First it could lead to setting values which then have to be unset, for example if you’re using a working tree which you then roll back rather than committing. This could still be done, but it would require more work in the rollback operation, since it would have to actually delete stuff from the DB. Second problem would be the set operation needing to actually write to the disk would likely increase per-operation latency.

It’s definitely doable in theory, but I think it would end up being a question of batching vs streaming updates. Streaming updates could lead to less bursty writes on the block boundaries, but more usage overall. That could end up being better though, so it’s really hard to say without more data. I’m currently working on some benchmarks the various operations, so hopefully that can help inform this a bit more.

I just had a quick look at the iavl repo in relation to https://github.com/cosmos/cosmos-sdk/issues/2169.

I have no intimate knowledge of the codebase. With this disclaimer, it appears to me that if SaveVersion() is invoked it will flush the entire tree to leveldb.

Wouldn’t it be more efficient/natural to just update the underlying database as the changes are made at the node level? So, basically, just doing incremental updates to the DB when necessary instead of flushing everything on block boundaries.

This might make versioning a harder, but perhaps leveldb snapshots could be used here.

Anyway, would be good to get some better Prometheus metrics first to determine what is causing it.