solana: Ledger purging causes severe stall, leaving uncleaned bloated snapshots even across reboots
Our tds validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN got stuck for about 40 minutes, from slot 56925460 to 56925591:
I0105 10:37:19.471441 1 slots.go:116] confirmed slot 56925459 (offset 241203, +3), epoch 144 (from slot 56684256 to 57116256, 190797 remaining)
I0105 10:37:20.467172 1 slots.go:116] confirmed slot 56925460 (offset 241204, +1), epoch 144 (from slot 56684256 to 57116256, 190796 remaining)
I0105 10:37:21.470273 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:22.477470 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:23.470636 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:24.467110 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:25.466756 1 slots.go:112] slot has not advanced at 56925460, skipping
[rip]
I0105 11:14:05.472919 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:06.466652 1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:07.466721 1 slots.go:116] confirmed slot 56925591 (offset 241335, +131), epoch 144 (from slot 56684256 to 57116256, 190665 remaining)
I0105 11:14:08.467120 1 slots.go:116] confirmed slot 56925597 (offset 241341, +6), epoch 144 (from slot 56684256 to 57116256, 190659 remaining)
I0105 11:14:09.466760 1 slots.go:116] confirmed slot 56925605 (offset 241349, +8), epoch 144 (from slot 56684256 to 57116256, 190651 remaining)
Validator node log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN.log

And from our on-chain monitoring:

Looks like something just locked up completely? Suspiciously similar to #14410, except it’s not 1.5.0.
Yes. The issue seems to be that the broadcast shred insert was stuck for a long while, which then blocked the replay insert by way of the blockstore insert lock, preventing the node from processing any new blocks. I think we should continue the discussion on the open issue #14586
Everything was stalled in this case because the broadcast insert_shred was holding the blockstore lock the whole time, so new shreds could not be inserted into the blockstore. Accounts had nothing to do because there were no new shreds to consume. So I don't think it was actually stalled; it just had nothing to clean.
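For illustration, here is a minimal, self-contained Rust sketch of the contention pattern described above: one thread holds a shared blockstore lock during a slow insert while the replay path blocks waiting for it. The `Blockstore` struct and the thread names are hypothetical stand-ins, not Solana's actual types.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a blockstore guarded by a single insert lock.
struct Blockstore {
    shreds: Vec<u64>,
}

fn main() {
    let blockstore = Arc::new(Mutex::new(Blockstore { shreds: Vec::new() }));

    // "Broadcast" path: takes the insert lock and holds it while a slow
    // operation runs (simulated with a sleep), mimicking a purge/compaction
    // stall happening inside the same critical section.
    let b = Arc::clone(&blockstore);
    let broadcast = thread::spawn(move || {
        let mut guard = b.lock().unwrap();
        thread::sleep(Duration::from_secs(5)); // slow work while holding the lock
        guard.shreds.push(1);
    });

    // Give the broadcast thread time to grab the lock first.
    thread::sleep(Duration::from_millis(100));

    // "Replay" path: cannot insert new shreds until the lock is released,
    // so no progress is made on new blocks in the meantime.
    let r = Arc::clone(&blockstore);
    let replay = thread::spawn(move || {
        let start = Instant::now();
        let mut guard = r.lock().unwrap(); // blocks for roughly 5 seconds
        guard.shreds.push(2);
        println!("replay waited {:?} for the insert lock", start.elapsed());
    });

    broadcast.join().unwrap();
    replay.join().unwrap();
}
```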
In our case, disk wasn’t a bottleneck and iowait remained close to zero.
I think this is hardware-configuration dependent, or a difference in transaction patterns?
@sakridge oh, notice that tds's AppendVecs are usually completely reclaimed, unlike mainnet. That's because there is no kin migration there… if there are no alive accounts at all for a given recent slot, the validator doesn't need shrinking to prevent accounts dir bloat. Hard system interactions. xD
https://docs.rs/rocksdb/0.15.0/rocksdb/struct.Env.html#method.set_high_priority_background_threads
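For context, that knob gives flushes their own high-priority background thread pool so they are not starved behind long-running compactions. A minimal sketch against the rocksdb crate around that version might look like the following; the thread counts and path are purely illustrative, and exact constructor names differ slightly across crate versions:

```rust
use rocksdb::{DB, Env, Options};

fn main() -> Result<(), rocksdb::Error> {
    // Separate high-priority pool so flushes are not queued behind
    // long-running compactions; the counts here are illustrative only.
    let mut env = Env::default()?;
    env.set_high_priority_background_threads(2); // flush pool
    env.set_background_threads(4);               // compaction pool

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_env(&env);

    // Placeholder path, not a real ledger location.
    let db = DB::open(&opts, "/tmp/rocksdb-env-demo")?;
    db.put(b"key", b"value")?;
    Ok(())
}
```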
I see. Hmm, then I don't have an immediate clue, given there are no metrics for you. Maybe your rocksdb is well under --limit-ledger-size?
Yeah, increasing the interval should improve performance. I don't think an increased interval relates to this bug's frequency, though. (I know, I have to work on your nice proposal #14216; so many fires…)
The problem is four-fold, like a cancer…:
- STAGE I: It first causes a system stall, as originally reported here.
- STAGE II: Then it stops the accounts background cleaning service, creating a huge accounts dir.
- STAGE III: Then a bloated snapshot can end up being created under such high load, including (uncleaned) unshrunken AppendVecs…
- STAGE IV: Finally, this leaves some validators stuck unless they remove their ledger dir altogether, because just restarting recreates a 300G accounts dir even if they rm -rf accounts manually…
@leoluk Thanks for the detailed reporting. Ok, I've spotted the root cause:
The problem is that rocksdb's ledger pruning (including compaction) is taking so much time and resources:
These are your validator's metrics with a slightly adjusted timescale (in UTC):
https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&var-datasource=Solana Metrics (read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610130868735&to=1610152010663&panelId=70
(My customize metrics view: https://metrics.solana.com:3000/d/V5LPmn_Zk/testnet-monitor-edge-ryoqun?orgId=2&var-datasource=Solana Metrics (read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN&from=1610130868735&to=1610152010663)
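As a side note for anyone diagnosing this without access to the dashboards: RocksDB exposes compaction pressure directly through built-in DB properties. Below is a rough sketch using the rocksdb crate; the path is a placeholder, and a real Solana ledger DB has multiple column families, so it would need to be opened with open_cf_for_read_only and the proper column-family names.

```rust
use rocksdb::{DB, Options};

fn main() -> Result<(), rocksdb::Error> {
    // Placeholder path; a real ledger DB would also need its column families
    // listed via DB::open_cf_for_read_only.
    let db = DB::open_for_read_only(&Options::default(), "/tmp/rocksdb-env-demo", false)?;

    // Built-in RocksDB properties that expose compaction pressure.
    for prop in &[
        "rocksdb.compaction-pending",
        "rocksdb.num-running-compactions",
        "rocksdb.estimate-pending-compaction-bytes",
    ] {
        println!("{} = {:?}", prop, db.property_value(*prop)?);
    }
    Ok(())
}
```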