solana: Ledger purging causes severe stall, leaving uncleaned bloated snapshots even across reboots

Our tds validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN got stuck for about 40 minutes, from slot 56925460 to 56925591:

I0105 10:37:19.471441       1 slots.go:116] confirmed slot 56925459 (offset 241203, +3), epoch 144 (from slot 56684256 to 57116256, 190797 remaining)
I0105 10:37:20.467172       1 slots.go:116] confirmed slot 56925460 (offset 241204, +1), epoch 144 (from slot 56684256 to 57116256, 190796 remaining)
I0105 10:37:21.470273       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:22.477470       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:23.470636       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:24.467110       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 10:37:25.466756       1 slots.go:112] slot has not advanced at 56925460, skipping
[rip]
I0105 11:14:05.472919       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:06.466652       1 slots.go:112] slot has not advanced at 56925460, skipping
I0105 11:14:07.466721       1 slots.go:116] confirmed slot 56925591 (offset 241335, +131), epoch 144 (from slot 56684256 to 57116256, 190665 remaining)
I0105 11:14:08.467120       1 slots.go:116] confirmed slot 56925597 (offset 241341, +6), epoch 144 (from slot 56684256 to 57116256, 190659 remaining)
I0105 11:14:09.466760       1 slots.go:116] confirmed slot 56925605 (offset 241349, +8), epoch 144 (from slot 56684256 to 57116256, 190651 remaining)

Validator node log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN.log

https://metrics.solana.com:3000/d/monitor/cluster-telemetry?orgId=2&from=1609841617832&to=1609846954674&var-datasource=Solana Metrics (read-only)&var-testnet=tds&var-hostid=Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN

[screenshot]

And from our on-chain monitoring:

[screenshot]

Looks like something just locked up completely? Suspiciously similar to #14410, except it’s not 1.5.0.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Log: http://logbin.certus.one:8080/Qui9iehu/solana-github-14434/Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN-2.log

I looked at this:

[2021-01-08T21:50:14.438905685Z INFO  solana_metrics::metrics] datapoint: replay-loop-timing-stats total_elapsed_us=2413089000i compute_bank_stats_elapsed=6052i select_vote_and_reset_forks_elapsed=12i start_leader_elapsed=10894i reset_bank_elapsed=3534i voting_elapsed=2412471272i select_forks_elapsed=1i compute_slot_stats_elapsed=12i generate_new_bank_forks_elapsed=1486i replay_active_banks_elapsed=9319i re>


[2021-01-08T21:50:14.591364336Z INFO  solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=2413407i slot=57486550i last_index=396i
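(For scale, reading the two datapoints above: total_elapsed_us=2413089000 is roughly 2413 s ≈ 40 minutes, with voting_elapsed=2412471272 accounting for virtually all of it, and the shred_insert_is_full datapoint reports about the same duration, total_time_ms=2413407 ≈ 40 minutes.)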

Looks bad? @sakridge @carllin

Yes. The issue seems to be that the broadcast shred insert was stuck for a long while, which then blocked the replay insert by way of the blockstore insert lock, preventing the node from processing any new blocks. I think we should continue the discussion on the open issue #14586.
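To picture the interaction being described, here is a minimal sketch in which a plain std::sync::Mutex stands in for the blockstore insert lock; the type and method names are illustrative only, not Solana's actual code:

    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::{Duration, Instant};

    struct Blockstore; // stand-in for the real blockstore

    impl Blockstore {
        fn insert_shreds(&self) {
            // Stand-in for the broadcast-side insert that got stuck (~40 min in the logs).
            thread::sleep(Duration::from_secs(3));
        }
    }

    fn main() {
        let blockstore = Arc::new(Mutex::new(Blockstore));

        // "Broadcast" thread: takes the insert lock and holds it for the whole insert.
        let b = Arc::clone(&blockstore);
        let broadcast = thread::spawn(move || {
            let guard = b.lock().unwrap();
            guard.insert_shreds();
        });

        thread::sleep(Duration::from_millis(100)); // make sure broadcast wins the lock

        // "Replay" side: wants to insert newly received shreds but simply blocks here,
        // so no new blocks get processed until broadcast releases the lock.
        let start = Instant::now();
        let _guard = blockstore.lock().unwrap();
        println!("replay waited {:?} for the blockstore lock", start.elapsed());

        broadcast.join().unwrap();
    }

In the incident above, the broadcast-side insert apparently held the lock for the whole ~40-minute window, so the replay side was stuck at exactly this kind of lock acquisition.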

Why does it stall the cleaning service? Is the system out of IOPS?

Yeah, this is the odd part; I’m guessing it’s out of IOPS or something like that as well. Or, as your priority PR suggests, could ABS be stalled by the already-stalled rocksdb, possibly via the banking/replay stage? (This issue seems to stall everything.)

Everything was stalled in this case because the broadcast insert_shred was holding the blockstore lock the whole time, so new shreds could not be inserted into the blockstore. Accounts had nothing to do because there were no new shreds to consume. So I don’t think it was actually stalled; it just had nothing to clean.
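A small sketch of that “nothing to clean” point, assuming a cleaner driven by newly rooted slots; this is not Solana’s AccountsBackgroundService, just the shape of the argument:

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let (root_tx, root_rx) = mpsc::channel::<u64>();

        // Cleaner loop: it only has work when a newly rooted slot arrives.
        let cleaner = thread::spawn(move || loop {
            match root_rx.recv_timeout(Duration::from_millis(200)) {
                Ok(slot) => println!("cleaning accounts up to root {}", slot),
                Err(mpsc::RecvTimeoutError::Timeout) => println!("no new roots; nothing to clean"),
                Err(mpsc::RecvTimeoutError::Disconnected) => break,
            }
        });

        root_tx.send(100).unwrap();
        // If replay is blocked on the blockstore lock, no further roots are produced,
        // so the cleaner idles here rather than being stuck itself.
        thread::sleep(Duration::from_secs(1));
        drop(root_tx);
        cleaner.join().unwrap();
    }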

In our case, disk wasn’t a bottleneck and iowait remained close to zero.

[screenshot]

… we are not seeing these huge accounts directories there.

I think this is hardware-configuration dependent, or a difference in transaction patterns?

@sakridge oh, notice that tds’s AppendVecs are usually completely reclaimed, unlike mainnet. That’s because there is no kin migration there… if there are no live accounts at all for a given recent slot, the validator doesn’t need shrinking to prevent accounts dir bloat. Hard system interactions. xD
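A conceptual sketch of the reclaim-vs-shrink distinction being made here; the struct, the names, and the 10% threshold are made up for illustration and are not Solana’s implementation:

    #[derive(Debug)]
    enum Action {
        DropStorage, // every account in the slot is dead: the whole AppendVec is reclaimed
        Shrink,      // few survivors: copy live accounts into a smaller AppendVec
        Keep,        // mostly alive: leave the storage as-is
    }

    struct SlotStorage {
        live_accounts: usize,
        dead_accounts: usize,
    }

    fn decide(s: &SlotStorage) -> Action {
        let total = s.live_accounts + s.dead_accounts;
        if s.live_accounts == 0 {
            Action::DropStorage
        } else if s.live_accounts * 10 < total {
            Action::Shrink // illustrative "less than 10% alive" threshold
        } else {
            Action::Keep
        }
    }

    fn main() {
        let tds_like = SlotStorage { live_accounts: 0, dead_accounts: 5_000 };
        let mainnet_like = SlotStorage { live_accounts: 40, dead_accounts: 5_000 };
        println!("tds-like slot: {:?}", decide(&tds_like));       // DropStorage
        println!("mainnet-like slot: {:?}", decide(&mainnet_like)); // Shrink
    }

On a tds-like slot with no live accounts, the whole storage can simply be dropped, which is why the accounts dir stays small there without any shrinking.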

@ryoqun I don’t have enough storage for --no-rocksdb-compaction 😉

I see. Hmm, then I don’t have an immediate clue as to why there are no metrics for you. Maybe your rocksdb is well under --limit-ledger-size?

This may be unrelated, but I did slow down the snapshot frequency with --snapshot-interval-slots 500 and that helped performance in general.

Yeah, increasing the interval should improve performance. I don’t think the increased interval relates to this bug’s frequency, though. (I know, I have to work on your nice proposal #14216; so many fires…)

The problem is four-fold, like a cancer…:

STAGE-I: It causes a system stall first, as originally reported here.
STAGE-II: Then it stops the accounts background cleaning service (ABS), creating a huge accounts dir.
STAGE-III: Then a bloated snapshot ends up being created under that high load, including the (uncleaned) unshrunken AppendVecs…
STAGE-IV: Finally, this leaves some validators stuck unless they remove their ledger dir altogether, because just restarting rebuilds a 300G accounts dir from that bloated snapshot even if they rm -rf accounts manually…