solana: RPC: Blocks go missing

Problem

We’ve noticed on several occasions that blocks go missing from RPC. This typically happens when the RPC node becomes overwhelmed (not necessarily crashing, but becoming unresponsive for some time), or when there’s an abrupt reboot or a similar event.

In these cases, minimumLedgerSlot will report an old value, but there will be holes in the confirmed blocks between minimumLedgerSlot and the current slot. The RPC service will report that “Slot xxx was skipped, or missing due to ledger jump to recent snapshot”. However, this occurs even when we haven’t pulled a snapshot. Also, if a snapshot had been pulled, wouldn’t we expect minimumLedgerSlot to be updated accordingly?
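For reference, the reported minimum can be checked directly with the minimumLedgerSlot RPC call (a sketch; hostname reused from the samples below):

$ curl http://hk1.rpcpool.com:8899 -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"minimumLedgerSlot"}'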

We’ve typically seen between 300 and 900 blocks missing. The most common count (anecdotally) seems to be around 300 blocks.

Sample output from a node that was missing blocks:

$ curl http://hk1.rpcpool.com:8899 -X POST  -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75140922, {"encoding": "json","transactionDetails":"full","rewards":false}]}'
{"jsonrpc":"2.0","error":{"code":-32007,"message":"Slot 75140922 was skipped, or missing due to ledger jump to recent snapshot"},"id":1}

However, the slot clearly exists in the ledger directory, as solana-ledger-tool can access it:

solana@hk1:~$ solana-ledger-tool -l /solana/ledger slot 75140922 >75140922.bak.slot
[2021-04-25T19:28:45.836799091Z INFO  solana_ledger_tool] solana-ledger-tool 1.6.6 (src:fe775a97; feat:3714435735)
[2021-04-25T19:28:45.836851189Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 750000
[2021-04-25T19:28:45.836867639Z INFO  solana_ledger::blockstore] Opening database at "/solana/ledger/rocksdb"
[2021-04-25T19:28:45.862167110Z WARN  solana_ledger::blockstore_db] Error when opening as primary: IO error: While lock file: /solana/ledger/rocksdb/LOCK: Resource temporarily unavailable
[2021-04-25T19:28:45.862181937Z WARN  solana_ledger::blockstore_db] Trying as secondary at : "/solana/ledger/rocksdb/solana-secondary"
[2021-04-25T19:28:45.862189672Z WARN  solana_ledger::blockstore_db] This active secondary db use may temporarily cause the performance of another db use (like by validator) to degrade
[2021-04-25T19:29:08.644763501Z INFO  solana_ledger::blockstore] "/solana/ledger/rocksdb" open took 22.8s

solana-ledger-tool-75140922.slot.txt

We have one node where we have backed up the broken ledger, which has about 900 missing slots. Blocks were downloaded with getConfirmedBlocks between the epoch start and the current slot in the epoch (175 at the time). minimumLedgerSlot was in epoch 174 for all nodes at the time, so every node should have had a contiguous history. All nodes agreed except one (in this case fra1); a sketch of how the .slots files were generated follows the diff below.

$ diff hk1-175.slots sg1-175.slots | wc -l
0
$ diff fra1-175.slots sg1-175.slots | wc -l
926
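
For reference, the per-node .slots files compared above can be generated with a single getConfirmedBlocks call per node (a sketch; the epoch boundary values are placeholders and jq is assumed to be available):

# Dump each node's confirmed block list for the epoch range, one slot per line.
EPOCH_START=75600000   # placeholder for the first slot of epoch 175
CURRENT_SLOT=75650000  # placeholder for the then-current slot
for host in hk1 sg1 fra1; do
  curl -s "http://${host}.rpcpool.com:8899" -X POST -H "Content-Type: application/json" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"getConfirmedBlocks\",\"params\":[${EPOCH_START},${CURRENT_SLOT}]}" \
    | jq -r '.result[]' > "${host}-175.slots"
done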

When we request any of the 926 missing blocks, we get the following error from the RPC service:

$ curl fra1.rpcpool.com:8899 -X POST -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75634639, {"encoding": "json","transactionDetails":"full","rewards":false}]}'
{"jsonrpc":"2.0","error":{"code":-32007,"message":"Slot 75634639 was skipped, or missing due to ledger jump to recent snapshot"},"id":1}

On a node without missing blocks we get the data we expect:

$ curl ams1.rpcpool.com:8899 -X POST -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75634639, {"encoding": "json","transactionDetails":"full","rewards":false}]}'

ams1-75634639.json.txt

Looking at the broken node, we see that the data is present in the ledger:

$ solana-ledger-tool  slot 75634639 > fra1-75634639.txt

fra1-75634639.txt

This pattern repeats for all “missing blocks”. The issue persists, and the “hole” in the history will continue to exist until minimumLedgerSlot advances beyond it. We can currently identify the problem by calling getConfirmedBlocks and comparing the result against other nodes: if the total slots stored (currentSlot - minimumLedgerSlot) are comparable on two nodes, we can check getConfirmedBlocks to see whether one of them is missing blocks.
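A minimal check along those lines (a sketch; hostnames reused from above, jq assumed):

# Compare the stored-slot span on two nodes; if the spans are comparable but
# their getConfirmedBlocks output differs (as in the diff above), one node has a hole.
for host in fra1 ams1; do
  min=$(curl -s "http://${host}.rpcpool.com:8899" -X POST -H "Content-Type: application/json" \
        -d '{"jsonrpc":"2.0","id":1,"method":"minimumLedgerSlot"}' | jq '.result')
  cur=$(curl -s "http://${host}.rpcpool.com:8899" -X POST -H "Content-Type: application/json" \
        -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq '.result')
  echo "${host}: minimumLedgerSlot=${min} currentSlot=${cur} span=$((cur - min))"
done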

Proposed solution

A temporary fix would be for the RPC node to fall back to BigTable when it cannot find a requested block in the current ledger. Currently, BigTable seems to be consulted only when minimumLedgerSlot is higher than the requested slot, which means these ledger holes still produce missing-slot errors.
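To illustrate the current boundary (the first slot number below is a placeholder): a request below minimumLedgerSlot is answered from BigTable, while a request inside a hole above it fails with -32007 and never reaches BigTable:

# Below minimumLedgerSlot: falls through to BigTable, returns block data.
$ curl http://fra1.rpcpool.com:8899 -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getConfirmedBlock","params":[70000000,{"encoding":"json","transactionDetails":"none","rewards":false}]}'
# Inside a hole above minimumLedgerSlot: -32007, no BigTable fallback today.
$ curl http://fra1.rpcpool.com:8899 -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getConfirmedBlock","params":[75634639,{"encoding":"json","transactionDetails":"none","rewards":false}]}'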

Another option would be a way to easily correct the issue, i.e. a “repair” operation that would fix a broken ledger and make those blocks visible again. If we had such a tool, we could monitor for holes, take the affected node out of the pool, and repair it.
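As a sketch of that workflow (the repair-roots subcommand name and its arguments are an assumption based on the repair tool referenced in the comments below; verify with solana-ledger-tool --help for your version):

# Take the node out of the pool and stop the validator first.
systemctl stop solana-validator          # or however the node is managed
# Repair the local ledger so the orphaned blocks become visible again
# (subcommand/flags are assumptions; check `solana-ledger-tool --help`).
solana-ledger-tool -l /solana/ledger repair-roots
# Restart and return the node to the pool.
systemctl start solana-validator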

This issue has also cropped up for app developers; I have heard reports of people having problems with the “Slot xxx was skipped or missing” error message. Ideally, an RPC node configured with BigTable should never return this result.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Thanks to @carllin (https://github.com/solana-labs/solana/pull/17402#issuecomment-846379439), I was just able to deliberately reproduce this issue on a mnb node! And also to confirm that #17402 prevents the issue.

@CriesofCarrots Stopping the node and running this wouldn’t be the end of the world as a workaround. I think we have a node with this problem right now, so I’m going to check if this helps us fix it.

@linuskendall, did you have a chance to try this?

Can confirm that the repair tool works for us on a production node: stop the node, run the repair tool, and then restart. Afterwards, blocks are accessible via RPC again. So solution 2 is available to us now.

@ryoqun There have definitely been occasions when we have seen this after an OOM or a crash. However, we’ve also seen it when the machine didn’t crash but started swapping (so it was extremely slow) and we had to go in and stop the solana process; after this, the missing blocks would appear. We also saw it once when we did multiple reboots of a machine in short succession, so that the solana service started/stopped repeatedly (but not uncleanly).

We don’t have an exact pattern of events/activities that triggers this issue. It definitely happens at times when the machine is under heavy load, performing poorly, and/or crashing.

@CriesofCarrots - can you take this please