solana: solana-ledger-tool errors with `Unable to load bank forks at slot 0 due to disconnected blocks.`

Problem

It seems that net.sh is broken when spinning up GCE clusters. The commands used:

    ./gce.sh create -n 3 -c 0 -p testnet-dev-kin-haoran -P --dedicated  --validator-boot-disk-size-gb 3600 --self-destruct-hours 0 -z us-east1-b --custom-machine-type "--custom-cpu 64 --min-cpu-platform Intel%20Skylake --custom-vm-type n1 --custom-memory 256GB"

    ./net.sh start --internal-nodes-stake-lamports 1000000000000 --extra-primordial-stakes 3 --faucet-lamports 500000000000000000 --slots-per-epoch 432000

After the ledger is created on the bootstrap node, ledger-tool fails to extract the bank hash from the slot 1 snapshot, so the cluster can’t be brought up.

    ++ solana-ledger-tool -l config/bootstrap-validator bank-hash
    [2023-01-20T02:34:52.423038671Z INFO  solana_ledger_tool] solana-ledger-tool 1.15.0 (src:devbuild; feat:2221197578)
    [2023-01-20T02:34:52.423884939Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 1000000
    [2023-01-20T02:34:52.423901857Z INFO  solana_ledger::blockstore] Opening database at "/home/solana/solana/config/bootstrap-validator/rocksdb"
    [2023-01-20T02:34:52.423913781Z INFO  solana_ledger::blockstore_db] Disabling rocksdb's automatic compactions...
    [2023-01-20T02:34:52.429892189Z INFO  solana_ledger::blockstore_db] Opening Rocks with secondary (read only) access at: "/home/solana/solana/config/bootstrap-validator/rocksdb/solana-secondary"
    [2023-01-20T02:34:52.429906849Z INFO  solana_ledger::blockstore_db] This secondary access could temporarily degrade other accesses, such as by solana-validator
    [2023-01-20T02:34:52.446842048Z INFO  solana_ledger::blockstore] "/home/solana/solana/config/bootstrap-validator/rocksdb" open took 22ms
    Unable to load bank forks at slot 0 due to disconnected blocks.
    + bankHash=
    haoran_yi_solana_com@testnet-dev-kin-haoran-bootstrap-validator:~$  ls /home/solana/solana/config/bootstrap-validator/
    accounts.ledger-tool  genesis.bin  genesis.tar.bz2  identity.json  rocksdb  snapshot-1-WiowSLurQoZrBXfmwyBb84k1mbFh3sHBLq41Anrx5t4.tar.zst  snapshot.ledger-tool  stake-account.json  vote-account.json

Proposed Solution

Debug and fix the issue with ledger-tool for loading slot 0.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

@apfitzge @HaoranYi - Since all of us have PRs in flight, here is a rundown for the sake of coordination. For reference, this is the piece of code that these PRs touch: https://github.com/solana-labs/solana/blob/a3c763c2a0ee430feaa5b04a5a02b8100487802e/ledger-tool/src/main.rs#L1068-L1075

  1. https://github.com/solana-labs/solana/pull/29868: Ensures that halt_slot > starting_slot before calling the Blockstore function, and gives a more detailed error message in this scenario
  2. https://github.com/solana-labs/solana/pull/29870: Add --halt-at-slot to shred-version subcommand
  3. https://github.com/solana-labs/solana/pull/29865: Add --halt-at-slot to bank-hash subcommand
  4. https://github.com/solana-labs/solana/pull/29860: Skip connected check for halt_slot == 0
  5. https://github.com/solana-labs/solana/pull/29873: Update net script to use item 3 from this list

The order of operations on these:

  • 1 and 3 can go in immediately once CI passes
  • 2 has already been closed per discussion in the PR
  • 4 can go in (with slight modification in light of 2) after 1 (expecting a merge conflict).
    • If halt_at_slot was specified and halt_slot == 0, skip both the existing check and the check that 1 adds
    • If halt_slot != 0, perform the halt_slot > starting_slot check added by 1, followed by the existing blockstore.slot_range_connected(starting_slot, halt_slot) check
  • 5 can go in once 3 has
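
To make the intended ordering of the checks in item 4 concrete, here is a minimal sketch in Rust. It is not the actual ledger-tool code: `check_halt_slot`, `HaltCheckError`, the `ConnectedRange` trait, and `FixedAnswer` are hypothetical names made up for illustration, and the trait only stands in for the `blockstore.slot_range_connected(starting_slot, halt_slot)` call mentioned above, with an assumed signature.

```rust
/// Slots are plain u64 values in this sketch.
type Slot = u64;

/// Hypothetical error type; the real tool prints messages instead.
#[derive(Debug, PartialEq)]
enum HaltCheckError {
    /// halt_slot is not strictly greater than starting_slot (check added by PR 1).
    HaltNotAfterStart { starting_slot: Slot, halt_slot: Slot },
    /// The blocks between starting_slot and halt_slot are not connected.
    Disconnected { starting_slot: Slot, halt_slot: Slot },
}

/// Hypothetical stand-in for the blockstore connectivity query.
trait ConnectedRange {
    fn slot_range_connected(&self, starting_slot: Slot, halt_slot: Slot) -> bool;
}

/// Test double that always returns the same answer.
struct FixedAnswer(bool);

impl ConnectedRange for FixedAnswer {
    fn slot_range_connected(&self, _starting_slot: Slot, _halt_slot: Slot) -> bool {
        self.0
    }
}

/// Applies the checks in the order described in the list above.
fn check_halt_slot<B: ConnectedRange>(
    blockstore: &B,
    starting_slot: Slot,
    halt_slot: Slot,
) -> Result<(), HaltCheckError> {
    // halt_slot == 0: skip both checks (the special case from PR 4).
    if halt_slot == 0 {
        return Ok(());
    }
    // First the ordering check added by PR 1.
    if halt_slot <= starting_slot {
        return Err(HaltCheckError::HaltNotAfterStart { starting_slot, halt_slot });
    }
    // Then the existing connectivity check.
    if !blockstore.slot_range_connected(starting_slot, halt_slot) {
        return Err(HaltCheckError::Disconnected { starting_slot, halt_slot });
    }
    Ok(())
}
```

With this ordering, a call such as `check_halt_slot(&FixedAnswer(false), 1, 0)` succeeds even though the fake blockstore reports the range as disconnected, which mirrors the case of a cluster starting from a slot 1 snapshot right after genesis.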

It looks like this assert was introduced in #26506.

Should we special-case the connected check for slot 0, so that we can run GCE clusters, which start from a snapshot at slot 1 immediately after genesis? https://github.com/solana-labs/solana/pull/29860

@apfitzge and @steviez