solana: accounts_db indexes taking too long to be created

Problem

Nodes started with the --account-index argument take too long to create the accounts_db indexes. n2-standard-32 (32 vCPUs, 128 GB memory) servers take ~4 hours. n2-standard-64 (64 vCPUs, 256 GB memory) servers take from 30 minutes to 1 hour.

Here are the logs from a fresh n2-standard-32 setup:

solana-validator.log LOG.txt

It’s been up for more than 2 hours now and it’s still at 100037/381300.

The problem with indexing taking so long is that the nodes fall up to 100K slots behind, so it takes them days to catch up.

Marco has reported the same issue with the bare-metal servers. It takes them 1 hour to restart their 128 GB RAM + 872 GB PDIMM dual Xeon Gold 24-core servers.

Let me know if you need more logs.

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 26 (26 by maintainers)

Most upvoted comments

I will begin this prototype Wednesday morning. I talked with @carllin, @steviez, and @brooksprumo.

This is the expensive key I think:

[2021-05-05T18:23:19.450413579Z INFO  solana_runtime::accounts_index] spl_token_mint_index: kinXdEcpDQeHPEuQnqmUgtYykqKGVFq6CeVX5iAHJq6 (
        60789809,
        60818121,
    )

I talked with @sakridge. I will first do async index generation, then a RocksDB secondary index. Steven will investigate the customer use cases.

@jeffwashington I think ultimately we should experiment with offloading these secondary indexes (essentially the SecondaryIndex structure, which is essentially a HashMap<Pubkey, HashSet<Pubkey>>) to disk. This generally means something like a disk-based B-tree or LSM tree.
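To make the shape of that structure concrete, here is a minimal sketch of an in-memory secondary index built on HashMap<Pubkey, HashSet<Pubkey>>. It is not the code in runtime/src/secondary_index.rs: the `Pubkey` alias, the `InMemorySecondaryIndex` name, and its methods are all hypothetical, chosen only to keep the example self-contained.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for Solana's 32-byte `Pubkey` (assumption, to avoid external crates).
type Pubkey = [u8; 32];

/// Hypothetical in-memory secondary index: an outer key (for example a mint)
/// maps to the set of account keys that reference it.
#[derive(Default)]
struct InMemorySecondaryIndex {
    index: HashMap<Pubkey, HashSet<Pubkey>>,
}

impl InMemorySecondaryIndex {
    /// Record that `account` belongs under `key`. Duplicate inserts are no-ops.
    fn insert(&mut self, key: Pubkey, account: Pubkey) {
        self.index.entry(key).or_default().insert(account);
    }

    /// Remove `account` from `key`, dropping the outer entry once it is empty
    /// so that unused keys do not keep holding memory.
    fn remove(&mut self, key: &Pubkey, account: &Pubkey) {
        let now_empty = match self.index.get_mut(key) {
            Some(accounts) => {
                accounts.remove(account);
                accounts.is_empty()
            }
            None => false,
        };
        if now_empty {
            self.index.remove(key);
        }
    }

    /// Return every account recorded under `key` (unordered).
    fn get(&self, key: &Pubkey) -> Vec<Pubkey> {
        self.index
            .get(key)
            .map(|accounts| accounts.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

Every key and every account set in this layout lives in RAM for the lifetime of the process, which is the cost the disk-based proposal is meant to avoid.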

An experiment I think is worth trying is moving these SecondaryIndexes to RocksDB, where the key for an account in the mint index would look something like (Mint Pubkey, Account Pubkey), and we can run prefix searches to find all the accounts for a given mint.

Pros I can think of:

  1. Offloads unused keys to disk
  2. Heavy write path is optimized since all deletions/additions happen in memory, so it’s fast, and immediately observable.
  3. Memtables and SST files allow concurrent read/write b/c they are based on a concurrent skiplist (a robust implementation of which is currently unavailable in Rust), so we should be able to support RPC scan through the secondary index without blocking writes from replay, where DashMap currently blocks large chunks of accounts from updating while scanning over the shards in the DashMap here: https://github.com/solana-labs/solana/blob/master/runtime/src/secondary_index.rs#L66-L68.
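The (Mint Pubkey, Account Pubkey) key layout and the prefix search can be sketched without RocksDB itself. The example below uses a std BTreeSet purely as a stand-in for an ordered, byte-sorted keyspace. It does not call any RocksDB API, and `Pubkey`, `composite_key`, and `OrderedMintIndex` are hypothetical names. It assumes keys are compared as raw bytes, so all entries for one mint are contiguous.

```rust
use std::collections::BTreeSet;

/// Stand-in for Solana's 32-byte `Pubkey` (assumption, to avoid external crates).
type Pubkey = [u8; 32];

/// Build the composite key: 32 bytes of mint followed by 32 bytes of account.
fn composite_key(mint: &Pubkey, account: &Pubkey) -> [u8; 64] {
    let mut key = [0u8; 64];
    key[..32].copy_from_slice(mint);
    key[32..].copy_from_slice(account);
    key
}

/// Hypothetical ordered index; a BTreeSet stands in for the on-disk sorted store.
#[derive(Default)]
struct OrderedMintIndex {
    keys: BTreeSet<[u8; 64]>,
}

impl OrderedMintIndex {
    fn insert(&mut self, mint: &Pubkey, account: &Pubkey) {
        self.keys.insert(composite_key(mint, account));
    }

    fn remove(&mut self, mint: &Pubkey, account: &Pubkey) {
        self.keys.remove(&composite_key(mint, account));
    }

    /// Prefix search: scan the contiguous key range that starts with `mint`
    /// and return the account half of each key, in byte order.
    fn accounts_for_mint(&self, mint: &Pubkey) -> Vec<Pubkey> {
        let start = composite_key(mint, &[0x00u8; 32]);
        let end = composite_key(mint, &[0xffu8; 32]);
        self.keys
            .range(start..=end)
            .map(|key| {
                let mut account = [0u8; 32];
                account.copy_from_slice(&key[32..]);
                account
            })
            .collect()
    }
}
```

Because the mint is the key prefix, looking up one mint touches only that mint's range, and entries for mints nobody queries never need to be resident in memory in a disk-backed store.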

Yes, it’s sucking up all the RAM.