solana: Hash mismatch on v1.5 in accounts simulation

Problem

Bootstrap leader seems to have encountered an erroneous hash, causing it to fork off during the accounts migration tests

[2021-01-29T01:06:13.269096499Z INFO  solana_runtime::bank]
bank frozen: 181896 hash: 312MfFWZ9hpLWjaxbxZAAXiKdBE6dMwcHi3rghaPDN3F accounts_delta: GNBLwxAitac48gyceVwsdUUYNQYJV4AFaCKba7g7Rwjt signature_count: 10 last_blockhash: HFXB6GLYKH32nAhXeHUdVuqx5eCgFNYDJNx7wZiyAnb9 capitalization: 5500954163744417169

[2021-01-29T01:06:13.192369211Z INFO  solana_runtime::bank]
bank frozen: 181896 hash: Cq9w8kKAQsMe9YCwz75j5m2UzjEA5HEbKknQgNXLnbmq accounts_delta: A6Q4wAxkYcxo6eK3fe9q6FUfgSJ9nUwijVf2W7Fv1zRJ signature_count: 10 last_blockhash: HFXB6GLYKH32nAhXeHUdVuqx5eCgFNYDJNx7wZiyAnb9 capitalization: 5500954163744417169
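For context, the frozen bank hash is derived from the parent bank hash, the accounts delta hash, the signature count, and the last blockhash, so two nodes that agree on last_blockhash and capitalization but compute different accounts_delta hashes (as in the two logs above) will freeze the bank with different hashes and fork. Below is a rough sketch of that derivation using solana_sdk's hashv; the exact field set and ordering in Bank::hash_internal_state on v1.5 may differ:

use solana_sdk::hash::{hashv, Hash};

// Rough sketch only: mirrors the general shape of Bank::hash_internal_state,
// not the exact v1.5 implementation. A differing accounts_delta_hash alone is
// enough to make the resulting bank hash diverge.
fn frozen_bank_hash(
    parent_hash: &Hash,
    accounts_delta_hash: &Hash,
    signature_count: u64,
    last_blockhash: &Hash,
) -> Hash {
    let signature_count_bytes = signature_count.to_le_bytes();
    hashv(&[
        parent_hash.as_ref(),
        accounts_delta_hash.as_ref(),
        &signature_count_bytes,
        last_blockhash.as_ref(),
    ])
}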

Proposed Solution

Debug and fix.

The first obstacle seems to be that the snapshots generated during the test hit 'Load from snapshot failed: Serialize(Io(Custom { kind: Other, error: "incorrect layout/length/data" }))' when trying to boot from them. This may be related to the update_accounts_hash() call being commented out in AccountsBackgroundService. I'm trying to see whether I can salvage these snapshots to avoid replaying from genesis.

@sakridge @CriesofCarrots @ryoqun

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 24 (24 by maintainers)

Most upvoted comments

@Mrmaxmeier The last inconsistency you found should be fixed with this: https://github.com/solana-labs/rbpf/tree/fix/lddw_related_bugs

I also updated the CLI tool so that you can run the verifier on the executables.

@Mrmaxmeier You’ve seen this issue on non-jit builds?

Looks like a difference with JIT:

This is the status with bpf_jit enabled:

[2021-02-06T04:13:19.016641109Z INFO  solana_runtime::bank] 
slot: 34795614 result: Err(InstructionError(2, ProgramFailedToComplete)) 
tx: Transaction { 
signatures: [
5wweTBQoMzwfwAheb8rxBW35eyydWkA9xZgJ28fPZqpjRDQ8st96jqzAa4Dot26b95KK6ePo3wn1Hadvuq9Q7eoF, 5bV6Tg1JrbURFn5Z4LkdBQ3W4eZHLSvsnMBgzjN8GELAv1YZzzBdn74igEVeos9WFBSkZMq2RVL7ELsukT7ptQzv, 
6CZyTSKSUArmTHcEytj5Bgiib73zVpNqpKytwyN2HcygbA4VdqY2Jfw2hXDXmLSzQr9z4EHaWRk6N2tSwo9KvCA], 
message: 
Message { header: MessageHeader { num_required_signatures: 3, 
num_readonly_signed_accounts: 0, 
num_readonly_unsigned_accounts: 4 }, 
account_keys: [
UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR, 
4mZ9cR7Tx9gKRtWnp1EySKye16uPXad2AEE754F8Rcp4, 
7ePbDTiv3aXGt5W1V1gi6gdWJYM4WMvfYKNWYYcyHYgo, 
42FuDW9zi4f2N2eDSkcejepUSGYkK1AmZbyQFAEed5vB, 
SysvarRent111111111111111111111111111111111, 
11111111111111111111111111111111, 
DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r], 
recent_blockhash: 76dYvw5XE7guX7JUzDEatLxFxMyCEtZTCU2RQyBmFfTd, 
instructions: [
CompiledInstruction { program_id_index: 5, accounts: [0, 1], data: [0, 0, 0, 0, 176, 215, 29, 0, 0, 0, 0, 0, 153, 0, 0, 0, 0, 0, 0, 0, 190, 232, 10, 83, 77, 59, 142, 208, 168, 148, 182, 127, 113, 55, 52, 145, 132, 139, 231, 113, 16, 251, 76, 26, 227, 211, 122, 16, 24, 57, 215, 31] },
CompiledInstruction { program_id_index: 5, accounts: [0, 2], data: [0, 0, 0, 0, 224, 192, 121, 3, 0, 0, 0, 0, 58, 32, 0, 0, 0, 0, 0, 0, 190, 232, 10, 83, 77, 59, 142, 208, 168, 148, 182, 127, 113, 55, 52, 145, 132, 139, 231, 113, 16, 251, 76, 26, 227, 211, 122, 16, 24, 57, 215, 31] },
CompiledInstruction { program_id_index: 6, accounts: [1, 2, 3, 4], data: [197, 45, 4, 78, 201, 65, 227, 90, 60, 13, 86, 125, 70, 182, 243, 0, 64, 145, 106, 181, 101, 72, 164, 89, 201, 119, 245, 181, 174, 101, 210, 14, 167, 67, 207, 87, 132, 156, 177, 233, 7, 13, 152, 213, 6, 184, 151, 196, 41, 112, 199, 143, 212, 112, 182, 32, 144, 122, 136, 202, 62, 103, 115, 216, 132, 210, 187, 138, 241, 109, 176, 240, 251, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 170, 0, 0, 0] }] } }

cc @Lichtso @jackcmay

OK, logs are now in ~/issue-14948 on dv/dw/da. ~/issue-14948 on dv also contains the full ledger (which is consistent across the machines), and you can run:

sol@validator-us-west1-b:~$ solana-ledger-tool --ledger issue-14948/ verify --halt-at-slot 34795615

to observe how dv, running v1.5, calculates the correct bank hash with ledger-tool.

So in both of the available snapshots, the corrupted AppendVec is for slot 174160, id: 331582

AppendVec { path: "/home/carl_solana_com/DebugConsensus/accounts/174160.331582", map: MmapMut { ptr: 0x7f0cd4ed2000, len: 286720 }, append_offset: Mutex { data: 285149 }, current_len: 285149, file_size: 286720, remove_on_drop: true }

The first ~70 accounts in the AppendVec are all readable, and then it becomes a bunch of the above garbage…

Given that the mismatch popped up almost 7000 slots later, in slot 181896, I'm guessing something corrupted the AppendVec after it was written and we then read from the corrupted data. Maybe shrink or recycle is responsible. It would be interesting to see whether all the accounts in that storage entry were actually accounts touched by transactions in that slot (a sketch of that cross-check follows).
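A minimal sketch of that cross-check, assuming the pubkeys in the suspect storage entry and the account keys referenced by the slot's transactions have already been dumped by some debug tooling (both inputs are hypothetical, not existing helpers):

use std::collections::HashSet;
use solana_sdk::pubkey::Pubkey;

// Return the pubkeys present in the storage entry that no transaction in the
// slot referenced. Keep in mind that non-transaction writes (e.g. rent
// collection) can also legitimately land accounts in a slot's storage.
fn unexpected_stored_accounts(
    stored_pubkeys: &[Pubkey],   // dumped from accounts/174160.331582
    tx_account_keys: &[Pubkey],  // collected from the slot's transactions
) -> Vec<Pubkey> {
    let touched: HashSet<&Pubkey> = tx_account_keys.iter().collect();
    stored_pubkeys
        .iter()
        .filter(|pk| !touched.contains(pk))
        .cloned()
        .collect()
}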


The snapshot-load failure ('incorrect layout/length/data') actually only seems to be happening on the node that forked off…interesting. The check that is failing is self.account_meta.lamports != 0 || self.clone_account() == Account::default() in append_vec::sanitize_lamports(). The "account" that fails this check looks like:

StoredAccountMeta { meta: StoredMeta { write_version: 0, pubkey: 11111111111111111111111111111111, data_len: 0 }, account_meta: AccountMeta { lamports: 0, owner: 11111111111111111111111111111111, executable: false, rent_epoch: 4294967296 }, data: [], offset: 31192, stored_size: 136, hash: 1111111111111111VJBRrGXMNSx9EXYe92GTF5 }
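For reference, here is a standalone, simplified sketch of that predicate (not the actual append_vec.rs code) showing why a zero-lamport entry with rent_epoch: 4294967296 fails while a fully-default entry passes:

// Simplified stand-in for StoredAccountMeta's cloned account; the real check
// compares against solana_sdk::account::Account::default().
#[derive(Default, PartialEq)]
struct SketchAccount {
    lamports: u64,
    data: Vec<u8>,
    owner: [u8; 32], // all zeros corresponds to the default pubkey 111...111
    executable: bool,
    rent_epoch: u64,
}

// Mirrors the quoted condition: zero-lamport entries must be fully default.
fn sanitize_lamports(account: &SketchAccount) -> bool {
    account.lamports != 0 || *account == SketchAccount::default()
}

fn main() {
    let corrupted = SketchAccount { rent_epoch: 4294967296, ..Default::default() };
    assert!(!sanitize_lamports(&corrupted));                // the entry from the log fails
    assert!(sanitize_lamports(&SketchAccount::default()));  // a fully-default entry passes
}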

I believe that StoredAccountMeta entry is a corrupted storage entry: entries like it appear multiple times, but usually they are fully default and therefore do not trigger the check. They look like this:

System account: StoredAccountMeta { meta: StoredMeta { write_version: 0, pubkey: 11111111111111111111111111111111, data_len: 0 }, account_meta: AccountMeta { lamports: 0, owner: 11111111111111111111111111111111, executable: false, rent_epoch: 0 }, data: [], offset: 30240, stored_size: 136, hash: 11111111111111111111111111111111 }

Note these are not the system account, which looks like: StoredAccountMeta { meta: StoredMeta { write_version: 509, pubkey: 11111111111111111111111111111111, data_len: 14 }, account_meta: AccountMeta { lamports: 1, owner: NativeLoader1111111111111111111111111111111, executable: true, rent_epoch: 0 }, data: [115, 121, 115, 116, 101, 109, 95, 112, 114, 111, 103, 114, 97, 109], offset: 396536, stored_size: 152, hash: APNoNXDhUuQ8CLG8xRaQUUQxof1jsA8qLrhaQAM3NyJp }
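(As a quick sanity check that the real system account entry above is intact: its 14 data bytes are just the ASCII name of the native program, which is what NativeLoader-owned accounts carry in their data. A throwaway snippet:)

fn main() {
    // Data bytes from the real system account entry quoted above.
    let data = vec![115, 121, 115, 116, 101, 109, 95, 112, 114, 111, 103, 114, 97, 109];
    assert_eq!(String::from_utf8(data).unwrap(), "system_program");
}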

This leads me to believe that some of the AppendVecs are not being stored properly or are being corrupted; maybe I've introduced a bug in append_accounts().