solana: Possible bug when leader > cluster duplicate slot hash mismatch
Problem
Today, at slot 131436226, we were the leader and the following log line occurred:
```
[2022-04-26T17:59:42.895915125Z WARN solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot 131436226 with hash 7q2ZaPHLnUp2Gegnf5gyF35bQZmwJG6ziZafjshaRvqw, but our version has hash 6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe
```
The validator stalled at that point and began falling behind. The only fix was to restart. I loaded a new incremental snapshot before restarting; startup and catch-up from that point were normal.
@bji pointed me towards this log and has experienced this issue as well.
A major concern is that recovery required manual intervention.
Validator identity: 1aine15iEqZxYySNwcHtQFt4Sgc75cbEi9wks8YgNCa
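For readers unfamiliar with this path: the warning fires when the cluster gossips a duplicate-confirmed hash for a slot that differs from the hash our own bank froze. Below is a minimal sketch of that check; it is not the actual solana_core::cluster_slot_state_verifier code, and all type and function names here are illustrative assumptions.

```
// Minimal sketch of the duplicate-confirmed hash check (illustrative,
// not the real solana_core implementation).
type Slot = u64;
type Hash = &'static str; // stand-in for solana_sdk::hash::Hash

#[derive(Debug)]
enum BankStatus {
    Frozen(Hash),
    #[allow(dead_code)]
    Unprocessed,
}

#[derive(Debug)]
enum Action {
    /// Our frozen bank matches the cluster-confirmed hash; nothing to do.
    Keep,
    /// Mismatch: mark our fork invalid, purge the slot, and ask the
    /// repair service for the cluster-confirmed version.
    PurgeAndRepair { slot: Slot, cluster_hash: Hash },
}

fn on_duplicate_confirmed(slot: Slot, cluster_hash: Hash, status: BankStatus) -> Action {
    match status {
        BankStatus::Frozen(our_hash) if our_hash == cluster_hash => Action::Keep,
        other => {
            eprintln!(
                "Cluster duplicate confirmed slot {slot} with hash {cluster_hash}, \
                 but our version is {other:?}"
            );
            Action::PurgeAndRepair { slot, cluster_hash }
        }
    }
}

fn main() {
    // Values taken from the log line above.
    let action = on_duplicate_confirmed(
        131436226,
        "7q2ZaPHLnUp2Gegnf5gyF35bQZmwJG6ziZafjshaRvqw",
        BankStatus::Frozen("6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe"),
    );
    println!("{action:?}");
}
```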
More complete logs below (40+ lines starting from the first duplicate-confirmed message). Metrics were open the entire time.

```
slot: 131436226,
root: 131436194,
slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: 7q2ZaPHLnUp2Gegnf5gyF35bQZmwJG6ziZafjshaRvqw, bank_status: Frozen(6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe) })
[2022-04-26T17:59:42.895915125Z WARN solana_core::cluster_slot_state_verifier] Cluster duplicate confirmed slot 131436226 with hash 7q2ZaPHLnUp2Gegnf5gyF35bQZmwJG6ziZafjshaRvqw, but our version has hash 6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe
[2022-04-26T17:59:42.895922215Z INFO solana_core::heaviest_subtree_fork_choice] marking fork starting at: (131436226, 6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe) invalid candidate
[2022-04-26T17:59:42.895940195Z INFO solana_core::heaviest_subtree_fork_choice] Fork choice for (131436227, ABpYtu2XmAa1TaQzqk3EcEwG27ykM1eX2U6Kqw6yVeDr) setting latest invalid ancestor from None to 131436226
[2022-04-26T17:59:42.895945765Z INFO solana_core::heaviest_subtree_fork_choice] Fork choice for (131436226, 6aw1Hbc1Sj7FAeCCFDEX3Z9DPFSuCCGmZFWRd1cVoGTe) setting latest invalid ancestor from None to 131436226
[2022-04-26T17:59:42.896281224Z INFO solana_core::replay_stage] Waiting to switch vote to 131436225, resetting to slot Some(131436225) for now, latest duplicate ancestor: 131436226
[2022-04-26T17:59:42.896285784Z INFO solana_core::replay_stage] vote bank: None reset bank: 131436225
[2022-04-26T17:59:42.896295855Z INFO solana_metrics::metrics] datapoint: blocks_produced num_blocks_on_fork=271772i num_dropped_blocks_on_fork=31682i
[2022-04-26T17:59:42.896326126Z INFO solana_poh::poh_recorder] reset poh from: DL6hz29Bxy6jrrxbAyMqBDc7biHM89zuE7PxLd1crKFt,8411918814,131436227 to: DL6hz29Bxy6jrrxbAyMqBDc7biHM89zuE7PxLd1crKFt,131436225
[2022-04-26T17:59:42.896334536Z INFO solana_core::replay_stage] 1aine15iEqZxYySNwcHtQFt4Sgc75cbEi9wks8YgNCa reset PoH to tick 8411918464 (within slot 131436225). My next leader slot is 131441800
[2022-04-26T17:59:42.896339136Z WARN solana_core::replay_stage] PARTITION DETECTED waiting to join heaviest fork: 131436225 last vote: 131436227, reset slot: 131436225
[2022-04-26T17:59:42.896349536Z WARN solana_core::replay_stage] purging slot 131436226
[2022-04-26T17:59:42.896354546Z INFO solana_metrics::metrics] datapoint: replay_stage-partition slot=131436225i
[2022-04-26T17:59:42.896386597Z INFO solana_runtime::accounts_db] remove_dead_slots_metadata: slots [131436228]
[2022-04-26T17:59:42.902433371Z INFO solana_runtime::accounts_db] remove_dead_slots_metadata: slots [131436227]
[2022-04-26T17:59:42.907774046Z INFO solana_runtime::accounts_db] remove_dead_slots_metadata: slots [131436226]
[2022-04-26T17:59:42.909369919Z WARN solana_core::replay_stage] purging descendant: 131436228 with slot_id 293496, of slot 131436226
[2022-04-26T17:59:42.909699088Z WARN solana_core::replay_stage] purging descendant: 131436227 with slot_id 293495, of slot 131436226
[2022-04-26T17:59:42.910475439Z INFO solana_ledger::blockstore] Pruning expired primary index 0 up to slot 0 (max requested: 131436228)
[2022-04-26T17:59:42.910878430Z WARN solana_core::replay_stage] purging descendant: 131436226 with slot_id 293494, of slot 131436226
[2022-04-26T17:59:42.911572889Z INFO solana_ledger::blockstore] Pruning expired primary index 1 up to slot 0 (max requested: 131436227)
[2022-04-26T17:59:42.911893278Z WARN solana_core::replay_stage] Notifying repair service to repair duplicate slot: 131436226
[2022-04-26T17:59:42.912386421Z INFO solana_core::replay_stage] new fork:131436226 parent:131436225 root:131436194
[2022-04-26T17:59:42.913697896Z INFO solana_metrics::metrics] datapoint: bank-timestamp get_timestamp_estimate_us=1121i
[2022-04-26T17:59:42.913714877Z INFO solana_metrics::metrics] datapoint: bank-timestamp-correction slot=131436226i from_genesis=1636943430i corrected=1650995979i ancestor_timestamp=1650995978i
[2022-04-26T17:59:42.913783919Z INFO solana_metrics::metrics] datapoint: bank-new_from_parent-heights slot_height=131436226i block_height=119217852i parent_slot=131436225i bank_rc_creation_us=1i total_elapsed_us=1379i status_cache_rc_us=0i fee_components_us=0i blockhash_queue_us=8i stakes_cache_us=0i epoch_stakes_time_us=0i builtin_programs_us=0i rewards_pool_pubkeys_us=0i cached_executors_us=0i transaction_debug_keys_us=0i transaction_log_collector_config_us=0i feature_set_us=0i ancestors_us=4i update_epoch_us=0i update_sysvars_us=1291i fill_sysvar_cache_us=64i
[2022-04-26T17:59:42.913788929Z INFO solana_metrics::metrics] datapoint: bank-executor-cache-stats slot=131436225i hits=818i misses=9i evictions=9i insertions=9i replacements=0i
[2022-04-26T17:59:42.913911902Z INFO solana_core::replay_stage] Waiting to switch vote to 131436225, resetting to slot Some(131436225) for now, latest duplicate ancestor: 131436226
[2022-04-26T17:59:42.944235743Z INFO solana_metrics::metrics] datapoint: sigverify_stage-total_verify_time recv_batches_us_90pct=3i recv_batches_us_min=0i recv_batches_us_max=7i recv_batches_us_mean=1i verify_batches_pp_us_90pct=30622i verify_batches_pp_us_min=9i verify_batches_pp_us_max=114230i verify_batches_pp_us_mean=7574i discard_packets_pp_us_90pct=0i discard_packets_pp_us_min=0i discard_packets_pp_us_max=0i discard_packets_pp_us_mean=0i dedup_packets_pp_us_90pct=1i dedup_packets_pp_us_min=0i dedup_packets_pp_us_max=1i dedup_packets_pp_us_mean=1i batches_90pct=8i batches_min=1i batches_max=18i batches_mean=3i packets_90pct=67i packets_min=1i packets_max=820i packets_mean=40i total_batches=214i total_packets=3606i total_dedup=34i total_excess_fail=0i total_shrink_time=33i total_shrinks=0i total_valid_packets=3572i
[2022-04-26T17:59:42.957398320Z INFO solana_metrics::metrics] datapoint: shred_fetch_repair index_overrun=0i shred_count=1i slot_bad_deserialize=0i index_bad_deserialize=0i index_out_of_bounds=0i slot_out_of_range=0i duplicate_shred=0i
[2022-04-26T17:59:42.983136027Z INFO solana_metrics::metrics] datapoint: memory-stats total=540776697856i swap_total=414654780i free_percent=2.2384597176602794 used_bytes=99770486784i avail_percent=81.55052035719052 buffers_percent=0.5844519759321454 cached_percent=77.03033505761813 swap_free_percent=99.99271490370857
[2022-04-26T17:59:43.014917177Z INFO solana_core::replay_stage] Waiting to switch vote to 131436225, resetting to slot Some(131436225) for now, latest duplicate ancestor: 131436226
[2022-04-26T17:59:43.021546887Z INFO solana_metrics::metrics] datapoint: qos-service-stats compute_cost_time=0i compute_cost_count=0i cost_tracking_time=0i selected_txs_count=0i retried_txs_per_block_limit_count=0i retried_txs_per_vote_limit_count=0i retried_txs_per_account_limit_count=0i estimated_signature_cu=0i estimated_write_lock_cu=0i estimated_data_bytes_cu=0i estimated_builtins_execute_cu=0i estimated_bpf_execute_cu=0i actual_bpf_execute_cu=0i actual_execute_time_us=0i
[2022-04-26T17:59:43.027587231Z INFO solana_metrics::metrics] datapoint: poh-service ticks=183i hashes=2307968i elapsed_us=351876i total_sleep_us=780058i total_tick_time_us=1191i total_lock_time_us=2013i total_hash_time_us=194809i total_record_time_us=0i total_send_record_result_us=0i
[2022-04-26T17:59:43.087092862Z INFO solana_metrics::metrics] datapoint: optimistic_slot_elapsed average_elapsed_ms=205i
[2022-04-26T17:59:43.087106463Z INFO solana_metrics::metrics] datapoint: optimistic_slot slot=131436226i
[2022-04-26T17:59:43.116111039Z INFO solana_core::replay_stage] Waiting to switch vote to 131436225, resetting to slot Some(131436225) for now, latest duplicate ancestor: 131436226
[2022-04-26T17:59:43.121525425Z INFO solana_metrics::metrics] datapoint: shred_fetch index_overrun=0i shred_count=2i slot_bad_deserialize=0i index_bad_deserialize=0i index_out_of_bounds=0i slot_out_of_range=0i duplicate_shred=0i
[2022-04-26T17:59:43.136415028Z INFO solana_metrics::metrics] datapoint: cost-update-service-stats total_elapsed_us=1001000i update_cost_model_count=0i update_cost_model_elapsed=6i
[2022-04-26T17:59:43.171459287Z INFO solana_metrics::metrics] datapoint: banking_stage-loop-stats id=2i receive_and_buffer_packets_count=24243i dropped_packet_batches_count=0i dropped_packets_count=0i newly_buffered_packets_count=17408i current_buffered_packet_batches_count=1i current_buffered_packets_count=49i rebuffered_packets_count=0i consumed_buffered_packets_count=0i end_of_slot_filtered_invalid_count=0i consume_buffered_packets_elapsed=0i receive_and_buffer_packets_elapsed=720i handle_retryable_packets_elapsed=0i filter_pending_packets_elapsed=0i packet_conversion_elapsed=0i unprocessed_packet_conversion_elapsed=0i transaction_processing_elapsed=0i packet_batch_indices_len_min=1i packet_batch_indices_len_max=128i packet_batch_indices_len_mean=20i packet_batch_indices_len_90pct=49i
[2022-04-26T17:59:43.217625338Z INFO solana_core::replay_stage] Waiting to switch vote to 131436225, resetting to slot Some(131436225) for now, latest duplicate ancestor: 131436226
[2022-04-26T17:59:43.263589563Z INFO solana_metrics::metrics] datapoint: shred_insert_is_full total_time_ms=344i slot=131436228i last_index=125i num_repaired=2i num_recovered=47i
```
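Piecing the log lines together, the validator reacts roughly as follows: it marks our version of the fork as an invalid candidate, resets PoH to the last slot before the duplicate, purges the duplicate slot and its descendants, and asks the repair service for the cluster-confirmed version. A hedged sketch of that sequence, with hypothetical function names standing in for the real replay_stage / fork-choice internals:

```
// Reconstruction of the recovery flow visible in the logs above.
// Function names are hypothetical stand-ins, not the real API.
type Slot = u64;

fn mark_fork_invalid_candidate(slot: Slot) {
    println!("marking fork starting at {slot} invalid candidate"); // heaviest_subtree_fork_choice
}

fn reset_poh(slot: Slot) {
    println!("reset PoH to within slot {slot}"); // poh_recorder reset to last good slot
}

fn purge_slot_and_descendants(slot: Slot) {
    println!("purging slot {slot} and its descendants"); // replay_stage "purging" lines
}

fn notify_repair_service(slot: Slot) {
    println!("Notifying repair service to repair duplicate slot: {slot}");
}

fn handle_duplicate_confirmed_mismatch(duplicate_slot: Slot, reset_slot: Slot) {
    mark_fork_invalid_candidate(duplicate_slot);
    reset_poh(reset_slot);
    purge_slot_and_descendants(duplicate_slot);
    notify_repair_service(duplicate_slot);
    // After this, the validator loops on "Waiting to switch vote to
    // <reset_slot> ..." until the repaired slot replays with the
    // cluster-confirmed hash; in this incident it appeared to stall here.
}

fn main() {
    handle_duplicate_confirmed_mismatch(131436226, 131436225);
}
```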
About this issue
- State: closed
- Created 2 years ago
- Comments: 25 (22 by maintainers)
@steviez Our command-line arguments from when we were hit by the leader bug.
Done, branch laine_performance_1.9.18
No, there are other changes (mostly in voting_service). I don’t think they would or should affect anything relating to block packing or consensus; it’s just a change to send votes to multiple upcoming leaders.
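For context, a minimal sketch of what "send votes to multiple upcoming leaders" could look like; the fanout constant, function name, and address handling are assumptions for illustration, not the actual voting_service change:

```
use std::net::{SocketAddr, UdpSocket};

// Assumed fanout value; the real patch may differ.
const VOTE_FANOUT: usize = 4;

/// Forward a serialized vote transaction to the TPU ports of the next few
/// leaders in the schedule, instead of only the current leader.
fn send_vote_to_upcoming_leaders(
    socket: &UdpSocket,
    vote_tx_bytes: &[u8],
    upcoming_leader_tpus: &[SocketAddr], // derived from the leader schedule
) -> std::io::Result<()> {
    for addr in upcoming_leader_tpus.iter().take(VOTE_FANOUT) {
        socket.send_to(vote_tx_bytes, addr)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    // Example with a placeholder address.
    let leaders: Vec<SocketAddr> = vec!["127.0.0.1:8003".parse().unwrap()];
    send_vote_to_upcoming_leaders(&socket, b"vote-tx-bytes", &leaders)
}
```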
yep!
Here you go: 500k lines before + 50k after. Sorry for the delay, I wasn’t at my PC again the past few hours.
log-leader-crash-before.tar.gz