solana: panic creating transient file in accounts hash calc
Problem
mcb5i4hCFKHK26z1pCbAYBB2efHnS2EZMv2bRfFaL7x panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21
Memory use was low, 3 TB of disk space was free, and the folder exists. The accounts hash calculation had completed within the last minute and had been completing many times back to back without issue.
There were not too many open file descriptors:
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:22.712541850Z INFO solana_metrics::metrics] datapoint: os-config vm.max_map_count=2000000i
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:22.712541630Z INFO solana_core::system_monitor_service] vm.max_map_count: recommended=1000000 current=2000000
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:23.968300709Z INFO solana_ledger::blockstore] Maximum open file descriptors: 1000000
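For context, here is a hypothetical sketch of the failing pattern. This is not the actual solana-runtime code (the OS call that fails could equally be an mmap of the transient file rather than a plain create/write), but it shows how an io::Error such as ENOMEM in this path becomes a validator-killing panic: the result is unwrapped with the "Unable to write file within ..." message instead of being handled.

```rust
use std::io::Write;
use std::path::Path;

// Hypothetical reconstruction (not the real accounts_hash.rs code): an
// io::Error returned by the OS while creating or writing a transient cache
// file -- here ENOMEM, "Cannot allocate memory (os error 12)" -- is turned
// into a panic, which takes the whole validator thread down.
fn write_transient(cache_dir: &Path, bytes: &[u8]) {
    let path = cache_dir.join("example_transient_file");
    let mut file = std::fs::File::create(&path).unwrap_or_else(|err| {
        panic!("Unable to write file within {}: {err}", cache_dir.display())
    });
    file.write_all(bytes).unwrap_or_else(|err| {
        panic!("Unable to write file within {}: {err}", cache_dir.display())
    });
}

fn main() {
    write_transient(Path::new("/tmp"), &[0u8; 32]);
}
```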
Backtrace
thread 'solAccountsLo06' panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21
stack backtrace:
0: rust_begin_unwind
at ./rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
1: core::panicking::panic_fmt
at ./rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
2: solana_runtime::accounts_hash::AccountHashesFile::write
3: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
4: rayon::iter::plumbing::Folder::consume_iter
5: rayon::iter::plumbing::bridge_producer_consumer::helper
6: rayon_core::job::StackJob<L,F,R>::run_inline
7: rayon_core::join::join_context::{{closure}}
8: rayon_core::registry::in_worker
9: rayon::iter::plumbing::bridge_producer_consumer::helper
10: rayon_core::join::join_context::{{closure}}
11: rayon_core::registry::in_worker
12: rayon::iter::plumbing::bridge_producer_consumer::helper
13: rayon_core::join::join_context::{{closure}}
14: rayon_core::registry::in_worker
15: rayon::iter::plumbing::bridge_producer_consumer::helper
16: rayon_core::join::join_context::{{closure}}
17: rayon_core::registry::in_worker
18: rayon::iter::plumbing::bridge_producer_consumer::helper
19: rayon_core::join::join_context::{{closure}}
20: rayon_core::registry::in_worker
21: rayon::iter::plumbing::bridge_producer_consumer::helper
22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
23: rayon_core::registry::WorkerThread::wait_until_cold
24: rayon_core::join::join_context::{{closure}}
25: rayon_core::registry::in_worker
26: rayon::iter::plumbing::bridge_producer_consumer::helper
27: rayon_core::job::StackJob<L,F,R>::run_inline
28: rayon_core::join::join_context::{{closure}}
29: rayon_core::registry::in_worker
30: rayon::iter::plumbing::bridge_producer_consumer::helper
31: rayon_core::join::join_context::{{closure}}
32: rayon_core::registry::in_worker
33: rayon::iter::plumbing::bridge_producer_consumer::helper
34: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
35: rayon_core::registry::WorkerThread::wait_until_cold
36: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2023-09-06T21:19:35.927609053Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAccountsLo06" one=1i message="panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21" location="runtime/src/accounts_hash.rs:108:21" version="1.16.12 (src:f81349cb; feat:3949673676, client:SolanaLabs)"
Proposed Solution
Debug and fix.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 22 (22 by maintainers)
I can definitely experiment with this. Let’s see, what’s the max power of 2 that fits in a u64… Ha. I’ll run some of these.
@steviez Thanks! Yeah, the slab cache usage is high. https://github.com/solana-labs/solana/pull/33178 should help reduce the number of active slab objects held by the writer.
re @jeffwashington yeah, reducing the number of bins or the buffer size will help too.
re @brooksprumo I think the oom-killer doesn't account for the slab cache being full. I don't think the slab cache size can be configured; it is managed by the kernel's allocator.
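As a side note, here is a small standalone snippet (not part of the validator) that dumps the kernel slab counters from /proc/meminfo; watching Slab/SReclaimable/SUnreclaim while the hash calculation runs can confirm whether slab growth, rather than overall memory pressure, is what makes the kernel return ENOMEM:

```rust
use std::fs;

// Print the kernel slab counters from /proc/meminfo. Run this periodically
// alongside the validator to see whether slab usage climbs during the
// accounts hash calculation.
fn main() -> std::io::Result<()> {
    let meminfo = fs::read_to_string("/proc/meminfo")?;
    for line in meminfo.lines() {
        if line.starts_with("Slab:")
            || line.starts_with("SReclaimable:")
            || line.starts_with("SUnreclaim:")
        {
            println!("{line}");
        }
    }
    Ok(())
}
```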
I think there might be a problem with AccountHashesFile.
Since there are 65536 bins, there could be ~65K hash files. Inside each AccountHashesFile, we currently cache the BufWriter, which allocates an 8 KiB buffer per file, i.e. 8 KiB * 65K of buffers in total. That could lead to slab cache OOM.
I am working on a PR to fix this by caching the File handle instead.
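A minimal sketch of the idea, using hypothetical type names (the real struct is AccountHashesFile in runtime/src/accounts_hash.rs): holding a BufWriter per bin pins its default 8 KiB buffer for the lifetime of the struct, roughly 512 MiB across 65,536 bins, whereas caching only the File and wrapping it in a short-lived BufWriter at write time frees that memory between writes.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Per-bin state that keeps a BufWriter alive: its 8 KiB internal buffer
// stays allocated as long as the struct exists. With ~65,536 bins this is
// roughly 512 MiB of buffers held concurrently.
#[allow(dead_code)]
struct BinWithCachedWriter {
    writer: BufWriter<File>,
}

// Alternative: keep only the File handle; the buffer exists only while a
// batch of hashes is being appended.
struct BinWithCachedFile {
    file: File,
}

impl BinWithCachedFile {
    fn append(&mut self, bytes: &[u8]) -> std::io::Result<()> {
        // Short-lived BufWriter over &mut File; its buffer is freed on drop.
        let mut writer = BufWriter::new(&mut self.file);
        writer.write_all(bytes)?;
        writer.flush()
    }
}

fn main() -> std::io::Result<()> {
    let file = File::create("/tmp/accounts_hash_cache_example.bin")?;
    let mut bin = BinWithCachedFile { file };
    bin.append(&[0u8; 32])?; // e.g. one 32-byte account hash
    Ok(())
}
```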