solana: panic creating transient file in accounts hash calc
Problem
mcb5i4hCFKHK26z1pCbAYBB2efHnS2EZMv2bRfFaL7x panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21
Memory use was low, 3 TB of disk space was free, and the folder exists. The accounts hash calculation had completed within the last minute and had been completing many times back to back without issue.
There were not too many open file descriptors:
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:22.712541850Z INFO solana_metrics::metrics] datapoint: os-config vm.max_map_count=2000000i
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:22.712541630Z INFO solana_core::system_monitor_service] vm.max_map_count: recommended=1000000 current=2000000
/home/sol/logs/solana-validator.log.5-[2023-09-01T07:04:23.968300709Z INFO solana_ledger::blockstore] Maximum open file descriptors: 1000000
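For context, here is a hypothetical sketch of the failing pattern. This is not the actual solana-runtime code (the OS call that fails could equally be an mmap of the transient file rather than a plain create/write), but it shows how an io::Error such as ENOMEM in this path becomes a validator-killing panic: the result is unwrapped with the "Unable to write file within ..." message instead of being handled.

```rust
use std::io::Write;
use std::path::Path;

// Hypothetical reconstruction (not the real accounts_hash.rs code): an
// io::Error returned by the OS while creating or writing a transient cache
// file -- here ENOMEM, "Cannot allocate memory (os error 12)" -- is turned
// into a panic, which takes the whole validator thread down.
fn write_transient(cache_dir: &Path, bytes: &[u8]) {
    let path = cache_dir.join("example_transient_file");
    let mut file = std::fs::File::create(&path).unwrap_or_else(|err| {
        panic!("Unable to write file within {}: {err}", cache_dir.display())
    });
    file.write_all(bytes).unwrap_or_else(|err| {
        panic!("Unable to write file within {}: {err}", cache_dir.display())
    });
}

fn main() {
    write_transient(Path::new("/tmp"), &[0u8; 32]);
}
```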
Backtrace
thread 'solAccountsLo06' panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21
stack backtrace:
0: rust_begin_unwind
at ./rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
1: core::panicking::panic_fmt
at ./rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
2: solana_runtime::accounts_hash::AccountHashesFile::write
3: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &F>::call_mut
4: rayon::iter::plumbing::Folder::consume_iter
5: rayon::iter::plumbing::bridge_producer_consumer::helper
6: rayon_core::job::StackJob<L,F,R>::run_inline
7: rayon_core::join::join_context::{{closure}}
8: rayon_core::registry::in_worker
9: rayon::iter::plumbing::bridge_producer_consumer::helper
10: rayon_core::join::join_context::{{closure}}
11: rayon_core::registry::in_worker
12: rayon::iter::plumbing::bridge_producer_consumer::helper
13: rayon_core::join::join_context::{{closure}}
14: rayon_core::registry::in_worker
15: rayon::iter::plumbing::bridge_producer_consumer::helper
16: rayon_core::join::join_context::{{closure}}
17: rayon_core::registry::in_worker
18: rayon::iter::plumbing::bridge_producer_consumer::helper
19: rayon_core::join::join_context::{{closure}}
20: rayon_core::registry::in_worker
21: rayon::iter::plumbing::bridge_producer_consumer::helper
22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
23: rayon_core::registry::WorkerThread::wait_until_cold
24: rayon_core::join::join_context::{{closure}}
25: rayon_core::registry::in_worker
26: rayon::iter::plumbing::bridge_producer_consumer::helper
27: rayon_core::job::StackJob<L,F,R>::run_inline
28: rayon_core::join::join_context::{{closure}}
29: rayon_core::registry::in_worker
30: rayon::iter::plumbing::bridge_producer_consumer::helper
31: rayon_core::join::join_context::{{closure}}
32: rayon_core::registry::in_worker
33: rayon::iter::plumbing::bridge_producer_consumer::helper
34: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
35: rayon_core::registry::WorkerThread::wait_until_cold
36: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[2023-09-06T21:19:35.927609053Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solAccountsLo06" one=1i message="panicked at 'Unable to write file within /home/sol/ledger/accounts_hash_cache/transient: Cannot allocate memory (os error 12)', runtime/src/accounts_hash.rs:108:21" location="runtime/src/accounts_hash.rs:108:21" version="1.16.12 (src:f81349cb; feat:3949673676, client:SolanaLabs)"
Proposed Solution
Debug and fix.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 22 (22 by maintainers)
I can definitely experiment with this. Let’s see, what’s the max power of 2 that fits in a u64… Ha. I’ll run some of these.
@steviez Thanks! Yeah, the slab cache usage is high. https://github.com/solana-labs/solana/pull/33178 should help reduce the number of active slab objects held by the writer.
re @jeffwashington yeah, reducing the number of bins or the buffer size will help too.
re @brooksprumo I think the oom-killer doesn't account for the slab cache being full. I don't think the slab cache size can be configured; it is managed by the kernel's allocator.
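As a side note, here is a small standalone snippet (not part of the validator) that dumps the kernel slab counters from /proc/meminfo; watching Slab/SReclaimable/SUnreclaim while the hash calculation runs can confirm whether slab growth, rather than overall memory pressure, is what makes the kernel return ENOMEM:

```rust
use std::fs;

// Print the kernel slab counters from /proc/meminfo. Run this periodically
// alongside the validator to see whether slab usage climbs during the
// accounts hash calculation.
fn main() -> std::io::Result<()> {
    let meminfo = fs::read_to_string("/proc/meminfo")?;
    for line in meminfo.lines() {
        if line.starts_with("Slab:")
            || line.starts_with("SReclaimable:")
            || line.starts_with("SUnreclaim:")
        {
            println!("{line}");
        }
    }
    Ok(())
}
```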
I think there might be a problem with AccountHashesFile.
Since there are 65536 bins, there could be ~65K hash files. Inside each AccountHashesFile, we currently cache the BufWriter, which allocates an 8 KiB buffer per file, i.e. 8 KiB * 65K of buffers in total. That could lead to slab cache OOM.
I am working on a PR to fix this by caching the File handle instead.
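A minimal sketch of the idea, using hypothetical type names (the real struct is AccountHashesFile in runtime/src/accounts_hash.rs): holding a BufWriter per bin pins its default 8 KiB buffer for the lifetime of the struct, roughly 512 MiB across 65,536 bins, whereas caching only the File and wrapping it in a short-lived BufWriter at write time frees that memory between writes.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Per-bin state that keeps a BufWriter alive: its 8 KiB internal buffer
// stays allocated as long as the struct exists. With ~65,536 bins this is
// roughly 512 MiB of buffers held concurrently.
#[allow(dead_code)]
struct BinWithCachedWriter {
    writer: BufWriter<File>,
}

// Alternative: keep only the File handle; the buffer exists only while a
// batch of hashes is being appended.
struct BinWithCachedFile {
    file: File,
}

impl BinWithCachedFile {
    fn append(&mut self, bytes: &[u8]) -> std::io::Result<()> {
        // Short-lived BufWriter over &mut File; its buffer is freed on drop.
        let mut writer = BufWriter::new(&mut self.file);
        writer.write_all(bytes)?;
        writer.flush()
    }
}

fn main() -> std::io::Result<()> {
    let file = File::create("/tmp/accounts_hash_cache_example.bin")?;
    let mut bin = BinWithCachedFile { file };
    bin.append(&[0u8; 32])?; // e.g. one 32-byte account hash
    Ok(())
}
```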