moka: Segmentation faults in moka-cht under heavy workloads on a many-core machine
I have seen segmentation faults a few times while running mokabench on Moka v0.5.1. They seem to happen randomly while the get_or_insert_with method is heavily called concurrently from many threads.
+ ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once
Cache, Max Capacity, Clients, Inserts, Reads, Hit Rate, Duration Secs
Moka Unsync Cache, 100000, -, 14696832, 31104534, 52.750, 8.575
Moka Cache, 100000, 16, 15550290, 31954711, 51.336, 17.365
Moka Cache, 100000, 24, 15543954, 31948375, 51.347, 17.743
Moka Cache, 100000, 32, 15527876, 31932297, 51.373, 17.877
./run-tests.sh: line 36: 21740 Segmentation fault (core dumped) ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once
I am using Amazon EC2 to run mokabench. After spending a few days, I found that it is related to the version of crossbeam-epoch and the number of CPU cores.
Segfaults? | Moka | cht/moka-cht | crossbeam-epoch | EC2 Instance Type | Arch | vCPUs | OS |
---|---|---|---|---|---|---|---|
Yes | v0.5.1 | moka-cht v0.5.0 | v0.9.5 | c5.9xlarge | x86_64 | 36 | Amazon Linux 2 |
No | v0.5.1 | cht v0.4.1 | v0.8.2 | c5.9xlarge | x86_64 | 36 | Amazon Linux 2 |
No | v0.5.1 | moka-cht v0.5.0 | v0.9.5 | c5.4xlarge | x86_64 | 16 | Amazon Linux 2 |
crossbeam-epoch is used by moka-cht, the concurrent hash table used by Moka.
I examined stack traces from the core dumps and found two patterns. I have not identified the root cause yet. Perhaps a crossbeam_epoch::Owned<T>, which is very similar to Box<T>, stored in moka-cht became a dangling pointer for some reason?
Pattern 1: At Arc::ne()
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
2095 /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs: No such file or directory.
[Current thread is 1 (Thread 0x7fe61d1e8700 (LWP 7009))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /data/core-dumps/mokabench-copy/target/release/mokabench.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0 0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
#1 <alloc::sync::Arc<T> as core::cmp::PartialEq>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2141
#2 core::cmp::impls::<impl core::cmp::PartialEq<&B> for &A>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/cmp.rs:1356
#3 moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify::{{closure}} ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:255
#4 moka_cht::map::bucket::BucketArray<K,V>::probe_loop ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:367
#5 moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:248
#6 0x000055cd72476961 in moka_cht::map::bucket_array_ref::BucketArrayRef<K,V,S>::insert_with_or_modify_entry_and ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket_array_ref.rs:191
#7 0x000055cd7248d19a in moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify_entry_and ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:933
#8 moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:798
#9 moka::sync::value_initializer::ValueInitializer<K,V,S>::try_insert_waiter ()
at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:108
#10 0x000055cd7248cdf8 in moka::sync::value_initializer::ValueInitializer<K,V,S>::init_or_read ()
at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:42
#11 0x000055cd72492f74 in moka::sync::cache::Cache<K,V,S>::get_or_insert_with_hash_and_fun ()
at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:277
#12 moka::sync::cache::Cache<K,V,S>::get_or_insert_with () at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:264
#13 0x000055cd7248f90d in mokabench::cache::sync_cache::SyncCache::get_or_insert_with () at src/cache/sync_cache.rs:43
#14 <mokabench::cache::sync_cache::SyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once ()
at src/cache/sync_cache.rs:79
#15 0x000055cd7246eb87 in <mokabench::cache::sync_cache::SharedSyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once
() at src/cache/sync_cache.rs:125
#16 mokabench::process_commands () at src/lib.rs:107
...
Pattern 2: At atomic_sub() in Arc::drop()
Program terminated with signal SIGSEGV, Segmentation fault.
#0 core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
2401 /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs: No such file or directory.
[Current thread is 1 (Thread 0x7f6e0f9b2900 (LWP 32108))]
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0 core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
#1 core::sync::atomic::AtomicUsize::fetch_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:1769
#2 <alloc::sync::Arc<T> as core::ops::drop::Drop>::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1558
#3 core::ptr::drop_in_place<alloc::sync::Arc<usize>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#4 core::ptr::drop_in_place<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#5 core::ptr::drop_in_place<alloc::boxed::Box<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#6 core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#7 <T as crossbeam_epoch::atomic::Pointable>::drop ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:212
#8 <crossbeam_epoch::atomic::Owned<T> as core::ops::drop::Drop>::drop ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:1087
#9 core::ptr::drop_in_place<crossbeam_epoch::atomic::Owned<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#10 core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#11 moka_cht::map::bucket::defer_acquire_destroy::{{closure}} ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:684
#12 crossbeam_epoch::guard::Guard::defer_unchecked ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/guard.rs:195
#13 moka_cht::map::bucket::defer_acquire_destroy () at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:682
#14 <moka_cht::segment::map::HashMap<K,V,S> as core::ops::drop::Drop>::drop ()
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:1032
#15 0x000055db206daf73 in core::ptr::drop_in_place<moka_cht::segment::map::HashMap<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#16 core::ptr::drop_in_place<moka::future::value_initializer::ValueInitializer<usize,alloc::sync::Arc<alloc::boxed::Box<[u8]>>,std::collections::hash::map::RandomState>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#17 alloc::sync::Arc<T>::drop_slow () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1051
#18 0x000055db206ea837 in mokabench::run_multi_tasks::{{closure}} () at /home/ec2-user/mokabench/src/lib.rs:314
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (12 by maintainers)
Commits related to this issue
- Increase the num segments of the waiters hash table from 16 to 64. This will reduce the chance of issue #34 occurring. — committed to moka-rs/moka by tatsuya6502 2 years ago
- Prevent segmentation fault in `sync` and `future` caches (#34) - Add a lock to the rehash function of the concurrent hash table (`moka::cht`) to ensure only one thread can participate rehashing at ... — committed to moka-rs/moka by tatsuya6502 2 years ago
Finally, I believe I fixed this issue via #157.
Last week, I got a new x86_64-based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot in reproducing and investigating the issue. I found the cause of the issue last night and fixed it. After the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).
The cause was race conditions that occur when many threads concurrently rehash (extend or shrink) the internal hash table moka::cht. The creator of the original cht designed it to work in such a situation, but it does not work as expected. So I added a lock to ensure that only one thread can participate in rehashing at a time. This actually increased performance in my load tests, as it prevents heavy retries of the atomic CAS operation compare_exchange_weak.
Also, I found that the memory ordering used for compare_exchange_weak is too weak for non-x86 platforms and may cause inconsistency between threads. So I changed it to one that I believe is strong enough.
#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).
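For readers who want to see the shape of the two changes described above, here is a minimal std-only sketch. It is not the code from #157. The `Table` type, its fields, `grow_if_needed`, and `cas_add` are hypothetical names made up for illustration, and the `AcqRel`/`Acquire` orderings are an assumption, since the comment does not say which orderings were chosen.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a concurrent hash table. `capacity` plays the
/// role of the pointer to the current bucket array.
struct Table {
    capacity: AtomicUsize,
    rehash_lock: Mutex<()>,
    rehash_count: AtomicUsize,
}

impl Table {
    fn new(capacity: usize) -> Self {
        Table {
            capacity: AtomicUsize::new(capacity),
            rehash_lock: Mutex::new(()),
            rehash_count: AtomicUsize::new(0),
        }
    }

    /// Doubles the capacity unless another thread has already done so.
    /// `observed` is the capacity the caller saw when it decided to grow.
    fn grow_if_needed(&self, observed: usize) {
        // Only one thread at a time may take part in rehashing.
        let _guard = self.rehash_lock.lock().unwrap();
        // Re-check under the lock: a concurrent rehash may have finished.
        if self.capacity.load(Ordering::Acquire) != observed {
            return;
        }
        // A real table would copy the buckets into a new array here.
        self.capacity.store(observed * 2, Ordering::Release);
        self.rehash_count.fetch_add(1, Ordering::Relaxed);
    }
}

/// CAS retry loop using AcqRel on success and Acquire on failure.
/// These orderings are an assumption, chosen to be stronger than Relaxed.
fn cas_add(counter: &AtomicUsize, delta: usize) -> usize {
    let mut current = counter.load(Ordering::Acquire);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + delta,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(prev) => return prev + delta,
            // Lost the race or failed spuriously: retry with the fresh value.
            Err(actual) => current = actual,
        }
    }
}

/// Runs 8 threads that all try to grow the same table and bump one counter.
fn run_demo() -> (usize, usize, usize) {
    let table = Arc::new(Table::new(16));
    let hits = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let table = Arc::clone(&table);
            let hits = Arc::clone(&hits);
            thread::spawn(move || {
                table.grow_if_needed(16);
                for _ in 0..1000 {
                    cas_add(&hits, 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    (
        table.capacity.load(Ordering::Acquire),
        table.rehash_count.load(Ordering::Relaxed),
        hits.load(Ordering::Acquire),
    )
}

fn main() {
    let (capacity, rehashes, hits) = run_demo();
    println!("capacity={} rehashes={} hits={}", capacity, rehashes, hits);
}
```

Even though all eight threads ask to grow from the same observed capacity, the re-check under the lock means only the first one performs the rehash. The others return without touching the table.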
Hi @SimonSapin,
Thank you for the information.
No. I do not think so, unfortunately.
I have another Moka repository here and it has crossbeam-epoch upgraded to v0.9.9:
I ran the same test on Moka with both crossbeam-epoch v0.8.2 and v0.9.9, and found that Moka with crossbeam-epoch v0.9.9 still has the same issue.
Moka with crossbeam-epoch v0.9.9
Had segfault four times in about four hours.
Moka with crossbeam-epoch v0.8.2
Had segfault three times in about four hours.
NOTE: To make segfaults occur more often, I used a modified Moka that sets the number of moka::cht::HashMap segments to 1. (The release versions have it set to 64.)
Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka's dependency to v0.9.9 once I feel that v0.9.9 will not increase the chance of segfaults.
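To illustrate why the NOTE above reduces the segment count to 1, here is a small std-only sketch of how keys spread over segments. The helper names `segment_index` and `segments_touched` are hypothetical, and the plain modulo mapping is an assumption for illustration, not necessarily how moka::cht picks a segment.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Picks a segment for `key`. Plain modulo is an assumption made for this
/// sketch; the real table may derive the index differently.
fn segment_index<K: Hash>(key: &K, num_segments: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_segments
}

/// Counts how many distinct segments the keys 0..num_keys land in.
fn segments_touched(num_keys: u64, num_segments: usize) -> usize {
    (0..num_keys)
        .map(|key| segment_index(&key, num_segments))
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    // With one segment, every key contends for the same inner table.
    println!("1 segment:   {}", segments_touched(10_000, 1));
    // With 64 segments, the keys are spread over many inner tables.
    println!("64 segments: {}", segments_touched(10_000, 64));
}
```

With a single segment, every thread operates on the same inner table, so concurrent rehashes of that table become much more likely. Spreading keys over 64 segments lowers that chance but does not remove it.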
I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they have any fixes for memory safety issues. I am reviewing Moka's and their source code when I have time. I hope I can isolate the code causing the issue.
FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for the next patch release, Moka v0.8.7.
As I wrote in the PR, I will run some mokabench tests before merging it. I will be able to run mokabench for 6 hours a day (during the night), so if everything goes well, the tests will complete in 4 days (24 hours in total).