moka: Segmentation faults in moka-cht under heavy workloads on a many-core machine

I have seen segmentation faults a few times while running mokabench on Moka v0.5.1. They seem to happen randomly while the get_or_insert_with method is being called heavily and concurrently from many threads.

+ ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once
Cache, Max Capacity, Clients, Inserts, Reads, Hit Rate, Duration Secs
Moka Unsync Cache, 100000, -, 14696832, 31104534, 52.750, 8.575
Moka Cache, 100000, 16, 15550290, 31954711, 51.336, 17.365
Moka Cache, 100000, 24, 15543954, 31948375, 51.347, 17.743
Moka Cache, 100000, 32, 15527876, 31932297, 51.373, 17.877
./run-tests.sh: line 36: 21740 Segmentation fault      (core dumped) ./target/release/mokabench --enable-invalidate-entries-if --enable-insert-once

I am using Amazon EC2 to run mokabench. After spending a few days on this, I found that it is related to the version of crossbeam-epoch and the number of CPU cores.

Segfaults? | Moka   | cht/moka-cht    | crossbeam-epoch | EC2 Instance Type | Arch   | vCPUs | OS
Yes        | v0.5.1 | moka-cht v0.5.0 | v0.9.5          | c5.9xlarge        | x86_64 | 36    | Amazon Linux 2
No         | v0.5.1 | cht v0.4.1      | v0.8.2          | c5.9xlarge        | x86_64 | 36    | Amazon Linux 2
No         | v0.5.1 | moka-cht v0.5.0 | v0.9.5          | c5.4xlarge        | x86_64 | 16    | Amazon Linux 2

crossbeam-epoch is used by moka-cht, the concurrent hash table used by Moka.

I examined stack traces from the core dumps and found two patterns. I have not identified the root cause yet. Perhaps a crossbeam_epoch::Owned<T>, which is very similar to Box<T>, stored in moka-cht became a dangling pointer for some reason?

Pattern 1: At Arc::ne()
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
2095	/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs: No such file or directory.
[Current thread is 1 (Thread 0x7fe61d1e8700 (LWP 7009))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /data/core-dumps/mokabench-copy/target/release/mokabench.
Use `info auto-load python-scripts [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0  0x000055cd7249862e in <alloc::sync::Arc<T> as alloc::sync::ArcEqIdent<T>>::ne ()
    at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2095
#1  <alloc::sync::Arc<T> as core::cmp::PartialEq>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:2141
#2  core::cmp::impls::<impl core::cmp::PartialEq<&B> for &A>::ne () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/cmp.rs:1356
#3  moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify::{{closure}} ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:255
#4  moka_cht::map::bucket::BucketArray<K,V>::probe_loop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:367
#5  moka_cht::map::bucket::BucketArray<K,V>::insert_or_modify ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:248
#6  0x000055cd72476961 in moka_cht::map::bucket_array_ref::BucketArrayRef<K,V,S>::insert_with_or_modify_entry_and ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket_array_ref.rs:191
#7  0x000055cd7248d19a in moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify_entry_and ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:933
#8  moka_cht::segment::map::HashMap<K,V,S>::insert_with_or_modify ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:798
#9  moka::sync::value_initializer::ValueInitializer<K,V,S>::try_insert_waiter ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:108
#10 0x000055cd7248cdf8 in moka::sync::value_initializer::ValueInitializer<K,V,S>::init_or_read ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/value_initializer.rs:42
#11 0x000055cd72492f74 in moka::sync::cache::Cache<K,V,S>::get_or_insert_with_hash_and_fun ()
    at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:277
#12 moka::sync::cache::Cache<K,V,S>::get_or_insert_with () at /home/ec2-user/.cargo/git/checkouts/moka-6ea430727379b61e/1bf28ed/src/sync/cache.rs:264
#13 0x000055cd7248f90d in mokabench::cache::sync_cache::SyncCache::get_or_insert_with () at src/cache/sync_cache.rs:43
#14 <mokabench::cache::sync_cache::SyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once ()
    at src/cache/sync_cache.rs:79
#15 0x000055cd7246eb87 in <mokabench::cache::sync_cache::SharedSyncCache as mokabench::cache::CacheSet<mokabench::parser::ArcTraceEntry>>::get_or_insert_once
    () at src/cache/sync_cache.rs:125
#16 mokabench::process_commands () at src/lib.rs:107
...
Pattern 2: At atomic_sub() in Arc::drop()
Program terminated with signal SIGSEGV, Segmentation fault.
#0  core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
2401	/rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs: No such file or directory.
[Current thread is 1 (Thread 0x7f6e0f9b2900 (LWP 32108))]
Missing separate debuginfos, use: debuginfo-install glibc-2.26-48.amzn2.x86_64 libgcc-7.3.1-13.amzn2.x86_64
(gdb) bt
#0  core::sync::atomic::atomic_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:2401
#1  core::sync::atomic::AtomicUsize::fetch_sub () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/sync/atomic.rs:1769
#2  <alloc::sync::Arc<T> as core::ops::drop::Drop>::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1558
#3  core::ptr::drop_in_place<alloc::sync::Arc<usize>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#4  core::ptr::drop_in_place<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#5  core::ptr::drop_in_place<alloc::boxed::Box<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#6  core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#7  <T as crossbeam_epoch::atomic::Pointable>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:212
#8  <crossbeam_epoch::atomic::Owned<T> as core::ops::drop::Drop>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/atomic.rs:1087
#9  core::ptr::drop_in_place<crossbeam_epoch::atomic::Owned<moka_cht::map::bucket::Bucket<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#10 core::mem::drop () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/mem/mod.rs:889
#11 moka_cht::map::bucket::defer_acquire_destroy::{{closure}} ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:684
#12 crossbeam_epoch::guard::Guard::defer_unchecked ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-epoch-0.9.5/src/guard.rs:195
#13 moka_cht::map::bucket::defer_acquire_destroy () at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/map/bucket.rs:682
#14 <moka_cht::segment::map::HashMap<K,V,S> as core::ops::drop::Drop>::drop ()
    at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/moka-cht-0.5.0/src/segment/map.rs:1032
#15 0x000055db206daf73 in core::ptr::drop_in_place<moka_cht::segment::map::HashMap<alloc::sync::Arc<usize>,alloc::sync::Arc<async_lock::rwlock::RwLock<core::option::Option<core::result::Result<alloc::sync::Arc<alloc::boxed::Box<[u8]>>,alloc::sync::Arc<alloc::boxed::Box<dyn std::error::Error+core::marker::Send+core::marker::Sync>>>>>>>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#16 core::ptr::drop_in_place<moka::future::value_initializer::ValueInitializer<usize,alloc::sync::Arc<alloc::boxed::Box<[u8]>>,std::collections::hash::map::RandomState>> () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ptr/mod.rs:192
#17 alloc::sync::Arc<T>::drop_slow () at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/alloc/src/sync.rs:1051
#18 0x000055db206ea837 in mokabench::run_multi_tasks::{{closure}} () at /home/ec2-user/mokabench/src/lib.rs:314

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

Finally, I believe I fixed this issue via #157.

Last week, I got a new x86_64-based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot in reproducing and investigating the issue. I found the cause last night and fixed it. Since the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).

The cause was race conditions that occur when many threads concurrently rehash (extend or shrink) the internal hash table moka::cht. The creator of the original cht designed it to work in this situation, but it does not work as expected. So I added a lock to ensure that only one thread can participate in rehashing at a time. This actually increased performance in my load tests because it prevents heavy retries on the atomic CAS operation compare_exchange_weak.
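To illustrate the idea of letting only one thread rehash at a time, here is a minimal sketch using only the standard library. It is not the code from #157: the `Segment` type, its fields, and `grow_to` are hypothetical names, and the real table stores a pointer to a bucket array rather than a bare capacity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for one hash table segment (not moka::cht's real type).
struct Segment {
    /// Held only while a rehash is in progress.
    rehash_lock: Mutex<()>,
    /// Current capacity; stands in for the pointer to the bucket array.
    capacity: AtomicUsize,
}

impl Segment {
    fn new(capacity: usize) -> Self {
        Segment {
            rehash_lock: Mutex::new(()),
            capacity: AtomicUsize::new(capacity),
        }
    }

    /// Grows the segment to at least `min_capacity`.
    /// Returns true if the calling thread performed the rehash.
    fn grow_to(&self, min_capacity: usize) -> bool {
        // Fast path: the table is already large enough, no lock needed.
        if self.capacity.load(Ordering::Acquire) >= min_capacity {
            return false;
        }
        // Slow path: serialize rehashing behind the lock.
        let _guard = self.rehash_lock.lock().unwrap();
        // Re-check under the lock: another thread may have finished the rehash
        // while this one was waiting.
        let current = self.capacity.load(Ordering::Acquire);
        if current >= min_capacity {
            return false;
        }
        let mut new_capacity = current.max(1);
        while new_capacity < min_capacity {
            new_capacity *= 2;
        }
        // Copying entries into the new bucket array is elided in this sketch.
        self.capacity.store(new_capacity, Ordering::Release);
        true
    }
}

/// Spawns `threads` threads that all request the same growth and returns
/// (final capacity, number of threads that actually rehashed).
fn run_demo(threads: usize) -> (usize, usize) {
    let segment = Arc::new(Segment::new(16));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let segment = Arc::clone(&segment);
            thread::spawn(move || segment.grow_to(64))
        })
        .collect();
    let winners = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&did_rehash| did_rehash)
        .count();
    (segment.capacity.load(Ordering::Acquire), winners)
}

fn main() {
    let (capacity, winners) = run_demo(8);
    println!("capacity = {}, threads that rehashed = {}", capacity, winners);
}
```

Threads that lose the race wait on the lock and then return immediately after the re-check, instead of repeatedly retrying a CAS against a table that other threads are also trying to replace.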

I also found that the memory ordering used for compare_exchange_weak is too weak for non-x86 platforms and may cause inconsistency between threads. So I changed it to one that I believe is strong enough.
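As a sketch of why the ordering matters, the example below publishes a heap-allocated value through an `AtomicPtr` with `compare_exchange_weak`. The `AcqRel`/`Acquire` pair is an assumption chosen for illustration; the comment above does not say which orderings #157 actually uses, and `publish`, `read`, and `run_demo` are hypothetical helpers. x86 hardware orders memory relatively strongly, so an ordering that is too weak can appear to work there and still fail on platforms such as arm64.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;
use std::thread;

/// Tries to install `value` into an empty slot. Returns true on success.
fn publish(slot: &AtomicPtr<u64>, value: u64) -> bool {
    let new = Box::into_raw(Box::new(value));
    loop {
        match slot.compare_exchange_weak(
            ptr::null_mut(),
            new,
            // Success: Release makes the Box contents visible to any thread
            // that later loads this pointer with Acquire.
            Ordering::AcqRel,
            // Failure: Acquire, because we may inspect what another thread published.
            Ordering::Acquire,
        ) {
            Ok(_) => return true,
            // The weak variant may fail spuriously while the slot is still empty.
            Err(current) if current.is_null() => continue,
            Err(_) => {
                // Another thread won the race; reclaim our unused allocation.
                unsafe { drop(Box::from_raw(new)) };
                return false;
            }
        }
    }
}

/// Reads the published value, if any.
fn read(slot: &AtomicPtr<u64>) -> Option<u64> {
    let p = slot.load(Ordering::Acquire);
    if p.is_null() {
        None
    } else {
        // Safe in this sketch because the value is never freed while readers run.
        Some(unsafe { *p })
    }
}

/// Returns (number of threads that published, the value that ended up in the slot).
fn run_demo(threads: u64) -> (usize, Option<u64>) {
    let slot = Arc::new(AtomicPtr::<u64>::new(ptr::null_mut()));
    let handles: Vec<_> = (1..=threads)
        .map(|i| {
            let slot = Arc::clone(&slot);
            thread::spawn(move || publish(&slot, i))
        })
        .collect();
    let winners = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&won| won)
        .count();
    let value = read(&slot);
    // Free the published allocation now that all threads have finished.
    let p = slot.swap(ptr::null_mut(), Ordering::AcqRel);
    if !p.is_null() {
        unsafe { drop(Box::from_raw(p)) };
    }
    (winners, value)
}

fn main() {
    let (winners, value) = run_demo(8);
    println!("winners = {}, value = {:?}", winners, value);
}
```

If the successful exchange used `Relaxed` instead, a reader on a weakly ordered CPU could in principle observe the new pointer before the write that initialized the pointee.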

#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).

Hi @SimonSapin,

> crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by https://github.com/advisories/GHSA-qc84-gqf4-9926

Thank you for the information.

> Is the workaround in #129 to upgrade moka’s dependency of crossbeam-epoch?

No. I do not think so, unfortunately.

I have another Moka repository with crossbeam-epoch upgraded to v0.9.9, and I ran the same test on Moka with both crossbeam-epoch v0.8.2 and v0.9.9. I found that Moka with crossbeam-epoch v0.9.9 still has the same issue.

Moka with crossbeam-epoch v0.9.9

It segfaulted four times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0618.log 
271:./run-tests-insert-once.sh: line 26: 94446 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once
283:./run-tests-insert-once.sh: line 30: 94453 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0619A.log
243:./run-tests-insert-once.sh: line 18: 99154 Segmentation fault: 11  ./target/release/mokabench --insert-once --size-aware
326:./run-tests-insert-once.sh: line 30: 99301 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ cat epoch09-2022-0618.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.9.9
│   │   ├── cfg-if v1.0.0
│   │   ├── crossbeam-utils v0.8.9 (*)

Moka with crossbeam-epoch v0.8.2

It segfaulted three times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619.log 
349:./run-tests-insert-once.sh: line 26: 95369 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619B.log
339:./run-tests-insert-once.sh: line 30:   478 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once
385:./run-tests-insert-once.sh: line 38:   536 Segmentation fault: 11  ./target/release/mokabench --ttl 3 --tti 1 --invalidate --insert-once --size-aware

$ cat epoch08-2022-0619.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.8.2
│   │   ├── cfg-if v0.1.10
│   │   ├── crossbeam-utils v0.7.2

NOTE: To make segfaults occur more often, I used a modified Moka that sets the number of moka::cht::HashMap segments to 1. (The release versions set it to 64.)
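A rough sketch of why a single segment raises contention: in a segmented hash map, each key is routed to one segment by its hash, so with one segment every thread operates on the same bucket array. The routing below (masking the low bits of the hash) and the function names are assumptions for illustration; the real mapping in moka::cht may differ.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical routing of a key to a segment by the low bits of its hash.
fn segment_index<K: Hash>(build_hasher: &RandomState, key: &K, num_segments: usize) -> usize {
    assert!(num_segments.is_power_of_two());
    let mut hasher = build_hasher.build_hasher();
    key.hash(&mut hasher);
    (hasher.finish() as usize) & (num_segments - 1)
}

/// Counts how many distinct segments the keys 0..num_keys are routed to.
fn segments_touched(num_keys: u64, num_segments: usize) -> usize {
    let build_hasher = RandomState::new();
    let mut used = vec![false; num_segments];
    for key in 0..num_keys {
        used[segment_index(&build_hasher, &key, num_segments)] = true;
    }
    used.iter().filter(|&&u| u).count()
}

fn main() {
    // With 1 segment, every key (and so every thread) lands on the same segment.
    println!("1 segment:   {} touched", segments_touched(10_000, 1));
    // With 64 segments, the same keys are spread across many segments.
    println!("64 segments: {} touched", segments_touched(10_000, 64));
}
```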

Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka’s dependency to v0.9.9 once I am confident that v0.9.9 does not increase the chance of segfaults.

I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they include any fixes for memory safety issues. I am reviewing the source code of Moka and of those crates when I have time. I hope I can isolate the code causing the issue.

FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for the next patch release, Moka v0.8.7.

As I wrote in the PR, I will run some mokabench tests before merging it. I can run mokabench for 6 hours a day (at night), so if everything goes well, the tests will complete in 4 days (24 hours in total).