regex: sharing one regex across many threads can lead to big slowdowns due to mutex contention
To reproduce, create a `Cargo.toml` with this:

```toml
[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"
```
And in the same directory, create a `main.rs` containing:

```rust
use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an
            // entirely fresh pool from which to get mutable scratch space for
            // each search. Basically, a 'Regex' internally looks like this:
            //
            // struct Regex {
            //     // Among other things, this contains the literal
            //     // prefilters and the Thompson VM bytecode
            //     // instructions.
            //     read_only: Arc<ReadOnly>,
            //     // Contains space used by the regex matcher
            //     // during search time. e.g., The DFA transition
            //     // table for the lazy DFA or the set of active
            //     // threads for the Thompson NFA simulation.
            //     pool: Pool<ScratchSpace>,
            // }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool. That
            // thread gets access to a special value ONLY accessible to that
            // thread, whereas all other threads accessing the pool get sent
            // through the "slow" path via a mutex. When a lot of threads
            // share the same regex **with the same pool**, this mutex comes
            // under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //   is "first" thread optimization active?
            //     NO: mutex lock
            //         pop pointer out of the pool
            //         mutex unlock
            //   do a search
            //   is "first" thread optimization active?
            //     NO: mutex lock
            //         push pointer back into pool
            //         mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e.,
            // when the haystack is tiny, as in this case, the mutex
            // contention ends up dominating the runtime. As the number of
            // threads increases, the contention gets worse and worse and
            // thus runtime blows up.
            //
            // But all of that contention can be avoided by giving each
            // thread a fresh regex, so that each one gets its own pool and
            // each thread gets the "first" thread optimization applied. So
            // the internal access for the mutable scratch space now looks
            // like this:
            //
            //   is "first" thread optimization active?
            //     YES: return pointer to special mutable scratch space
            //   do a search
            //   is "first" thread optimization active?
            //     YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues.
            // Specifically, its memory usage scaled with the number of
            // active threads using a regex, whereas the current approach
            // scales with the number of active threads *simultaneously*
            // using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero** competing
            // resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware
            // of threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc but then share it across all
        // threads. Each thread in turn competes with the single regex's
        // shared memory pool for mutable scratch space to use during a
        // search. This is what ultimately caused this 'shared' benchmark
        // to be much slower than the 'cloned' benchmark when run with many
        // threads. Indeed, profiling it reveals that most of the time is
        // spent in the regex internal 'Pool' type's 'get' and 'get_slow'
        // methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}
fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}
```
Now build and run the benchmark:

```
$ cargo build --release
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
As noted in the comments in the code above, the only difference between these two benchmarks is that `cloned` creates a fresh `Regex` for each thread, whereas `shared` uses the same `Regex` across multiple threads. This leads to contention on a single `Mutex` in the latter case and overall very poor performance. (See the code comments above for more info.)
We can confirm this by looking at a profile. Here is a screenshot from perf for the `cloned` benchmark:

[perf screenshot for the `cloned` benchmark]

And now for the `shared` benchmark:

[perf screenshot for the `shared` benchmark]
As we can see, in the shared benchmark, virtually all of the time is being spent locking and unlocking the mutex.
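To make that concrete, here is a minimal sketch of the slow path described in the code comments above. It is not the regex crate's actual implementation (the real `Pool` also reserves a value for the first "owner" thread, which skips the mutex entirely); it only shows the shape of the path that every other thread takes on each search:

```rust
use std::sync::Mutex;

// Hypothetical simplification of an internal scratch-space pool: a stack
// of values guarded by a single mutex.
struct Pool<T> {
    stack: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn get(&self) -> T {
        // mutex lock -> pop a value out of the pool -> mutex unlock.
        // When the search itself is very fast (tiny haystacks), this
        // lock/unlock pair is where all the time goes under contention.
        self.stack.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    fn put(&self, value: T) {
        // mutex lock -> push the value back -> mutex unlock.
        self.stack.lock().unwrap().push(value);
    }
}
```

Every `is_match` call in the `shared` benchmark performs one `get` and one `put`, so 16 threads doing 100,000 tiny searches each translates into millions of lock/unlock pairs on a single mutex.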
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 50 (34 by maintainers)
Links to this issue
Commits related to this issue
- automata: reduce regex contention somewhat > **Context:** A `Regex` uses internal mutable space (called a `Cache`) > while executing a search. Since a `Regex` really wants to be easily > shared acros... — committed to rust-lang/regex by BurntSushi 10 months ago
I keep seeing this pop up in various contexts. I'm going to take the conn here and try to work on the idea proposed above. I've seen way too many folks unfortunately contorting themselves to work around this.
@BurntSushi:

> A single stack shared across all threads!
Thanks so much for the clear feedback! If `AtomicUsize` is enough bits for crossbeam, then it's probably enough bits for us too, and that avoids the dependency on `portable-atomic` that I used for the 128-bit cmpxchg. I do not have enough experience with atomics to expect sizable perf improvements from this approach given the success we've just had here, but someone else who's more familiar with this sort of thing might be able to take the harness I created (see https://github.com/rust-lang/regex/issues/934#issuecomment-1703299897) and improve it.
It’s worth experimenting with, but after re-measuring @Shnatsel’s cloning optimization with a better benchmark, I’m probably going to move on. The contention problem isn’t fully solved, but it’s in a much more palatable state.
But I definitely encourage others to keep working on this. I'd be happy to merge simpler patches. (My general prior is that tricky `unsafe` should come with sizeable perf improvements.)

Hmmm, just a note: I think false sharing may indeed be relevant for the lock-free stack approach. As I increase the alignment of individual linked list entries in the lock-free stack (each of which is allocated in a `Box`), the performance improves slightly, but reliably. With this change (https://github.com/cosmicexplorer/minimal-lock-free-stack/commit/be269d8cc0ea264d5c1629e56d4f560e23cb3f17), we go from 90x -> 80x slower, or 500ms to 470ms (see https://github.com/rust-lang/regex/issues/934#issuecomment-1703299897 to compare to prior results).

I'm having trouble with the bit twiddling necessary to use 64-bit atomics like crossbeam does, and will probably give up, but if the above behavior looks suspicious to anyone, please feel free to follow up.
OK, so it turns out that false sharing appears to be the primary culprit here. Using multiple stacks and ensuring each mutex is in its own cache line leads to considerable improvement. For example, if I try @leviska's approach as-is, then I get:

But if I add a `repr(C, align(64))` wrapper struct around its `TryMutex`, then I get:

But it turns out I'm able to do as well (even a little better) by sticking with a simple mutex strategy combined with creating new values if there's contention (i.e., when `Mutex::try_lock` fails); see the sketch below:

(I also tried experimenting with an optimization idea from @Shnatsel where, instead of creating a fresh `Cache` when under high contention, we instead clone an existing one. This has the benefit of potentially reusing some parts of the lazy DFA's transition table that had been previously computed. But in all my attempts at this, the timings were actually a little slower. This is somewhat surprising, but my best guess is that generating a fresh cache is perhaps not as expensive as I thought, and this optimization tends to add more pressure on a single atomic? Not sure.)
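For illustration, a minimal sketch of those two ideas (one cache-line-padded mutex per shard, plus create-on-contention via `try_lock`) might look like this. The type names and the `create` function are illustrative, not the actual patch:

```rust
use std::sync::Mutex;

// Pad a value out to its own cache line so two mutexes never share one.
// 64 bytes is an assumption (typical for x86_64).
#[repr(C, align(64))]
struct CacheLinePadded<T>(T);

// A pool of scratch values that refuses to wait on a contended lock:
// if `try_lock` fails, it simply creates a fresh value instead.
struct ShardedPool<T> {
    shards: Vec<CacheLinePadded<Mutex<Vec<T>>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(shards: usize, create: fn() -> T) -> ShardedPool<T> {
        let shards = (0..shards)
            .map(|_| CacheLinePadded(Mutex::new(Vec::new())))
            .collect();
        ShardedPool { shards, create }
    }

    fn get(&self, shard: usize) -> T {
        match self.shards[shard % self.shards.len()].0.try_lock() {
            // Uncontended: reuse a pooled value if there is one.
            Ok(mut stack) => stack.pop().unwrap_or_else(|| (self.create)()),
            // Contended: don't wait, just build a fresh value.
            Err(_) => (self.create)(),
        }
    }

    fn put(&self, shard: usize, value: T) {
        // If the lock is busy, drop the value rather than block.
        if let Ok(mut stack) = self.shards[shard % self.shards.len()].0.try_lock() {
            stack.push(value);
        }
    }
}
```

The shard index could come from a per-thread counter; the point is only that a contended `try_lock` never blocks a search.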
@cosmicexplorer: Yeah, I have no doubt that it isn't so complex if you use dependencies to either do it for you or give you access to the requisite primitives (i.e., 128-bit atomics in this case). I really cannot personally justify taking a dependency for this, so it would mean re-rolling a lot of that soup. Not impossible, and it can be done, but annoying.

And yeah, interesting how much slower it is. Did you use a single stack shared across all threads, or 8 sharded stacks? If the former, my guess based on my run-in with false sharing is exactly that…
Yeah I explicitly decided long ago not to go down this route. The lazy DFA isn’t the only thing that needs a mutable cache. And IIRC, RE2’s lazy DFA is not lock free. I’m pretty sure there are mutexes inside of there. The downside of RE2’s approach is that your lazy DFA becomes a lot more complicated IMO. As do the mutable scratch types for other engines (such as the PikeVM). And I don’t actually know whether RE2 suffers from contention issues or not. Possibly not.
Anyway, given what’s in #1080 now (just updated), we’ve gone from 215x slower for 64 threads:
To 6x slower:
That’s good enough of a win for me. I just hope it doesn’t introduce other unforeseen problems. Unfortunately, it can be somewhat difficult to observe them. (For example, excessive memory usage.)
It is not for the faint of heart. If you ever do get interested, you can hmu on twitter (@/drawsmiguel, priv rn but feel free to request) or just poke junyer and ask for Miguel.
You pay a dynamic cross-module call usually. If you want to bound the number of indices, I suggest picking a power of two. If you’re feeling spicy you can do a two-level radix tree (of, say, eight entries each level) where the outer level is allocated eagerly and the inner levels are allocated on-demand and initialized racily (as I do in my post).
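For a concrete picture of the racy initialization being described, here is a hedged sketch (my own types and names, not code from any post or crate) of a two-level table with an eagerly allocated outer level and on-demand, racily installed inner levels:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Mutex;

const FANOUT: usize = 8;

// Inner level: a fixed block of slots.
struct Inner<T> {
    slots: [Mutex<Option<T>>; FANOUT],
}

impl<T> Default for Inner<T> {
    fn default() -> Inner<T> {
        Inner { slots: std::array::from_fn(|_| Mutex::new(None)) }
    }
}

struct RadixTable<T> {
    // Outer level: allocated eagerly, fixed size.
    outer: [AtomicPtr<Inner<T>>; FANOUT],
}

impl<T> RadixTable<T> {
    fn new() -> RadixTable<T> {
        RadixTable { outer: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())) }
    }

    // Look up the slot for `index` (assumed < FANOUT * FANOUT), allocating
    // the inner level on demand. Initialization is "racy": two threads may
    // both allocate, but only one wins; the loser frees its allocation.
    fn slot(&self, index: usize) -> &Mutex<Option<T>> {
        let (hi, lo) = (index / FANOUT, index % FANOUT);
        let cell = &self.outer[hi];
        let mut p = cell.load(Ordering::Acquire);
        if p.is_null() {
            let fresh = Box::into_raw(Box::new(Inner::default()));
            match cell.compare_exchange(
                ptr::null_mut(),
                fresh,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => p = fresh,
                Err(winner) => {
                    // Lost the race: free ours, use the winner's.
                    unsafe { drop(Box::from_raw(fresh)) };
                    p = winner;
                }
            }
        }
        // Safety: once installed, an inner level is never removed, so the
        // pointer stays valid for the table's lifetime. (A real version
        // would free the inner levels in Drop; this sketch leaks them.)
        unsafe { &(*p).slots[lo] }
    }
}
```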
Also, junyer randomly shared https://www.usenix.org/system/files/atc19-dice.pdf which feels like it has relevant ideas. I haven't read it fully, but a global table keyed on `thread_id ^ lock_addr` is very interesting.

Spinlocks are basically impossible to do well in user space. See: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html
The non-std version of a `Pool` uses a spin lock because there just isn't a better choice (short of a lock-free stack, but as mentioned, such things are difficult to implement correctly without a GC). Using a spin lock with std enabled would be a cure worse than the disease, even if it were faster. Besides, a mutex is likely already spinning to some degree before yielding to an OS primitive.

Fair enough. Apologies. I should have immediately explained myself.
AFAIK from what you've described, your mental model of the issue is mostly correct, or at least it matches what I've seen. Basically: run on a hugely multicore machine, saturate every hardware thread (I have 64 hardware threads), and do a lot of (small) regex matches.

I can't really share the code of my exact case, but I actually might have a reasonable real-world test case for you, as it just happens that recently during the weekend I encountered this issue in one of the crates in the wild and put up a PR optimizing it.
src/main.rs:

Here are the numbers when not using `regex` for the hot path (so there's no lock contention happening):

And just for reference, here are the numbers when running on a single thread (basically just replace `into_par_iter` with `into_iter`):

So when using `regex`, jumping from 1 thread to 64 threads the program only got 1m39s faster (~54% of what it was) but wasted over an hour and a half of CPU time. When not using `regex`, jumping from 1 thread to 64 threads the program got 3m30s faster (~2% of what it was) and wasted less than two minutes of CPU time. When running single-threaded, we can see that my changes didn't affect the non-contended case too much.

Of course, these are the results on my machine (Threadripper 3970x, 256GB of 2667 MT/s RAM); your mileage may vary.
And here’s the commit which got rid of
regexusage so that you can see for what exactly that crate is usingregex: https://github.com/koute/lingua-rs/commit/304cb5a113ddbb968fab2ef45e41b686e42e5aa8You could argue that maybe it’s not an appropriate use of
regex, but nevertheless this is what I’ve encountered in the wild.It’s technically an option, but not one I’d ever sign off on personally. The regex API is essentially duplicated for
&strand&[u8]. Duplicating each of those again just to accept scratch space to work around a problem that one can already work around is not worth it.But, yes, I do plan on exposing such an API! Just not in
regex, but rather, inregex-automata. You can see what it might look like here: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/meta/struct.Regex.html#method.try_find (Excuse the lack of docs. I’m actively working on that component. You’ll find more docs on the lower level regex engines. For example: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/hybrid/regex/struct.Regex.html#method.findI’m also planning on exposing the underlying
Pooltype inregex-automatatoo: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/util/pool/struct.Pool.htmlAlthough you will have to do some kind of synchronization if you’re calling a regex from multiple threads and watch to share the same mutable scratch space across multiple threads. And if you’re not running regexes simultaneously then there won’t be any contention and you won’t be bitten by this issue in the first place.
The `regex-automata` plans are tracked by #656.

Right. The `thread_local` crate will give you speed, as long as you can deal with its memory profile. (AFAIK, its memory profile is only an issue when you have lots of threads. If you keep your thread pool limited to some reasonable number, then I don't think you can run into any problems. Problems only tended to appear in managed environments, e.g., using the regex crate from a Go program.)
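For readers who want to try that route, a per-thread cloning wrapper built on the `thread_local` crate might look like the hedged sketch below. The wrapper type and its name are mine, not an API from either crate, and its memory scales with the number of threads that ever touch it, per the caveat above:

```rust
use regex::Regex;
use thread_local::ThreadLocal; // the `thread_local` crate discussed above

// Hypothetical wrapper: each thread lazily clones the Regex on first use,
// so every thread always hits its own pool's "first thread" fast path.
struct PerThreadRegex {
    re: Regex,
    cache: ThreadLocal<Regex>,
}

impl PerThreadRegex {
    fn new(re: Regex) -> PerThreadRegex {
        PerThreadRegex { re, cache: ThreadLocal::new() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        // First call on each thread clones the Regex (cheap: the compiled
        // program is behind an Arc); later calls reuse that thread's clone.
        self.cache.get_or(|| self.re.clone()).is_match(haystack)
    }
}
```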