regex: sharing one regex across many threads can lead to big slowdowns due to mutex contention
To reproduce, create a `Cargo.toml` with this:

```toml
[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"
```
And in the same directory, create a `main.rs` containing:

```rust
use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an
            // entirely fresh pool from which to get mutable scratch space for
            // each search. Basically, a 'Regex' internally looks like this:
            //
            // struct Regex {
            //     // Among other things, this contains the literal
            //     // prefilters and the Thompson VM bytecode
            //     // instructions.
            //     read_only: Arc<ReadOnly>,
            //     // Contains space used by the regex matcher
            //     // during search time. e.g., The DFA transition
            //     // table for the lazy DFA or the set of active
            //     // threads for the Thompson NFA simulation.
            //     pool: Pool<ScratchSpace>,
            // }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool. That
            // thread gets access to a special value ONLY accessible to that
            // thread, whereas all other threads accessing the pool get sent
            // through the "slow" path via a mutex. When a lot of threads
            // share the same regex **with the same pool**, this mutex comes
            // under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //   is "first" thread optimization active?
            //     NO: mutex lock
            //         pop pointer out of the pool
            //         mutex unlock
            //   do a search
            //   is "first" thread optimization active?
            //     NO: mutex lock
            //         push pointer back into pool
            //         mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e.,
            // when the haystack is tiny, as in this case, the mutex
            // contention ends up dominating the runtime. As the number of
            // threads increases, the contention gets worse and worse and
            // thus runtime blows up.
            //
            // But all of that contention can be avoided by giving each
            // thread a fresh regex, so that each one gets its own pool and
            // each thread gets the "first" thread optimization applied. So
            // the internal access for the mutable scratch space now looks
            // like this:
            //
            //   is "first" thread optimization active?
            //     YES: return pointer to special mutable scratch space
            //   do a search
            //   is "first" thread optimization active?
            //     YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues.
            // Specifically, its memory usage scaled with the number of
            // active threads using a regex, whereas the current approach
            // scales with the number of active threads *simultaneously*
            // using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero** competing
            // resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware
            // of threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc but then share it across all
        // threads. Each thread in turn competes with the single regex's
        // shared memory pool for mutable scratch space to use during a
        // search. This is what ultimately caused this 'shared' benchmark
        // to be much slower than the 'cloned' benchmark when run with many
        // threads. Indeed, profiling it reveals that most of the time is
        // spent in the regex internal 'Pool' type's 'get' and 'get_slow'
        // methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}
fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}
```
Now build and run the benchmark:

```
$ cargo build --release
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
As noted in the comments in the code above, the only difference between these two benchmarks is that `cloned` creates a fresh `Regex` for each thread, whereas `shared` uses the same `Regex` across multiple threads. This leads to contention on a single `Mutex` in the latter case and overall very poor performance. (See the code comments above for more info.)
We can confirm this by looking at a profile. Here is a screenshot from perf for the `cloned` benchmark:

[perf screenshot for the `cloned` benchmark]

And now for the `shared` benchmark:

[perf screenshot for the `shared` benchmark]
As we can see, in the shared benchmark, virtually all of the time is being spent locking and unlocking the mutex.
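To make that concrete, here is a minimal sketch of the slow path described in the code comments above. It is not the regex crate's actual implementation (the real `Pool` also reserves a value for the first "owner" thread, which skips the mutex entirely); it only shows the shape of the path that every other thread takes on each search:

```rust
use std::sync::Mutex;

// Hypothetical simplification of an internal scratch-space pool: a stack
// of values guarded by a single mutex.
struct Pool<T> {
    stack: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn get(&self) -> T {
        // mutex lock -> pop a value out of the pool -> mutex unlock.
        // When the search itself is very fast (tiny haystacks), this
        // lock/unlock pair is where all the time goes under contention.
        self.stack.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    fn put(&self, value: T) {
        // mutex lock -> push the value back -> mutex unlock.
        self.stack.lock().unwrap().push(value);
    }
}
```

Every `is_match` call in the `shared` benchmark performs one `get` and one `put`, so 16 threads doing 100,000 tiny searches each translates into millions of lock/unlock pairs on a single mutex.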
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 50 (34 by maintainers)
Links to this issue
Commits related to this issue
- automata: reduce regex contention somewhat > **Context:** A `Regex` uses internal mutable space (called a `Cache`) > while executing a search. Since a `Regex` really wants to be easily > shared acros... — committed to rust-lang/regex by BurntSushi 10 months ago
I keep seeing this pop up in various contexts. I'm going to take the conn here and try to work on the idea proposed above. I've seen way too many folks unfortunately contorting themselves to work around this.
@BurntSushi:

> A single stack shared across all threads!
Thanks so much for the clear feedback! If `AtomicUsize` is enough bits for crossbeam, then it's probably enough bits for us too, and that avoids the dependency on `portable-atomic` that I used for the 128-bit cmpxchg. I do not have enough experience with atomics to expect sizable perf improvements from this approach given the success we've just had here, but someone else who's more familiar with this sort of thing might be able to take the harness I created (see https://github.com/rust-lang/regex/issues/934#issuecomment-1703299897) and improve it.
It’s worth experimenting with, but after re-measuring @Shnatsel’s cloning optimization with a better benchmark, I’m probably going to move on. The contention problem isn’t fully solved, but it’s in a much more palatable state.
But I definitely encourage others to keep working on this. I'd be happy to merge simpler patches. (My general prior is that tricky `unsafe` should come with sizeable perf improvements.)

Hmmm, just a note: I think false sharing may indeed be relevant for the lock-free stack approach. As I increase the alignment of individual linked list entries in the lock-free stack (each of which is allocated in a `Box`), the performance improves slightly, but reliably. With this change (https://github.com/cosmicexplorer/minimal-lock-free-stack/commit/be269d8cc0ea264d5c1629e56d4f560e23cb3f17), we go from 90x -> 80x slower, or 500ms to 470ms (see https://github.com/rust-lang/regex/issues/934#issuecomment-1703299897 to compare to prior results).

I'm having trouble with the bit twiddling necessary to use 64-bit atomics like crossbeam does, and will probably give up, but if the above behavior looks suspicious to anyone, please feel free to follow up.
OK, so it turns out that false sharing appears to be the primary culprit here. Using multiple stacks and ensuring each mutex is in its own cache line leads to considerable improvement. For example, if I try @leviska's approach as-is, then I get:

But if I add a `repr(C, align(64))` wrapper struct around its `TryMutex`, then I get:

But it turns out I'm able to do as well (even a little better) by sticking with a simple mutex strategy combined with creating new values if there's contention (i.e., when `Mutex::try_lock` fails); see the sketch below:

(I also tried experimenting with an optimization idea from @Shnatsel where, instead of creating a fresh `Cache` when under high contention, we instead clone an existing one. This has the benefit of potentially reusing some parts of the lazy DFA's transition table that had been previously computed. But in all my attempts at this, the timings were actually a little slower. This is somewhat surprising, but my best guess is that generating a fresh cache is perhaps not as expensive as I thought, and this optimization tends to add more pressure on a single atomic? Not sure.)
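For illustration, a minimal sketch of those two ideas (one cache-line-padded mutex per shard, plus create-on-contention via `try_lock`) might look like this. The type names and the `create` function are illustrative, not the actual patch:

```rust
use std::sync::Mutex;

// Pad a value out to its own cache line so two mutexes never share one.
// 64 bytes is an assumption (typical for x86_64).
#[repr(C, align(64))]
struct CacheLinePadded<T>(T);

// A pool of scratch values that refuses to wait on a contended lock:
// if `try_lock` fails, it simply creates a fresh value instead.
struct ShardedPool<T> {
    shards: Vec<CacheLinePadded<Mutex<Vec<T>>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(shards: usize, create: fn() -> T) -> ShardedPool<T> {
        let shards = (0..shards)
            .map(|_| CacheLinePadded(Mutex::new(Vec::new())))
            .collect();
        ShardedPool { shards, create }
    }

    fn get(&self, shard: usize) -> T {
        match self.shards[shard % self.shards.len()].0.try_lock() {
            // Uncontended: reuse a pooled value if there is one.
            Ok(mut stack) => stack.pop().unwrap_or_else(|| (self.create)()),
            // Contended: don't wait, just build a fresh value.
            Err(_) => (self.create)(),
        }
    }

    fn put(&self, shard: usize, value: T) {
        // If the lock is busy, drop the value rather than block.
        if let Ok(mut stack) = self.shards[shard % self.shards.len()].0.try_lock() {
            stack.push(value);
        }
    }
}
```

The shard index could come from a per-thread counter; the point is only that a contended `try_lock` never blocks a search.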
@cosmicexplorer: Yeah, I have no doubt that it isn't so complex if you use dependencies to either do it for you or give you access to the requisite primitives (i.e., 128-bit atomics in this case). I really cannot personally justify taking a dependency for this, so it would mean re-rolling a lot of that soup. Not impossible, and it can be done, but annoying.

And yeah, interesting how much slower it is. Did you use a single stack shared across all threads, or 8 sharded stacks? If the former, my guess based on my run-in with false sharing is exactly that…
Yeah I explicitly decided long ago not to go down this route. The lazy DFA isn’t the only thing that needs a mutable cache. And IIRC, RE2’s lazy DFA is not lock free. I’m pretty sure there are mutexes inside of there. The downside of RE2’s approach is that your lazy DFA becomes a lot more complicated IMO. As do the mutable scratch types for other engines (such as the PikeVM). And I don’t actually know whether RE2 suffers from contention issues or not. Possibly not.
Anyway, given what’s in #1080 now (just updated), we’ve gone from 215x slower for 64 threads:
To 6x slower:
That’s good enough of a win for me. I just hope it doesn’t introduce other unforeseen problems. Unfortunately, it can be somewhat difficult to observe them. (For example, excessive memory usage.)
It is not for the faint of heart. If you ever do get interested, you can hmu on twitter (@/drawsmiguel, priv rn but feel free to request) or just poke junyer and ask for Miguel.
You pay a dynamic cross-module call usually. If you want to bound the number of indices, I suggest picking a power of two. If you’re feeling spicy you can do a two-level radix tree (of, say, eight entries each level) where the outer level is allocated eagerly and the inner levels are allocated on-demand and initialized racily (as I do in my post).
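For a concrete picture of the racy initialization being described, here is a hedged sketch (my own types and names, not code from any post or crate) of a two-level table with an eagerly allocated outer level and on-demand, racily installed inner levels:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Mutex;

const FANOUT: usize = 8;

// Inner level: a fixed block of slots.
struct Inner<T> {
    slots: [Mutex<Option<T>>; FANOUT],
}

impl<T> Default for Inner<T> {
    fn default() -> Inner<T> {
        Inner { slots: std::array::from_fn(|_| Mutex::new(None)) }
    }
}

struct RadixTable<T> {
    // Outer level: allocated eagerly, fixed size.
    outer: [AtomicPtr<Inner<T>>; FANOUT],
}

impl<T> RadixTable<T> {
    fn new() -> RadixTable<T> {
        RadixTable { outer: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())) }
    }

    // Look up the slot for `index` (assumed < FANOUT * FANOUT), allocating
    // the inner level on demand. Initialization is "racy": two threads may
    // both allocate, but only one wins; the loser frees its allocation.
    fn slot(&self, index: usize) -> &Mutex<Option<T>> {
        let (hi, lo) = (index / FANOUT, index % FANOUT);
        let cell = &self.outer[hi];
        let mut p = cell.load(Ordering::Acquire);
        if p.is_null() {
            let fresh = Box::into_raw(Box::new(Inner::default()));
            match cell.compare_exchange(
                ptr::null_mut(),
                fresh,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => p = fresh,
                Err(winner) => {
                    // Lost the race: free ours, use the winner's.
                    unsafe { drop(Box::from_raw(fresh)) };
                    p = winner;
                }
            }
        }
        // Safety: once installed, an inner level is never removed, so the
        // pointer stays valid for the table's lifetime. (A real version
        // would free the inner levels in Drop; this sketch leaks them.)
        unsafe { &(*p).slots[lo] }
    }
}
```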
Also, junyer randomly shared https://www.usenix.org/system/files/atc19-dice.pdf which feels like it has relevant ideas. I haven't read it fully, but a global table keyed on `thread_id ^ lock_addr` is very interesting.

Spinlocks are basically impossible to do well in user space. See: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html
The non-std version of a `Pool` uses a spin lock because there just isn't a better choice (short of a lock-free stack, but as mentioned, such things are difficult to implement correctly without a GC). Using a spin lock with std enabled would be a cure worse than the disease, even if it were faster. Besides, a mutex is likely already spinning to some degree before yielding to an OS primitive.

Fair enough. Apologies. I should have immediately explained myself.
AFAIK from what you've described, your mental model of the issue is mostly correct, or at least it matches what I've seen. Basically: run on a hugely multicore machine, saturate every hardware thread (I have 64 hardware threads), and do a lot of (small) regex matches.

I can't really share the code of my exact case, but I actually might have a reasonable real-world test case for you, as it just happens that recently during the weekend I encountered this issue in one of the crates in the wild and put up a PR optimizing it.
src/main.rs:

Here are the numbers when not using `regex` for the hot path (so there's no lock contention happening):

And just for reference, here are the numbers when running on a single thread (basically just replace `into_par_iter` with `into_iter`):

So when using `regex`, jumping from 1 thread to 64 threads the program only got 1m39s faster (~54% of what it was) but wasted over an hour and a half of CPU time. When not using `regex`, jumping from 1 thread to 64 threads the program got 3m30s faster (~2% of what it was) and wasted less than two minutes of CPU time. When running single-threaded, we can see that my changes didn't affect the non-contended case too much.

Of course, these are the results on my machine (Threadripper 3970x, 256GB of 2667 MT/s RAM); your mileage may vary.
And here’s the commit which got rid of
regexusage so that you can see for what exactly that crate is usingregex: https://github.com/koute/lingua-rs/commit/304cb5a113ddbb968fab2ef45e41b686e42e5aa8You could argue that maybe it’s not an appropriate use of
regex, but nevertheless this is what I’ve encountered in the wild.It’s technically an option, but not one I’d ever sign off on personally. The regex API is essentially duplicated for
&strand&[u8]. Duplicating each of those again just to accept scratch space to work around a problem that one can already work around is not worth it.But, yes, I do plan on exposing such an API! Just not in
regex, but rather, inregex-automata. You can see what it might look like here: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/meta/struct.Regex.html#method.try_find (Excuse the lack of docs. I’m actively working on that component. You’ll find more docs on the lower level regex engines. For example: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/hybrid/regex/struct.Regex.html#method.findI’m also planning on exposing the underlying
Pooltype inregex-automatatoo: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/util/pool/struct.Pool.htmlAlthough you will have to do some kind of synchronization if you’re calling a regex from multiple threads and watch to share the same mutable scratch space across multiple threads. And if you’re not running regexes simultaneously then there won’t be any contention and you won’t be bitten by this issue in the first place.
The `regex-automata` plans are tracked by #656.

Right. The `thread_local` crate will give you speed, as long as you can deal with its memory profile. (AFAIK, its memory profile is only an issue when you have lots of threads. If you keep your thread pool limited to some reasonable number, then I don't think you can run into any problems. Problems only tended to appear in managed environments, e.g., using the regex crate from a Go program.)
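For readers who want to try that route, a per-thread cloning wrapper built on the `thread_local` crate might look like the hedged sketch below. The wrapper type and its name are mine, not an API from either crate, and its memory scales with the number of threads that ever touch it, per the caveat above:

```rust
use regex::Regex;
use thread_local::ThreadLocal; // the `thread_local` crate discussed above

// Hypothetical wrapper: each thread lazily clones the Regex on first use,
// so every thread always hits its own pool's "first thread" fast path.
struct PerThreadRegex {
    re: Regex,
    cache: ThreadLocal<Regex>,
}

impl PerThreadRegex {
    fn new(re: Regex) -> PerThreadRegex {
        PerThreadRegex { re, cache: ThreadLocal::new() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        // First call on each thread clones the Regex (cheap: the compiled
        // program is behind an Arc); later calls reuse that thread's clone.
        self.cache.get_or(|| self.re.clone()).is_match(haystack)
    }
}
```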