parking_lot: Heavily degraded performance under extreme contention on some processors
The blog post Mutexes Are Faster Than Spinlocks generated a Reddit discussion about lock implementations. This prompted some users to benchmark parking_lot on Windows, and some of the results look very problematic, while others could simply be better (parking_lot is consistently faster than the spinlocks on Linux and macOS, but not on Windows for some reason). I did not run this benchmark myself since I don't currently have Windows installed, but since the user hadn't filed an issue I decided to post it here. Their comment: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/
The results:
Benchmark code: https://github.com/matklad/lock-bench

- Windows 10 Pro
- Intel Core i7-5930k @ 3.5 GHz
- stable-x86_64-pc-windows-msvc (default), rustc 1.40.0 (73528e339 2019-12-16)
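For context on what these numbers measure, here is a rough sketch of the shape of such a harness, inferred only from the parameters it takes (n_threads, n_locks, n_ops, n_rounds); this is an assumption for illustration, not the actual lock-bench code:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Assumed structure: each of `n_threads` threads performs `n_ops` increments,
// each under a pseudo-randomly chosen one of `n_locks` mutexes, and one round
// is timed end to end. The real harness repeats this for `n_rounds` rounds
// and reports avg/min/max.
fn run_round(n_threads: usize, n_locks: usize, n_ops: usize) -> Duration {
    let locks: Arc<Vec<Mutex<u64>>> =
        Arc::new((0..n_locks).map(|_| Mutex::new(0)).collect());

    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let locks = Arc::clone(&locks);
            thread::spawn(move || {
                // Cheap deterministic "random" lock choice for the sketch.
                let mut x = t as u64 + 1;
                for _ in 0..n_ops {
                    x = x.wrapping_mul(6364136223846793005).wrapping_add(1);
                    let idx = (x >> 33) as usize % locks.len();
                    *locks[idx].lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // Mirrors the "extreme contention" invocation `32 2 10000 100` below.
    let d = run_round(32, 2, 10_000);
    println!("one round: {:?}", d);
}
```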
extreme contention
```
cargo run --release 32 2 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 32.452982ms   min 20.4146ms   max 45.2767ms
parking_lot::Mutex   avg 154.509064ms  min 111.2522ms  max 180.4367ms
spin::Mutex          avg 46.3496ms     min 33.5478ms   max 56.1689ms
AmdSpinlock          avg 45.725299ms   min 32.1936ms   max 54.4236ms

std::sync::Mutex     avg 33.383154ms   min 18.2827ms   max 46.0634ms
parking_lot::Mutex   avg 134.983307ms  min 95.5948ms   max 176.1896ms
spin::Mutex          avg 43.402769ms   min 31.9209ms   max 55.0075ms
AmdSpinlock          avg 39.572361ms   min 28.1705ms   max 50.2935ms
```
heavy contention
```
cargo run --release 32 64 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 64 10000 100`
Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.8268ms    min 6.4807ms  max 14.174ms
parking_lot::Mutex   avg 8.470518ms   min 3.6558ms  max 10.0896ms
spin::Mutex          avg 6.356252ms   min 4.6299ms  max 8.1838ms
AmdSpinlock          avg 7.147972ms   min 5.7731ms  max 9.2027ms

std::sync::Mutex     avg 12.790879ms  min 3.7349ms  max 14.4933ms
parking_lot::Mutex   avg 8.526535ms   min 6.7143ms  max 10.0845ms
spin::Mutex          avg 5.730139ms   min 2.8063ms  max 7.6221ms
AmdSpinlock          avg 7.082415ms   min 5.2678ms  max 8.2064ms
```
light contention
```
cargo run --release 32 1000 10000 100
    Finished release [optimized] target(s) in 0.05s
     Running `target\release\lock-bench.exe 32 1000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 7.736325ms  min 4.3287ms  max 9.194ms
parking_lot::Mutex   avg 4.912407ms  min 4.1386ms  max 5.9617ms
spin::Mutex          avg 3.787679ms  min 3.2468ms  max 4.8136ms
AmdSpinlock          avg 4.229783ms  min 1.0404ms  max 5.2414ms

std::sync::Mutex     avg 7.791248ms  min 6.2809ms  max 8.9858ms
parking_lot::Mutex   avg 4.933393ms  min 4.3319ms  max 6.1515ms
spin::Mutex          avg 3.782046ms  min 3.3339ms  max 5.4954ms
AmdSpinlock          avg 4.22442ms   min 3.1285ms  max 5.3338ms
```
no contention
```
cargo run --release 32 1000000 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 1000000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.465917ms  min 8.8088ms  max 13.6216ms
parking_lot::Mutex   avg 5.164135ms   min 4.2478ms  max 6.1451ms
spin::Mutex          avg 4.112927ms   min 3.1624ms  max 5.599ms
AmdSpinlock          avg 4.302528ms   min 4.0533ms  max 5.4168ms

std::sync::Mutex     avg 11.765036ms  min 3.3567ms  max 13.5108ms
parking_lot::Mutex   avg 3.992219ms   min 2.4974ms  max 5.5604ms
spin::Mutex          avg 3.425334ms   min 2.0133ms  max 4.7788ms
AmdSpinlock          avg 3.813034ms   min 2.2009ms  max 5.0947ms
```
About this issue
- State: open
- Created 4 years ago
- Comments: 43 (7 by maintainers)
You can read Linus's thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752.
Basically, what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread which currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads.
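As a rough illustration of that strategy (a minimal sketch, not parking_lot's actual implementation), a test-and-set lock with exponential back-off might look like this:

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative only: every failed lock attempt doubles the number of PAUSE
// iterations before retrying, so the current owner can lock/unlock rapidly
// without cacheline interference from the waiters.
pub struct SpinBackoffLock {
    locked: AtomicBool,
}

impl SpinBackoffLock {
    pub const fn new() -> Self {
        SpinBackoffLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        let mut spins: u32 = 1;
        while self.locked.swap(true, Ordering::Acquire) {
            // Missed the lock: back off for `spins` PAUSEs, then double it.
            for _ in 0..spins {
                hint::spin_loop();
            }
            if spins < 1 << 10 {
                spins *= 2; // cap chosen arbitrarily for the sketch
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```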
The reason why you are seeing different results on different microarchitectures basically boils down to a change Intel made in the Skylake and later microarchitectures which changed the duration of the PAUSE x86 instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures.
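One crude way to see that microarchitectural difference directly (an assumed methodology, not something from this thread) is to time a large batch of spin-loop hints:

```rust
use std::hint;
use std::time::Instant;

// Times a batch of spin-loop hints (PAUSE on x86) to estimate the per-PAUSE
// latency; on pre-Skylake parts this should come out roughly an order of
// magnitude lower than on Skylake and later.
fn main() {
    const ITERS: u64 = 50_000_000;
    let start = Instant::now();
    for _ in 0..ITERS {
        hint::spin_loop();
    }
    let ns_per_pause = start.elapsed().as_nanos() as f64 / ITERS as f64;
    println!("~{:.2} ns per PAUSE", ns_per_pause);
}
```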
Hmm, the man page for `sched_yield` and the Rust documentation for `std::thread::yield_now()` (https://doc.rust-lang.org/std/thread/fn.yield_now.html) read very differently. Clearly there seems to be a disconnect between what kernel developers want this to be used for and what many people are really using it for. I'm not aware of a more appropriate API for those use cases, short of sticking to kernel-supported synchronization. Is this an API hole, or a more fundamental design issue?

@cynecx Lowering the value in the condition to `self.counter <= 1` does not seem to have an impact on the results. Always using thread_yield (ignoring the counter):
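For reference, the kind of spin-vs-yield branch being tweaked in these experiments looks roughly like this (field names and thresholds are illustrative, not parking_lot's exact internals):

```rust
use std::{hint, thread};

// Illustrative spin-wait: below a small counter threshold it busy-waits with
// exponentially more PAUSEs; above it, it yields the time slice to the OS.
// The experiments above lower the threshold (`counter <= 1`) or skip the
// spinning path entirely so every miss goes straight to a yield.
struct SpinWait {
    counter: u32,
}

impl SpinWait {
    fn spin(&mut self) {
        self.counter += 1;
        if self.counter <= 1 {
            for _ in 0..(1u32 << self.counter) {
                hint::spin_loop();
            }
        } else {
            thread::yield_now();
        }
    }
}
```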
parking_lot seems to be very good when running on a single CCX:

4 cores, 8 threads, 1 CCX

8 cores, 8 threads, 2 CCX
Windows 10 - R7 1800X (8 cores, 16 threads), cacheline results (see my previous post for the non-cacheline results)
Avg is a bit worse.
That’s the first thing that came to mind but I checked the timer interval and it’s held at 1ms regardless of Firefox (I guess there’s a driver or service keeping it at that, because I have everything in the task bar and tray closed). I manually set the interval to the minimum (0.5ms), re-ran and got the same results.
Doesn’t seem to make any significant difference.
I ran the bench on the same machine on Linux (Manjaro, 5.4). Times are consistently like these, regardless of what’s open:
I can reproduce on ThreadRipper 3970X.
Ran a modified version running only the `std::sync::Mutex` loop under `perf stat`:

And with `parking_lot::Mutex` only:

The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels in htop during a run seem to indicate less CPU used overall while the `parking_lot::Mutex` test runs compared to `std::sync::Mutex`. I haven't looked to get more accurate data, though. Anyway, that got me wondering what the results would look like if the CPU frequency governor was changed from ondemand to performance, and this results in:

FWIW.
I’m not going to dive deeper into this right now, but I hope the data above can be useful somehow.
`SwitchToThread` vs `Sleep` makes no difference for me.

I got a large improvement by increasing the number of spins before a sleep, but it took a really big increase, so my concern is that I thereby just degraded it to the spinlock:

Result:

So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement. Increasing the number of `thread_parker::thread_yield()` calls always decreases the performance, so it seems that kind of yielding to the OS might be inefficient. E.g. going to a variant which avoids all thread yield calls just changes the result to:
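As a sketch of the direction this experiment takes (again an illustrative shape, not the actual patch), raising the spin bound to 2^15 and dropping the yield path altogether amounts to something like:

```rust
use std::hint;

// Illustrative variant of the experiment above: keep doubling the busy-wait
// up to 2^15 PAUSE iterations and never fall back to yielding to the OS,
// which is what makes it behave much like a plain spinlock.
struct LongSpinWait {
    shift: u32,
}

impl LongSpinWait {
    const MAX_SHIFT: u32 = 15; // previously effectively 3 (2^3 spins)

    fn spin(&mut self) {
        if self.shift < Self::MAX_SHIFT {
            self.shift += 1;
        }
        for _ in 0..(1u32 << self.shift) {
            hint::spin_loop();
        }
        // No thread_yield() here: every miss just spins again.
    }
}
```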