parking_lot: Heavily degraded performance under extreme contention on some processors

The blog post Mutexes Are Faster Than Spinlocks generated a Reddit discussion about lock implementations. That discussion prompted some users to benchmark parking_lot on Windows, and some of the results are very problematic, while others could simply be better (parking_lot is consistently faster than a spinlock on Linux and macOS, but for some reason not on Windows).

I did not run this benchmark myself since I don’t currently have Windows installed, but since the user hadn’t filed an issue I decided to post it here. Their comment: https://www.reddit.com/r/rust/comments/ejx7y8/blog_post_mutexes_are_faster_than_spinlocks/fd31um3/

The results:

Benchmark code: https://github.com/matklad/lock-bench
Windows 10 Pro, Intel Core i7-5930k @ 3.5 GHz
stable-x86_64-pc-windows-msvc (default), rustc 1.40.0 (73528e339 2019-12-16)

extreme contention

cargo run --release 32 2 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 32.452982ms  min 20.4146ms    max 45.2767ms
parking_lot::Mutex   avg 154.509064ms min 111.2522ms   max 180.4367ms
spin::Mutex          avg 46.3496ms    min 33.5478ms    max 56.1689ms
AmdSpinlock          avg 45.725299ms  min 32.1936ms    max 54.4236ms

std::sync::Mutex     avg 33.383154ms  min 18.2827ms    max 46.0634ms
parking_lot::Mutex   avg 134.983307ms min 95.5948ms    max 176.1896ms
spin::Mutex          avg 43.402769ms  min 31.9209ms    max 55.0075ms
AmdSpinlock          avg 39.572361ms  min 28.1705ms    max 50.2935ms

heavy contention

cargo run --release 32 64 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 64 10000 100`
Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.8268ms    min 6.4807ms     max 14.174ms
parking_lot::Mutex   avg 8.470518ms   min 3.6558ms     max 10.0896ms
spin::Mutex          avg 6.356252ms   min 4.6299ms     max 8.1838ms
AmdSpinlock          avg 7.147972ms   min 5.7731ms     max 9.2027ms

std::sync::Mutex     avg 12.790879ms  min 3.7349ms     max 14.4933ms
parking_lot::Mutex   avg 8.526535ms   min 6.7143ms     max 10.0845ms
spin::Mutex          avg 5.730139ms   min 2.8063ms     max 7.6221ms
AmdSpinlock          avg 7.082415ms   min 5.2678ms     max 8.2064ms

light contention

cargo run --release 32 1000 10000 100
    Finished release [optimized] target(s) in 0.05s
     Running `target\release\lock-bench.exe 32 1000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 7.736325ms   min 4.3287ms     max 9.194ms
parking_lot::Mutex   avg 4.912407ms   min 4.1386ms     max 5.9617ms
spin::Mutex          avg 3.787679ms   min 3.2468ms     max 4.8136ms
AmdSpinlock          avg 4.229783ms   min 1.0404ms     max 5.2414ms

std::sync::Mutex     avg 7.791248ms   min 6.2809ms     max 8.9858ms
parking_lot::Mutex   avg 4.933393ms   min 4.3319ms     max 6.1515ms
spin::Mutex          avg 3.782046ms   min 3.3339ms     max 5.4954ms
AmdSpinlock          avg 4.22442ms    min 3.1285ms     max 5.3338ms

no contention

cargo run --release 32 1000000 10000 100
    Finished release [optimized] target(s) in 0.03s
     Running `target\release\lock-bench.exe 32 1000000 10000 100`
Options {
    n_threads: 32,
    n_locks: 1000000,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 12.465917ms  min 8.8088ms     max 13.6216ms
parking_lot::Mutex   avg 5.164135ms   min 4.2478ms     max 6.1451ms
spin::Mutex          avg 4.112927ms   min 3.1624ms     max 5.599ms
AmdSpinlock          avg 4.302528ms   min 4.0533ms     max 5.4168ms

std::sync::Mutex     avg 11.765036ms  min 3.3567ms     max 13.5108ms
parking_lot::Mutex   avg 3.992219ms   min 2.4974ms     max 5.5604ms
spin::Mutex          avg 3.425334ms   min 2.0133ms     max 4.7788ms
AmdSpinlock          avg 3.813034ms   min 2.2009ms     max 5.0947ms

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 43 (7 by maintainers)

Most upvoted comments

You can read Linus’s thoughts about thread_yield in this post: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752.

Basically, what this test is measuring is how effective the spin back-off is. parking_lot uses an exponential back-off strategy where it doubles the amount of time it pauses between spins every time it misses the lock. This effectively allows the thread that currently holds the lock to rapidly lock/unlock without any cacheline interference from other threads.

The reason you are seeing different results on different microarchitectures basically boils down to a change Intel made in Skylake and later, which increased the latency of the PAUSE x86 instruction from 10 cycles to 140 cycles (relevant issue in .NET). Since I tuned this delay loop on my personal computer, which has a Skylake CPU, the delay loop is much shorter (by about 10x) on other microarchitectures.
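
To illustrate the mechanism being described, here is a rough, self-contained sketch of the exponential back-off idea (this is not parking_lot’s actual code; the Backoff/snooze names and the constants are made up for illustration):

use std::hint::spin_loop;
use std::thread;

// Exponential back-off: each failed lock attempt doubles the number of
// PAUSE/spin_loop iterations, up to a cap, after which we yield to the OS.
struct Backoff {
    counter: u32,
}

impl Backoff {
    fn new() -> Self {
        Backoff { counter: 0 }
    }

    fn snooze(&mut self) {
        if self.counter < 10 {
            self.counter += 1;
        }
        if self.counter <= 3 {
            // 2, 4 or 8 PAUSE hints per call. On pre-Skylake CPUs PAUSE takes
            // ~10 cycles, on Skylake and later ~140 cycles, so the same
            // iteration count waits roughly 14x longer there.
            for _ in 0..(1u32 << self.counter) {
                spin_loop();
            }
        } else {
            // Past the cap, stop burning cycles and give up the time slice.
            thread::yield_now();
        }
    }
}

fn main() {
    let mut b = Backoff::new();
    for _ in 0..6 {
        b.snooze();
    }
}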

Hmm, the man page for sched_yield says:

sched_yield() is intended for use with real-time scheduling policies
(i.e., SCHED_FIFO or SCHED_RR).  Use of sched_yield() with
nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is broken.

The Rust documentation for std::thread::yield_now() reads very differently: https://doc.rust-lang.org/std/thread/fn.yield_now.html. Clearly there is a disconnect between what kernel developers want this to be used for and what many people are actually using it for. I’m not aware of a more appropriate API for those use cases, short of sticking to kernel-supported synchronization. Is this an API hole, or a more fundamental design issue?
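
As a concrete example of that disconnect, the pattern the std docs effectively describe looks like this (a minimal sketch; on Linux, yield_now() ends up calling sched_yield(), which is exactly the SCHED_OTHER usage the man page above discourages):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Busy-wait on a flag, yielding the time slice between checks.
fn wait_for(flag: &AtomicBool) {
    while !flag.load(Ordering::Acquire) {
        thread::yield_now();
    }
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let setter = Arc::clone(&flag);
    let t = thread::spawn(move || setter.store(true, Ordering::Release));
    wait_for(&flag);
    t.join().unwrap();
}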

@cynecx Lowering the value in the condition does not seem to have an impact on the results. With self.counter <= 1:

     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 49.474712ms  min 46.0883ms    max 69.2954ms
parking_lot::Mutex   avg 191.411675ms min 187.1015ms   max 194.4445ms
spin::Mutex          avg 8.319929ms   min 8.0446ms     max 9.2257ms
AmdSpinlock          avg 7.870013ms   min 7.4975ms     max 8.8647ms

std::sync::Mutex     avg 48.940684ms  min 46.2259ms    max 68.8908ms
parking_lot::Mutex   avg 191.403276ms min 187.6034ms   max 194.5025ms
spin::Mutex          avg 8.540062ms   min 7.9655ms     max 33.9991ms
AmdSpinlock          avg 7.869302ms   min 7.6215ms     max 8.4093ms

always using thread_yield (ignoring the counter):

     Running `target\release\lock-bench.exe 32 2 10000 100`
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 48.54949ms   min 46.0273ms    max 68.669ms
parking_lot::Mutex   avg 189.396441ms min 184.3732ms   max 193.476ms
spin::Mutex          avg 8.366757ms   min 7.4263ms     max 10.4656ms
AmdSpinlock          avg 7.893677ms   min 7.4782ms     max 9.6092ms

std::sync::Mutex     avg 48.805392ms  min 46.4548ms    max 68.7367ms
parking_lot::Mutex   avg 189.496029ms min 184.8406ms   max 194.1087ms
spin::Mutex          avg 8.632158ms   min 7.6492ms     max 40.5276ms
AmdSpinlock          avg 7.874248ms   min 7.5315ms     max 8.7431ms

parking_lot seems to be very good when running on a single CCX:

4 cores, 8 threads, 1 CCX

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 11.732156ms  min 4.5348ms     max 38.7124ms
parking_lot::Mutex   avg 17.733915ms  min 12.6413ms    max 20.9893ms
spin::Mutex          avg 21.637979ms  min 679.1µs      max 22.6627ms
AmdSpinlock          avg 23.66785ms   min 2.2282ms     max 53.0323ms

std::sync::Mutex     avg 10.308772ms  min 1.1871ms     max 12.5605ms
parking_lot::Mutex   avg 17.617576ms  min 12.0187ms    max 20.9852ms
spin::Mutex          avg 21.621965ms  min 1.3302ms     max 51.0216ms
AmdSpinlock          avg 24.25325ms   min 1.9652ms     max 175.5796ms

8 cores, 8 threads, 2 CCX

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 22.664372ms  min 19.3603ms    max 30.7216ms   
parking_lot::Mutex   avg 124.191566ms min 96.57ms      max 145.1499ms  
spin::Mutex          avg 33.303393ms  min 29.5157ms    max 92.0333ms   
AmdSpinlock          avg 32.245088ms  min 27.7896ms    max 88.293ms    

std::sync::Mutex     avg 20.921484ms  min 16.1338ms    max 25.4639ms   
parking_lot::Mutex   avg 111.182503ms min 83.1871ms    max 130.6418ms  
spin::Mutex          avg 35.753433ms  min 28.6262ms    max 94.8696ms   
AmdSpinlock          avg 32.314534ms  min 29.1613ms    max 64.7282ms  

Windows 10 - R7 1800X (8 cores, 16 threads), cacheline results (see my previous post for non-cacheline results)

Avg is a bit worse.

Options {
    n_threads: 32,
    n_locks: 2,   
    n_ops: 10000, 
    n_rounds: 100,
}

std::sync::Mutex     avg 30.831742ms  min 25.8507ms    max 40.243ms    
parking_lot::Mutex   avg 137.364687ms min 102.3781ms   max 182.8287ms  
spin::Mutex          avg 52.551899ms  min 47.5073ms    max 67.3358ms   
AmdSpinlock          avg 55.114334ms  min 50.4422ms    max 88.9065ms   

std::sync::Mutex     avg 30.898776ms  min 25.6411ms    max 39.9243ms   
parking_lot::Mutex   avg 131.558342ms min 99.9603ms    max 161.9975ms  
spin::Mutex          avg 53.440152ms  min 48.1445ms    max 94.0101ms   
AmdSpinlock          avg 55.660354ms  min 51.6334ms    max 58.8772ms

Options {
    n_threads: 32,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 6.814591ms   min 6.37ms       max 7.815ms     
parking_lot::Mutex   avg 2.859985ms   min 2.0373ms     max 5.2382ms    
spin::Mutex          avg 1.909338ms   min 841.2µs      max 4.7601ms    
AmdSpinlock          avg 2.875636ms   min 468.6µs      max 35.1337ms   

std::sync::Mutex     avg 6.794524ms   min 6.3382ms     max 7.9424ms    
parking_lot::Mutex   avg 2.868905ms   min 1.9351ms     max 5.3661ms    
spin::Mutex          avg 1.846593ms   min 1.393ms      max 10.4112ms   
AmdSpinlock          avg 2.205968ms   min 1.2394ms     max 3.2763ms 

That’s nifty. On Windows, applications can ask the OS to increase the timer granularity. Media playback programs often do it, and probably browsers too.
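
For reference, a minimal sketch of how a program requests a finer timer on Windows (this is the standard winmm timeBeginPeriod/timeEndPeriod API; the benchmark itself does not do this):

#[cfg(windows)]
#[allow(non_snake_case)]
#[link(name = "winmm")]
extern "system" {
    fn timeBeginPeriod(period_ms: u32) -> u32;
    fn timeEndPeriod(period_ms: u32) -> u32;
}

#[cfg(windows)]
fn main() {
    unsafe {
        timeBeginPeriod(1); // request 1 ms timer granularity
        // ... run the workload here ...
        timeEndPeriod(1); // restore the previous granularity
    }
}

#[cfg(not(windows))]
fn main() {}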

That’s the first thing that came to mind but I checked the timer interval and it’s held at 1ms regardless of Firefox (I guess there’s a driver or service keeping it at that, because I have everything in the task bar and tray closed). I manually set the interval to the minimum (0.5ms), re-ran and got the same results.

I don’t expect it to have the same effect, but can you also test using different power management profiles (Balanced/Performance)?

Doesn’t seem to make any significant difference.

I ran the bench on the same machine on Linux (Manjaro, 5.4). Times are consistently like these, regardless of what’s open:

std::sync::Mutex     avg 41.888257ms  min 37.328722ms  max 44.560946ms
parking_lot::Mutex   avg 50.668306ms  min 33.37132ms   max 57.84472ms
spin::Mutex          avg 36.433262ms  min 10.723163ms  max 72.244387ms
AmdSpinlock          avg 23.923037ms  min 9.067332ms   max 76.638706ms

std::sync::Mutex     avg 41.824524ms  min 35.817135ms  max 44.253448ms
parking_lot::Mutex   avg 50.186056ms  min 36.960654ms  max 55.576347ms
spin::Mutex          avg 39.525809ms  min 10.008313ms  max 94.704114ms
AmdSpinlock          avg 26.139379ms  min 9.452384ms   max 74.966394ms

std::sync::Mutex     avg 42.090291ms  min 37.524273ms  max 45.824792ms
parking_lot::Mutex   avg 51.351734ms  min 35.260994ms  max 56.448228ms
spin::Mutex          avg 38.948821ms  min 10.116644ms  max 81.041639ms
AmdSpinlock          avg 26.802719ms  min 9.325822ms   max 79.673338ms

std::sync::Mutex     avg 42.327898ms  min 37.470312ms  max 45.371616ms
parking_lot::Mutex   avg 51.054137ms  min 42.471562ms  max 55.764639ms
spin::Mutex          avg 36.216242ms  min 10.552043ms  max 99.071695ms
AmdSpinlock          avg 26.746589ms  min 9.888954ms   max 72.403114ms

I can reproduce on ThreadRipper 3970X.

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 43.973765ms  min 36.940827ms  max 48.005445ms 
parking_lot::Mutex   avg 287.913565ms min 258.751948ms max 304.911923ms
spin::Mutex          avg 25.721195ms  min 19.751268ms  max 48.85489ms  
AmdSpinlock          avg 24.204215ms  min 17.663769ms  max 49.178313ms 

std::sync::Mutex     avg 44.113866ms  min 39.096522ms  max 48.283289ms 
parking_lot::Mutex   avg 289.014664ms min 267.429467ms max 305.30055ms 
spin::Mutex          avg 26.11682ms   min 19.584014ms  max 60.147676ms 
AmdSpinlock          avg 24.284636ms  min 17.043627ms  max 67.09198ms  

Ran a modified version that runs only the std::sync::Mutex loop under perf stat:

std::sync::Mutex     avg 43.501474ms  min 38.28496ms   max 47.539912ms 

 Performance counter stats for 'cargo run --release 32 2 10000 100':

        119,327.91 msec task-clock                #    8.225 CPUs utilized          
         2,388,322      context-switches          #    0.020 M/sec                  
            25,909      cpu-migrations            #    0.217 K/sec                  
             4,142      page-faults               #    0.035 K/sec                  
   438,819,580,111      cycles                    #    3.677 GHz                      (84.48%)
   332,632,165,301      stalled-cycles-frontend   #   75.80% frontend cycles idle     (82.91%)
    10,515,010,024      stalled-cycles-backend    #    2.40% backend cycles idle      (82.22%)
    50,102,474,576      instructions              #    0.11  insn per cycle         
                                                  #    6.64  stalled cycles per insn  (82.12%)
    11,335,523,432      branches                  #   94.995 M/sec                    (83.25%)
       157,822,216      branch-misses             #    1.39% of all branches          (85.02%)

      14.508207241 seconds time elapsed

       9.607140000 seconds user
     109.520591000 seconds sys

And with parking_lot::Mutex only:

parking_lot::Mutex   avg 273.396674ms min 254.347881ms max 289.119739ms

 Performance counter stats for 'cargo run --release 32 2 10000 100':

        222,159.62 msec task-clock                #    5.920 CPUs utilized          
        21,304,902      context-switches          #    0.096 M/sec                  
           378,781      cpu-migrations            #    0.002 M/sec                  
             4,141      page-faults               #    0.019 K/sec                  
   444,159,684,024      cycles                    #    1.999 GHz                      (83.12%)
    85,009,073,534      stalled-cycles-frontend   #   19.14% frontend cycles idle     (83.38%)
   135,860,444,447      stalled-cycles-backend    #   30.59% backend cycles idle      (83.66%)
   184,364,949,242      instructions              #    0.42  insn per cycle         
                                                  #    0.74  stalled cycles per insn  (83.62%)
    41,792,087,226      branches                  #  188.117 M/sec                    (83.24%)
       615,614,410      branch-misses             #    1.47% of all branches          (82.98%)

      37.525030261 seconds time elapsed

      49.254354000 seconds user
     174.546567000 seconds sys

The GHz numbers caught my attention, on top of everything else, and empirically the CPU usage levels shown by htop during a run seem to indicate less CPU used overall while the parking_lot::Mutex test runs compared to std::sync::Mutex. I haven’t looked into getting more accurate data, though. Anyway, that got me wondering what the results would look like if the CPU frequency governor were changed from ondemand to performance (see the sysfs sketch after these results), and this results in:

Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 41.136917ms  min 33.999931ms  max 49.160068ms 
parking_lot::Mutex   avg 208.116968ms min 196.38674ms  max 216.458502ms
spin::Mutex          avg 18.496924ms  min 13.221013ms  max 43.77934ms  
AmdSpinlock          avg 15.954826ms  min 11.889577ms  max 39.562233ms 

std::sync::Mutex     avg 40.888678ms  min 36.336717ms  max 44.511702ms 
parking_lot::Mutex   avg 206.536087ms min 189.072666ms max 215.691531ms
spin::Mutex          avg 18.482265ms  min 14.164115ms  max 63.214092ms 
AmdSpinlock          avg 15.733487ms  min 12.418109ms  max 38.475479ms 

FWIW.

I’m not going to dive deeper into this right now, but I hope the data above can be useful somehow.
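
For anyone wanting to reproduce the governor switch described above, a minimal sketch of doing it through sysfs (assumes Linux with cpufreq and root privileges; equivalent to running cpupower frequency-set -g performance):

use std::fs;
use std::io;

// Write the requested governor into every CPU's scaling_governor file.
fn set_governor(governor: &str) -> io::Result<()> {
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path().join("cpufreq/scaling_governor");
        if path.exists() {
            fs::write(&path, governor)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    set_governor("performance")
}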

SwitchToThread vs Sleep makes no difference for me.

I got a large improvement by increasing the number of spins before a sleep, but it took a really big increase, so my concern is that I have thereby just degraded it into the spinlock:

pub fn spin(&mut self) -> bool {
    // Stop spinning and tell the caller to park after 20 attempts.
    if self.counter >= 20 {
        return false;
    }
    self.counter += 1;
    // Exponential back-off: up to 2^15 cpu_relax iterations (up from 2^3)
    // before falling back to yielding the thread to the OS.
    if self.counter <= 15 {
        cpu_relax(1 << self.counter);
    } else {
        thread_parker::thread_yield();
    }
    true
}

Result:

parking_lot::Mutex avg 12.178126ms min 11.6343ms max 14.1958ms

So this goes from 2^3 spins to 2^15. Even with 2^9 there was no noticeable improvement. Increasing the number of thread_parker::thread_yield() calls always decreases the performance. Therefore it seems that this kind of yielding to the OS might be inefficient.

E.g. going to

if self.counter >= 16 {
    return false;
}

which avoids all thread_yield calls, just changes the result to

parking_lot::Mutex avg 12.370461ms min 11.8578ms max 14.3432ms