runtime: Locks slow on Linux with lots of threads

A failing unit test as part of a separate project led me to investigate some odd performance characteristics. At first glance - and I’m not an expert on this - it looks like locks are really expensive on Linux when there are lots of threads trying to acquire the same lock.

I’ve put together a small test app to demonstrate this. No extra dependencies required - just dotnet new console (with netcoreapp1.1) and then edit Program.cs:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

class Program
{
    static long counter = 0;
    
    static void Main(string[] args)
    {
        int threadCount = int.Parse(args[0]);
        long totalWork = long.Parse(args[1]);
        int incrementsPerIteration = int.Parse(args[2]);
        
        // Each thread performs iterationsPerThread lock acquisitions, doing
        // incrementsPerIteration increments per acquisition, so the total
        // number of increments is roughly totalWork.
        int iterationsPerThread = (int) (totalWork / ((long) incrementsPerIteration * threadCount));
        
        object monitor = new object();
        // Warm up (JIT the method) before timing, then reset the counter.
        Run(monitor, 1, 1);
        counter = 0;
        
        var threads = Enumerable
            .Range(0, threadCount)
            .Select(_ => new Thread(() => Run(monitor, iterationsPerThread, incrementsPerIteration)))
            .ToList();
        
        var stopwatch = Stopwatch.StartNew();
        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join());
        stopwatch.Stop();
        
        Console.WriteLine($"Threads: {threadCount}");
        Console.WriteLine($"Elapsed time: {stopwatch.Elapsed.TotalSeconds:0.00}");
    }
    
    // Repeatedly acquires the lock, doing a small amount of work while holding it.
    static void Run(object monitor, int iterations, int increments)
    {
        for (int i = 0; i < iterations; i++)
        {
            lock (monitor)
            {
                for (int j = 0; j < increments; j++)
                {
                    counter++;
                }
            }
        }
    }
}

Windows machine dotnet --info:

.NET Command Line Tools (1.0.4)

Product Information:
 Version:            1.0.4
 Commit SHA-1 hash:  af1e6684fd

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.15063
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\1.0.4

Linux machine dotnet --info:

.NET Command Line Tools (1.0.4)

Product Information:
 Version:            1.0.4
 Commit SHA-1 hash:  af1e6684fd

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  16.04
 OS Platform: Linux
 RID:         ubuntu.16.04-x64
 Base Path:   /usr/share/dotnet/sdk/1.0.4

The Windows machine and Linux machine below have identical CPUs (Core i5 IIRC; I can go into more detail if necessary). As the number of threads goes up, the Windows timing stays roughly constant, but the Linux timing gets much, much worse.

Windows results

dotnet run 10 1000000000 100
Threads: 10
Elapsed time: 10.89

dotnet run 20 1000000000 100
Threads: 20
Elapsed time: 10.86

dotnet run 30 1000000000 100
Threads: 30
Elapsed time: 10.90

dotnet run 40 1000000000 100
Threads: 40
Elapsed time: 10.86

dotnet run 50 1000000000 100
Threads: 50
Elapsed time: 10.85

Linux results (same inputs)

dotnet run 10 1000000000 100
Threads: 10
Elapsed time: 17.05

dotnet run 20 1000000000 100
Threads: 20
Elapsed time: 40.21

dotnet run 30 1000000000 100
Threads: 30
Elapsed time: 113.84

dotnet run 40 1000000000 100
... didn't complete after a long time. Killed to avoid generating pointless heat

I realize this is a very unscientific test - proper diagnostics would be rather more nuanced. But it seems like enough of an alarm bell to be worth raising for smarter people to look closely.

I have no idea whether this is an inherent Linux issue, or a CoreCLR implementation issue. I haven’t yet tried with the .NET Core 2.0 preview, and would rather not, but could do so if necessary.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (18 by maintainers)

Most upvoted comments

@jkotas, @stephentoub, @vancem, what are your thoughts on this? Is there anything else I should test to see if I’m missing something?

I have a few comments/observations

  1. I realize that historically we have had significant problems with respect to lock performance (in particular, spinning WAY too long), and I would like to take this issue as an opportunity to do some inventory and figure out exactly what we do now, and whether we believe it is a good idea or not (historically we were much too risk-averse to entertain changing anything). Let me review the code, maybe do some experiments (maybe write some perf tests), and get back to you. The rest of this is my general observations.
  2. I DON’T view this scenario as a primary scenario to tune for. This test shows what happens when MANY threads all beat on a single lock (and do nothing outside the lock). Frankly, when any lock has a persistent queue of waiters on it, you ALREADY have a perf problem (you have a hot lock and you have a scalability problem to fix). Thus we would only be turning a very bad problem into just a bad problem, which is a secondary scenario.
  3. HOWEVER the SPINNING is a real problem because it is STARVING real work from happening, and creates a ‘Meltdown’ issue (the more contention there is, the more inefficient you are which makes the problem worse). This can lead to VERY bad behavior (we have seen this in real scenarios and it is not pretty). Notice that if there was NO spinning, there would have been no problem above because N-1 threads would be waiting and one thread would always be making progress (which is the best you can do).
  4. Generally speaking, an ideal algorithm for locks is to be completely unfair in small time scales (a thread that has the processor should have preference to getting the lock, since its cache lines are ‘hot’), but fair at long time intervals (e.g. 1 - 10 msec), to avoid starvation (e.g. if a waiter has been waiting more than 1msec, then it should get priority).
  5. Spinning should be kept well in check. It should be on the order of the cost of context switching out and back: the context switch itself (call it 3K instrs ~= 1usec), plus some estimate of the cost of ‘warming up the cache again’ (guessing under 30K instructions ~= 10usec).
  6. Note that you should do NO spinning if the likely wait time is > the context switch cost. Interestingly, you CAN come up with a very good estimate of whether this is true by looking at what happened before (if you had to wait previously, odds are you will this time, and so you should dispense with spinning).
  7. If you do NO spinning, it is not that bad (you pay the context switch cost, but that is bounded and not huge).
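Points 5 and 6 above can be sketched as a lock that spins only within a small fixed budget, and that skips spinning entirely when its previous acquisition had to block. This is purely a hypothetical illustration, not CoreCLR's actual monitor implementation; the class name, spin budget, and the single shared "did we block last time" hint are all made up for the sketch:

```csharp
using System.Threading;

// Illustrative only: a spin-then-wait lock per points 5-6 above.
// Not the CoreCLR implementation.
class SpinThenWaitLock
{
    private int _state;                // 0 = free, 1 = held
    private bool _lastAcquireBlocked;  // unsynchronized hint: did we block last time?
    private int _waiterCount;
    private readonly SemaphoreSlim _waiters = new SemaphoreSlim(0);

    public void Enter()
    {
        // Spin only briefly, and only if history suggests the wait will be short.
        if (!_lastAcquireBlocked)
        {
            var spinner = new SpinWait();
            for (int i = 0; i < 100; i++)  // small, fixed spin budget
            {
                if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
                {
                    _lastAcquireBlocked = false;
                    return;
                }
                spinner.SpinOnce();
            }
        }

        // Fall back to a kernel-backed wait; re-check the lock on every wakeup,
        // so a spurious or extra wakeup costs a retry but never corrupts state.
        Interlocked.Increment(ref _waiterCount);
        while (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
            _waiters.Wait();
        Interlocked.Decrement(ref _waiterCount);
        _lastAcquireBlocked = true;  // remember: next time, skip the spin
    }

    public void Exit()
    {
        Volatile.Write(ref _state, 0);
        if (Volatile.Read(ref _waiterCount) > 0)
            _waiters.Release();  // wake one waiter to retry the acquire
    }
}
```

With no spinning after a blocked acquire, N-1 contending threads sit in the kernel wait while one makes progress, which is the behavior point 3 describes as the best you can do.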

Arguably, WE should not be doing the spinning at all: WE call an OS lock, and that lock should be the thing that figures this out for everyone. We should not have to do things ‘on top’ of the OS lock.

Anyway, the idea of being unfair at small time scales (which is what the Windows priority boost does), and/or not spinning much (which avoids the meltdown), should solve things. Let me look at the code before I make a more specific recommendation, however.

I believe it’s running into a lock convoy problem. Spinners are not able to take the lock while there are waiting threads, and a waiting thread is not able to wake up before a significant delay due to spinners hoarding the CPU.

This works on Windows because, I believe, when a waiting thread is signaled to wake up, the OS also gives that thread a priority boost. As such, it’s able to wake up ASAP and acquire the lock.
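As a footnote to the "you have a hot lock and a scalability problem to fix" point above: in an application (as opposed to this deliberate repro), when the only work done under the lock is bumping a shared counter, the lock can be eliminated outright with an atomic add. A minimal sketch, with made-up names, of what the repro's Run method would look like lock-free (note this changes what the benchmark measures, so it is a fix for the app, not for the repro):

```csharp
using System.Threading;

// Sketch: the repro's hot section rewritten without a lock. One atomic add
// replaces "lock + loop of increments", so there is no monitor to contend on.
static class LockFreeCounter
{
    public static long Counter;

    public static void Run(int iterations, int increments)
    {
        for (int i = 0; i < iterations; i++)
        {
            // Interlocked.Add is a single atomic read-modify-write.
            Interlocked.Add(ref Counter, increments);
        }
    }
}
```

The trade-off is that the per-iteration work collapses into one instruction, so this only applies when the critical section really is that trivial; anything larger still needs a lock (or a redesign that shards the shared state).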