runtime: Locks slow on Linux with lots of threads
A failing unit test as part of a separate project led me to investigate some odd performance characteristics. At first glance - and I’m not an expert on this - it looks like locks are really expensive on Linux when there are lots of threads trying to acquire the same lock.
I’ve put together a small test app to demonstrate this. No extra dependencies required - just dotnet new console (with netcoreapp1.1) and then edit Program.cs:
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
class Program
{
static long counter = 0;
static void Main(string[] args)
{
int threadCount = int.Parse(args[0]);
long totalWork = long.Parse(args[1]);
int incrementsPerIteration = int.Parse(args[2]);
int iterationsPerThread = (int) (totalWork / (incrementsPerIteration * threadCount));
object monitor = new object();
Run(monitor, 1, 1);
counter = 0;
var threads = Enumerable
.Range(0, threadCount)
.Select(_ => new Thread(() => Run(monitor, iterationsPerThread, incrementsPerIteration)))
.ToList();
var stopwatch = Stopwatch.StartNew();
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
stopwatch.Stop();
Console.WriteLine($"Threads: {threadCount}");
Console.WriteLine($"Elapsed time: {stopwatch.Elapsed.TotalSeconds:0.00}");
}
static void Run(object monitor, int iterations, int increments)
{
for (int i = 0; i < iterations; i++)
{
lock (monitor)
{
for (int j = 0; j < increments; j++)
{
counter++;
}
}
}
}
}
Windows machine dotnet --info:
.NET Command Line Tools (1.0.4)
Product Information: Version: 1.0.4 Commit SHA-1 hash: af1e6684fd
Runtime Environment: OS Name: Windows OS Version: 10.0.15063 OS Platform: Windows RID: win10-x64 Base Path: C:\Program Files\dotnet\sdk\1.0.4
Linux machine dotnet --info:
.NET Command Line Tools (1.0.4)
Product Information: Version: 1.0.4 Commit SHA-1 hash: af1e6684fd
Runtime Environment: OS Name: ubuntu OS Version: 16.04 OS Platform: Linux RID: ubuntu.16.04-x64 Base Path: /usr/share/dotnet/sdk/1.0.4
The Windows machine and Linux machine below have identical CPUs (Core i5 IIRC; I can go into more details if necessary). But as the number of threads goes up, the Windows timing stays roughly constant, but the Linux timing gets much, much worse.
Windows results
dotnet run 10 1000000000 100
Threads: 10
Elapsed time: 10.89
dotnet run 20 1000000000 100
Threads: 20
Elapsed time: 10.86
dotnet run 30 1000000000 100
Threads: 30
Elapsed time: 10.90
dotnet run 40 1000000000 100
Threads: 40
Elapsed time: 10.86
dotnet run 50 1000000000 100
Threads: 50
Elapsed time: 10.85
Linux results (same inputs)
dotnet run 10 1000000000 100
Threads: 10
Elapsed time: 17.05
dotnet run 20 1000000000 100
Threads: 20
Elapsed time: 40.21
dotnet run 30 1000000000 100
Threads: 30
Elapsed time: 113.84
dotnet run 40 1000000000 100
... didn't complete after a long time. Killed to avoid generating pointless heat
I realize this is a very unscientific test - proper diagnostics would be rather more nuanced. But it seems like enough of an alarm bell to be worth raising for smarter people to look closely.
I have no idea whether this is an inherent Linux issue, or a CoreCLR implementation issue. I haven’t yet tried with the .NET Core 2.0 preview, and would rather not, but could do so if necessary.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 22 (18 by maintainers)
I have a few comments/observations
Arguably, WE should not be doing the spinning, WE call a OS lock and that lock should be the thing that figures out this out for everyone. We should not have to do things ‘on top’ of the OS lock.
Anyway, the idea of being unfair at small time scales (which is what Windows priority boost does), and/or not spinning much (which avoids the meltdown), should solve things. Let me look at the code before I make a more specific recommendation. however.
I believe it’s running into a lock convoy problem. Spinners are not able to take the lock while there are waiting threads, and a waiting thread is not able to wake up before a significant delay due to spinners hoarding the CPU.
This works on Windows because when a waiting thread is signaled to wake up, I believe it also gives that thread a priority boost. As such, it’s able to wake up asap and acquire the lock.