runtime: Possible Performance Regression from 2.0.7 to 2.1-RC1 on Windows Server 2016 w/ Hyper-V Role
@Mike-EEE commented on Wed May 09 2018
Hello .NET Core Team and Community,
I have a very interesting performance issue for you here. I have produced some code that demonstrates a regressing difference in performance when switching from the .NET Core 2.0.x runtime to 2.1-RC1 on my Windows Server 2016 machine that I am using as a Hyper-V server. Please note that I am also using Visual Studio 2017 15.7.1 in producing my output here.
To start, this issue is most recently rooted in a Benchmark.NET issue that you can view here. I have created a series of five benchmarks using the exact same underlying test code. Each has a slight modification that produces different results in 2.0.7 when run on my Windows Server host machine (note: the host machine, not a VM instance). You can see an overview of this information here.
I took the code from the 3rd example, and simply rebuilding it and running it on the 2.1 runtime produced slower results. I have created this 2.1 branch here for your convenience.
(Please do note that the results of each scenario are found in the README.md of their respective branches.)
Additionally, with a small change, I was once again able to produce the faster times seen in 2.0.7, but within the 2.1 runtime. I have published this branch here. This is consistent with the bizarre behavior referenced in the five earlier examples, but seen here in a different form (or configuration, for lack of a better term) in 2.1.
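To give a flavor of the setup, here is a minimal sketch of the kind of benchmark involved. The exact code lives in the branches linked above; the class name, the `_source` field, and the workload here are illustrative assumptions only.

```csharp
using System.ComponentModel;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Illustrative only -- the real code is in the linked branches. The
// discussion below shows that toggling inert metadata such as the
// [Description] attribute is enough to flip the fast/slow timings.
[Description("Hypothetical example")]
public class ArrayBenchmarks
{
    private readonly int[] _source = Enumerable.Range(0, 10_000).ToArray();

    [Benchmark]
    public int[] CopyWithToArray() => _source.ToArray();

    [Benchmark]
    public int Sum()
    {
        var total = 0;
        foreach (var item in _source)
            total += item;
        return total;
    }
}

internal static class Program
{
    private static void Main() => BenchmarkRunner.Run<ArrayBenchmarks>();
}
```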
It is worth noting that I believe this is a gremlin of a problem I have been chasing in .NET Core tooling since 1.1. I have encountered the weirdest circumstances around this behavior while trying to benchmark code using Benchmark.NET. You can see my earlier, more complicated attempts at wrestling this beast here for reference: https://github.com/dotnet/BenchmarkDotNet/issues/330 https://github.com/dotnet/BenchmarkDotNet/issues/433
Now that I have been able to show a performance regression between 2.0.7 and 2.1-RC1 with a simple recompile, I believe this could be due to a tooling/SDK problem that has been around for some time. Fortunately, I have been able to capture it with a very simple project this time, and I am more than happy to share it now in hopes of finally tracking down this very weird issue. 😄
Finally, I will provide an overview of my Windows Server specifications here.
Summary
Operating System
Windows Server 2016 Standard 64-bit
CPU
Intel Core i7 4820K @ 3.70GHz 38 °C
Ivy Bridge-E 22nm Technology
RAM
48.0GB DDR3 @ 792MHz (9-9-9-27)
Motherboard
ASUSTeK COMPUTER INC. RAMPAGE IV EXTREME (LGA2011) 30 °C
Graphics
VW246 (1920x1080@60Hz)
G246HL (1920x1080@60Hz)
VW246 (1920x1080@60Hz)
VW246 (1920x1080@60Hz)
2047MB NVIDIA GeForce GTX 660 (Gigabyte) 30 °C
Storage
238GB Samsung SSD 840 PRO Series (SSD) 25 °C
892GB Microsoft Storage Space Device (SSD) 28 °C
463GB Microsoft Storage Space Device (SSD) 28 °C
1536GB SYNOLOGY iSCSI Storage SCSI Disk Device (iSCSI)
Optical Drives
No optical disk drives detected
Audio
High Definition Audio Device
Operating System
Windows Server 2016 Standard 64-bit
Computer type: Virtual
Installation Date: 5/8/2017 4:41:12 PM
CPU
Intel Core i7 4820K
Cores 4
Threads 8
Name Intel Core i7 4820K
Code Name Ivy Bridge-E
Package Socket 2011 LGA
Technology 22nm
Specification Intel Core i7-4820K CPU @ 3.70GHz
Family 6
Extended Family 6
Model E
Extended Model 3E
Stepping 4
Revision S0/S1
Instructions MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, Intel 64, NX, AES, AVX
Virtualization Not supported
Hyperthreading Supported, Enabled
Fan Speed 594 RPM
Rated Bus Speed 3167.2 MHz
Stock Core Speed 3700 MHz
Stock Bus Speed 100 MHz
Average Temperature 38 °C
Caches
L1 Data Cache Size 4 x 32 KBytes
L1 Instructions Cache Size 4 x 32 KBytes
L2 Unified Cache Size 4 x 256 KBytes
L3 Unified Cache Size 10240 KBytes
Cores
Core 0
Core Speed 4552.8 MHz
Multiplier x 46.0
Bus Speed 99.0 MHz
Rated Bus Speed 3167.2 MHz
Temperature 35 °C
Threads APIC ID: 0, 1
Core 1
Core Speed 4651.8 MHz
Multiplier x 47.0
Bus Speed 99.0 MHz
Rated Bus Speed 3167.2 MHz
Temperature 36 °C
Threads APIC ID: 2, 3
Core 2
Core Speed 4552.8 MHz
Multiplier x 46.0
Bus Speed 99.0 MHz
Rated Bus Speed 3167.2 MHz
Temperature 42 °C
Threads APIC ID: 4, 5
Core 3
Core Speed 4552.8 MHz
Multiplier x 46.0
Bus Speed 99.0 MHz
Rated Bus Speed 3167.2 MHz
Temperature 38 °C
Threads APIC ID: 6, 7
RAM
Memory slots
Total memory slots 8
Used memory slots 6
Free memory slots 2
Memory
Type DDR3
Size 49152 MBytes
DRAM Frequency 791.8 MHz
CAS# Latency (CL) 9 clocks
RAS# to CAS# Delay (tRCD) 9 clocks
RAS# Precharge (tRP) 9 clocks
Cycle Time (tRAS) 27 clocks
Command Rate (CR) 2T
Physical Memory
Memory Usage 64 %
Total Physical 48 GB
Available Physical 17 GB
Total Virtual 96 GB
Available Virtual 64 GB
SPD
Number Of SPD Modules 6
Slot #1
Type DDR3
Size 8192 MBytes
Manufacturer Kingston
Max Bandwidth PC3-10700 (667 MHz)
Part Number KHX2133C11D3/8GX
Week/year 42 / 13
SPD Ext. XMP
Timing table
JEDEC #1
Frequency 457.1 MHz
CAS# Latency 6.0
RAS# To CAS# 6
RAS# Precharge 6
tRAS 17
tRC 23
Voltage 1.500 V
JEDEC #2
Frequency 533.3 MHz
CAS# Latency 7.0
RAS# To CAS# 7
RAS# Precharge 7
tRAS 20
tRC 27
Voltage 1.500 V
JEDEC #3
Frequency 609.5 MHz
CAS# Latency 8.0
RAS# To CAS# 8
RAS# Precharge 8
tRAS 22
tRC 30
Voltage 1.500 V
JEDEC #4
Frequency 666.7 MHz
CAS# Latency 9.0
RAS# To CAS# 9
RAS# Precharge 9
tRAS 24
tRC 33
Voltage 1.500 V
XMP-2132
Frequency 1066 MHz
CAS# Latency 11.0
RAS# To CAS# 12
RAS# Precharge 11
tRAS 30
Voltage 1.600 V
Slots #2 through #6
Identical to Slot #1: same type (DDR3), size (8192 MBytes), manufacturer (Kingston), part number (KHX2133C11D3/8GX), week/year (42 / 13), SPD Ext. (XMP), and timing tables.
Please let me know if you require any further information around this issue or my environment and I will be happy to get it for you.
Thank you for any assistance you can provide towards solving this issue!
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 41 (20 by maintainers)
I created an issue for that https://github.com/dotnet/BenchmarkDotNet/issues/756
@AndyAyersMS is there any way to enforce the alignment today?
Are you still using the in-process toolchain? It might not work in this case.
It would still be worth a try. 16 byte alignment would imply that the loop head is either at the start of a 32 byte chunk or right in the middle; it will never be near the end like it seems to happen in this case.
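A trivial way to convince yourself of that arithmetic (a standalone check, not part of the benchmark code):

```csharp
using System;

internal static class AlignmentCheck
{
    private static void Main()
    {
        // Every 16-byte-aligned address falls at offset 0 or 16 of its
        // 32-byte block -- the start or the exact middle, never near the end.
        for (ulong address = 0; address < 128; address += 16)
            Console.WriteLine($"0x{address:X4} -> offset {address % 32} within its 32-byte block");
    }
}
```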
If you see or suspect you see this happening in programs that are not microbenchmarky, then we’d appreciate hearing more.
There are many microarchitectural issues that can show themselves when the overall performance you’re measuring depends critically on one loop or a few branches, or the speed of one or two memory loads. These exhibit similar behavior patterns – multiple “stable” values for measurements, changing from machine to machine, day to day, etc. Performance can vary based on a whole host of subtle things like ASLR, installation of service packs, the total volume of strings in your environment, the number of characters in the current working directory, which versions of hardware and drivers you have installed, etc.
It is often hard or impossible to control for all this. I once spent a lot of time chasing down an odd regression that turned out to be 4K aliasing between a stack variable and a global variable, which was “fixed” by reordering the declarations of the global variables. But thankfully these cases tend to be rare.
Super-robust benchmarking schemes like Stabilizer try to use randomization to effectively average out this “noise”. But these techniques have not yet made it into the mainstream. Perhaps BenchmarkDotNet will get there someday.
So in the meantime, it’s just something we have to live with. Measuring performance by measuring time is imperfect and subject to noise. Sometimes this noise is surprisingly large and influenced in unexpected and subtle ways.
In bigger and more “realistic” applications we generally expect that we get a similar kind of averaging out and so the overall performance should not be sensitive to fine grained details.
It might. Deleting the folder might involve different code paths that result in different/additional methods being JIT compiled. This will affect the alignment of methods that are being benchmarked.
And likely newer CPU generations such as Skylake.
You’d need to buy many laptops with whatever CPUs you care about. That’s probably not feasible. Even for a large company it would be rather difficult to have a ton of hardware around, run all kinds of benchmarks on it, and gather and analyze all the results.
Tiering (in preview in 2.1) gives us a potential future avenue to address the problem. It does not address it today.
x86 is just as vulnerable to this as x64.
You typically don’t need to worry about the impact of code alignment on perf in realistic applications. It usually only shows up prominently in microbenchmarks.
This alignment issue is probably a “feature” on most modern CPUs. It will show up in cases where performance depends critically on one small loop. So it’s less of a problem for realistic applications where there are many key loops and/or the loops do more work per iteration.
We have a few issues open on this already (#11607, dotnet/runtime#9912) and other examples of the annoying impact of not having alignment (#16854).
The subtlety here in your case is that innocuous looking non-code things like attributes can alter alignment.
The challenge we face is that aligning all methods and labels mainly bloats code size and generally won’t improve performance. Alignment is only helpful when the particular bits of code that might end up misaligned are also frequently executed.
Perhaps tiered jitting can come to the rescue.
Another, more likely guess is that this is caused by code alignment issues. I can reproduce the unexpected speed-up by simply adding an empty for loop before the actual loop (yeah, the JIT does not eliminate empty for loops 😁).
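A minimal sketch of that experiment (the actual benchmark is in the linked branches; the `_source` field and the body of `Sum` are illustrative assumptions):

```csharp
[Benchmark]
public int Sum()
{
    // An empty loop the JIT does not remove: it emits a few extra
    // instructions, shifting the address -- and hence the alignment --
    // of the "real" loop that follows.
    for (var i = 0; i < 1; i++) { }

    var total = 0;
    for (var i = 0; i < _source.Length; i++)
        total += _source[i];
    return total;
}
```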
Can adding/removing a class attribute have the same effect? Yes, it can. If some code is looking for attributes, the attribute constructor needs to run, so it has to be JIT compiled. Methods that are compiled after the attribute constructor runs will end up at a different address than before.
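A hypothetical illustration of that mechanism (BenchmarkDotNet's actual attribute handling may differ):

```csharp
using System;
using System.ComponentModel;
using System.Reflection;

internal static class AttributeProbe
{
    // Reading the attribute forces DescriptionAttribute's constructor
    // to be JIT compiled before it runs; that one extra compiled method
    // shifts the addresses (and loop alignment) of every method the JIT
    // compiles afterwards.
    public static void Dump(Type type)
    {
        foreach (var attribute in type.GetCustomAttributes<DescriptionAttribute>())
            Console.WriteLine(attribute.Description);
    }
}
```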
In 5 runs with and without the `Description` attribute, the `foreach` loop ends up at remarkably stable addresses: they vary only by a multiple of 64KB from run to run, and that’s unlikely to matter.
CPUs (at least Intel ones) decode instructions in 16 byte blocks, so failure to align loops (and more generally, any “hot” branch target) to 16 byte boundaries can lead to performance problems. But this loop is never 16 byte aligned, so that’s likely not the reason for the strange speed-up.
But newer CPUs (Haswell? I need to check) also have some loop buffers that cache decoded loop instructions. Those operate on 32 byte blocks. So? The loop is never 32 byte aligned just like it wasn’t 16 byte aligned.
Yes, but the slow loop (at 27BC) happens to start near the end of a 32 byte block: 0x27BC mod 32 = 28, just 4 bytes before the boundary. That might be a problem; a loop iteration has just started executing and then almost immediately a new 32 byte block needs to be fetched, possibly delaying the execution of the rest of the loop.
I’m not sure if the above is a realistic explanation of what is going on, but there’s a very good chance that the issue is one way or another related to code alignment.
Further evidence that this is code alignment related:
- The attribute can be replaced with a `new DescriptionAttribute();` statement before the foreach loop, with the same speed-up effect.
- A `new DescriptionAttribute();` statement after the foreach loop again has no speed-up effect. This makes it much less likely that the issue is GC/memory related.

Hrm, I played a bit with this on my Haswell desktop and it does look like there’s something fishy going on, but I don’t know exactly what. The issue seems to be that the benchmark numbers are impacted by the presence of attributes on your `Array` class.

With 2.1 and `[Config(typeof(Config))]` I get ~24us. If I add back the `Description` attribute (or any other attribute, it seems) then the time drops to ~21us. If I add more attributes the time goes back to 24us.

With 2.0 it seems to work the opposite way, and that would explain how come 2.1 seems faster in this case.
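A sketch of the placement experiment from the bullet list above (illustrative, reusing the hypothetical `_source` field from the earlier sketches):

```csharp
[Benchmark]
public int SumWithCtorBeforeLoop()
{
    // The constructor call emits a few instructions ahead of the loop,
    // nudging the loop head to a different (here, luckier) alignment.
    // Moved after the loop, it leaves the loop head where it was and
    // the speed-up disappears.
    new DescriptionAttribute();

    var total = 0;
    foreach (var item in _source)
        total += item;
    return total;
}
```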
Right now I have no idea why those attributes impact the timings. One possibility is that BenchmarkDotNet queries them, and the different number of attributes results in a different allocation pattern that has a certain impact on GC. But that’s pure guesswork.
As before, the issue can be reproduced by using a normal for loop instead of `ToArray`. In that case it’s easy to check the generated assembly code; it’s identical (including loop head alignment) for a given runtime version.

Until I get to my Surface, I ran the benchmark on an older CPU/OS and got a slightly larger difference between 2.0 and 2.1, large enough that it may matter.
5 repeated runs persistently show a 3us difference.
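For reference, a sketch of the “normal for loop instead of `ToArray`” variant mentioned above (illustrative, again reusing the hypothetical `_source` field):

```csharp
[Benchmark]
public int[] CopyWithForLoop()
{
    // Hand-written equivalent of _source.ToArray(): with no LINQ in the
    // way, the generated assembly (and its loop-head alignment) is easy
    // to inspect and compare between runtime versions.
    var result = new int[_source.Length];
    for (var i = 0; i < _source.Length; i++)
        result[i] = _source[i];
    return result;
}
```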
The same difference can be reproduced without LINQ’s `ToArray`.

It turns out that there is a small difference in the code generated by the JIT between 2.0 and 2.1.
I don’t know yet if this codegen difference has anything to do with the slowdown. It might, especially considering that it appears to impact only older CPUs.
Even if it is the reason for the 2.0-2.1 slowdown this is unlikely to have anything to do with the artifacts folder or any other difference you have seen.
Nope, it’s the current Win10 insider preview.
Well, in 5 runs I got only one result that was more than 1us away from the rest.
I think you’re reading too much into this. A real difference may exist, but it’s too small and the numbers are too noisy to tell. And if you drag in different hardware, OS versions, and even virtual machines, then you’ll soon end up nowhere. You may as well try to count the atoms in the universe.
Moving this here to start the investigation. @valenis @jaredpar @livarcocc. I don’t know if this is going to end up being a runtime, compiler, or SDK issue.