runtime: Possible Performance Regression from 2.0.7 to 2.1-RC1 on Windows Server 2016 w/ Hyper-V Role

@Mike-EEE commented on Wed May 09 2018

Hello .NET Core Team and Community,

I have a very interesting performance issue for you. I have produced some code that demonstrates a performance regression when switching from the .NET Core 2.0.x runtime to 2.1-RC1 on the Windows Server 2016 machine that I use as a Hyper-V server. Please note that I am using Visual Studio 2017 15.7.1 to produce the output here.

To start, the most recent context for this issue is a BenchmarkDotNet issue that you can view here. I have created a series of five benchmarks using the exact same underlying test code. Each has a slight modification that produces different results in 2.0.7 when run on my Windows Server host machine (note: the host machine, not a VM instance). You can see an overview of this information here.

I took the code from the 3rd example, and simply rebuilding it and running it on the 2.1 runtime produced slower results. I have created this 2.1 branch here for your convenience.

(Please do note that the results of each scenario are found in the README.md of their respective branches.)

Additionally, with a small change, I was once again able to produce the faster times seen in 2.0.7, but on the 2.1 runtime. I have published this branch here. This is consistent with the bizarre behavior referenced in the five earlier examples, but it appears here in a different form (or configuration, for lack of a better term) in 2.1.

It is worth noting that I believe this is a gremlin of a problem I have been chasing in .NET Core tooling since 1.1. I have encountered the weirdest circumstances around this behavior while trying to benchmark code using BenchmarkDotNet. You can see my earlier, more complicated attempts at wrestling this beast here for reference: https://github.com/dotnet/BenchmarkDotNet/issues/330 https://github.com/dotnet/BenchmarkDotNet/issues/433

Now that I have been able to show a performance regression between 2.0.7 and 2.1 RC1 with a simple recompile, I believe this could be due to a tooling/SDK problem that has existed for some time. Fortunately, I have been able to capture it with a very simple project this time, and I am more than happy to share it in hopes of finally tracking down this very weird issue. 😄

Finally, I will provide an overview of my Windows Server specifications below.

Summary
		Operating System
			Windows Server 2016 Standard 64-bit
		CPU
			Intel Core i7 4820K @ 3.70GHz	38 °C
			Ivy Bridge-E 22nm Technology
		RAM
			48.0GB DDR3 @ 792MHz (9-9-9-27)
		Motherboard
			ASUSTeK COMPUTER INC. RAMPAGE IV EXTREME (LGA2011)	30 °C
		Graphics
			VW246 (1920x1080@60Hz)
			G246HL (1920x1080@60Hz)
			VW246 (1920x1080@60Hz)
			VW246 (1920x1080@60Hz)
			2047MB NVIDIA GeForce GTX 660 (Gigabyte)	30 °C
		Storage
			238GB Samsung SSD 840 PRO Series (SSD)	25 °C
			892GB Microsoft Storage Space Device (SSD)	28 °C
			463GB Microsoft Storage Space Device (SSD)	28 °C
			1536GB SYNOLOGY iSCSI Storage SCSI Disk Device (iSCSI)
		Optical Drives
			No optical disk drives detected
		Audio
			High Definition Audio Device
Operating System
	Windows Server 2016 Standard 64-bit
	Computer type: Virtual
	Installation Date: 5/8/2017 4:41:12 PM

CPU
		Intel Core i7 4820K
			Cores	4
			Threads	8
			Name	Intel Core i7 4820K
			Code Name	Ivy Bridge-E
			Package	Socket 2011 LGA
			Technology	22nm
			Specification	Intel Core i7-4820K CPU @ 3.70GHz
			Family	6
			Extended Family	6
			Model	E
			Extended Model	3E
			Stepping	4
			Revision	S0/S1
			Instructions	MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, Intel 64, NX, AES, AVX
			Virtualization	Not supported
			Hyperthreading	Supported, Enabled
			Fan Speed	594 RPM
			Rated Bus Speed	3167.2 MHz
			Stock Core Speed	3700 MHz
			Stock Bus Speed	100 MHz
			Average Temperature	38 °C
				Caches
					L1 Data Cache Size	4 x 32 KBytes
					L1 Instructions Cache Size	4 x 32 KBytes
					L2 Unified Cache Size	4 x 256 KBytes
					L3 Unified Cache Size	10240 KBytes
				Cores
						Core 0
							Core Speed	4552.8 MHz
							Multiplier	x 46.0
							Bus Speed	99.0 MHz
							Rated Bus Speed	3167.2 MHz
							Temperature	35 °C
							Threads	APIC ID: 0, 1
						Core 1
							Core Speed	4651.8 MHz
							Multiplier	x 47.0
							Bus Speed	99.0 MHz
							Rated Bus Speed	3167.2 MHz
							Temperature	36 °C
							Threads	APIC ID: 2, 3
						Core 2
							Core Speed	4552.8 MHz
							Multiplier	x 46.0
							Bus Speed	99.0 MHz
							Rated Bus Speed	3167.2 MHz
							Temperature	42 °C
							Threads	APIC ID: 4, 5
						Core 3
							Core Speed	4552.8 MHz
							Multiplier	x 46.0
							Bus Speed	99.0 MHz
							Rated Bus Speed	3167.2 MHz
							Temperature	38 °C
							Threads	APIC ID: 6, 7
RAM
		Memory slots
			Total memory slots	8
			Used memory slots	6
			Free memory slots	2
		Memory
			Type	DDR3
			Size	49152 MBytes
			DRAM Frequency	791.8 MHz
			CAS# Latency (CL)	9 clocks
			RAS# to CAS# Delay (tRCD)	9 clocks
			RAS# Precharge (tRP)	9 clocks
			Cycle Time (tRAS)	27 clocks
			Command Rate (CR)	2T
		Physical Memory
			Memory Usage	64 %
			Total Physical	48 GB
			Available Physical	17 GB
			Total Virtual	96 GB
			Available Virtual	64 GB
		SPD
			Number Of SPD Modules	6
				Slot #1
					Type	DDR3
					Size	8192 MBytes
					Manufacturer	Kingston
					Max Bandwidth	PC3-10700 (667 MHz)
					Part Number	KHX2133C11D3/8GX
					Week/year	42 / 13
					SPD Ext.	XMP
						Timing table
								JEDEC #1
									Frequency	457.1 MHz
									CAS# Latency	6.0
									RAS# To CAS#	6
									RAS# Precharge	6
									tRAS	17
									tRC	23
									Voltage	1.500 V
								JEDEC #2
									Frequency	533.3 MHz
									CAS# Latency	7.0
									RAS# To CAS#	7
									RAS# Precharge	7
									tRAS	20
									tRC	27
									Voltage	1.500 V
								JEDEC #3
									Frequency	609.5 MHz
									CAS# Latency	8.0
									RAS# To CAS#	8
									RAS# Precharge	8
									tRAS	22
									tRC	30
									Voltage	1.500 V
								JEDEC #4
									Frequency	666.7 MHz
									CAS# Latency	9.0
									RAS# To CAS#	9
									RAS# Precharge	9
									tRAS	24
									tRC	33
									Voltage	1.500 V
								XMP-2132
									Frequency	1066 MHz
									CAS# Latency	11.0
									RAS# To CAS#	12
									RAS# Precharge	11
									tRAS	30
									Voltage	1.600 V
				Slots #2-#6
					Identical to Slot #1: Kingston KHX2133C11D3/8GX, DDR3, 8192 MBytes each, same JEDEC and XMP timing tables

Please let me know if you require any further information around this issue or my environment and I will be happy to get it for you.

Thank you for any assistance you can provide towards solving this issue!

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 41 (20 by maintainers)

Most upvoted comments

I created an issue for that: https://github.com/dotnet/BenchmarkDotNet/issues/756

@AndyAyersMS is there any way to enforce the alignment today?

FWIW I did try using new EnvironmentVariable("COMPlus_JitAlignLoops", "1")

Are you still using the in-process toolchain? It might not work in this case.

but it seems that the problem there is different and would require 32-byte alignment.

It would still be worth a try. 16-byte alignment would imply that the loop head is either at the start of a 32-byte chunk or right in the middle; it would never be near the end, as seems to happen in this case.

If you see or suspect you see this happening in programs that are not microbenchmarky, then we’d appreciate hearing more.

There are many microarchitectural issues that can show themselves when the overall performance you’re measuring depends critically on one loop or a few branches, or the speed of one or two memory loads. These exhibit similar behavior patterns – multiple “stable” values for measurements, changing from machine to machine, day to day, etc. Performance can vary based on a whole host of subtle things like ASLR, installation of service packs, the total volume of strings in your environment, the number of characters in the current working directory, which versions of hardware and drivers you have installed, etc.

It is often hard or impossible to control for all this. I once spent a lot of time chasing down an odd regression that turned out to be 4K aliasing between a stack variable and a global variable, which was “fixed” by reordering the declaration of global variables. But thankfully these cases tend to be rare.

Super-robust benchmarking schemes like Stabilizer try to use randomization to effectively average out this “noise”. But these techniques have not yet made it into the mainstream. Perhaps BenchmarkDotNet will get there someday.

So in the meantime, it’s just something we have to live with. Measuring performance by measuring time is imperfect and subject to noise. Sometimes this noise is surprisingly large and influenced in unexpected and subtle ways.

In bigger and more “realistic” applications we generally expect that we get a similar kind of averaging out and so the overall performance should not be sensitive to fine grained details.

I am curious whether the current working code-alignment theory explains the odd discrepancy between executions with and without this folder?

It might. Deleting the folder might involve different code paths that result in different/additional methods being JIT compiled. This will affect the alignment of methods that are being benchmarked.

Is this only impacting Haswell processors?

And likely newer CPU generations such as Skylake.

If so, I might buy a newer laptop specifically to work around this issue. SorryNotSorry, I’m sooooo done dealing with this sucker if I can help it, LOL.

You’d need to buy many laptops with whatever CPUs you care about. That’s probably not feasible. Even for a large company it would be rather difficult to have a ton of hardware around, run all kinds of benchmarks on it, and gather and analyze all the results.

Tiering (in preview in 2.1) gives us a potential future avenue to address the problem. It does not address it today.

x86 is just as vulnerable to this as x64.

You typically don’t need to worry about the impact of code alignment on perf in realistic applications. It usually only shows up prominently in microbenchmarks.

This alignment issue is probably a “feature” on most modern CPUs. It will show up in cases where performance depends critically on one small loop. So it’s less of a problem for realistic applications where there are many key loops and/or the loops do more work per iteration.

We have a few issues open on this already (#11607, dotnet/runtime#9912) and other examples of the annoying impact of not having alignment (#16854).

The subtlety in your case is that innocuous-looking non-code things like attributes can alter alignment.

The challenge we face is that aligning all methods and labels mainly bloats code size and generally won’t improve performance. Alignment is only helpful when the particular bits of code that might end up misaligned are also frequently executed.

Perhaps tiered jitting can come to the rescue.

But that’s pure guesswork

Another, more likely guess is that this is caused by code alignment issues. I can reproduce the unexpected speed-up by simply adding an empty for loop before the actual loop (yeah, the JIT does not eliminate empty for loops 😁):

var i = 0;
var r = new int[_data.Length];
var s = _select;
for (int n = 0; n < 4; n++)
    ;
foreach (string x in _data)
    r[i++] = s(x);
return r;

Can adding/removing a class attribute have the same effect? Yes, it can. If some code is looking for attributes, the attribute constructor needs to run, so it has to be JIT compiled. Methods that are compiled after the attribute constructor runs will end up at a different address than before.

In 5 runs with and without the Description attribute the foreach loop ends up at these addresses:

; fast
00007FF9 D298 2D6C ; loop end at 2D9C
00007FF9 D298 2D6C
00007FF9 D299 2D6C
00007FF9 D29B 2D6C
00007FF9 D299 2D6C
; slow
00007FF9 D29A 27BC ; loop end at 27EC
00007FF9 D299 27BC
00007FF9 D29A 27BC
00007FF9 D29A 27BC
00007FF9 D299 27BC

These addresses are remarkably stable. They vary only by a multiple of 64 KB from run to run, and that’s unlikely to matter.

CPUs (at least Intel ones) decode instructions in 16-byte blocks, so failure to align loops (and more generally, any “hot” branch target) to 16-byte boundaries can lead to performance problems. But this loop is never 16-byte aligned, so that’s likely not the reason for the strange speed-up.

But newer CPUs (Haswell? I need to check) also have loop buffers that cache decoded loop instructions. Those operate on 32-byte blocks. So? The loop is never 32-byte aligned, just like it wasn’t 16-byte aligned.

Yes, but the slow loop (at 27BC) happens to start near the end of a 32-byte block. That might be a problem: a loop iteration has just started executing when, almost immediately, a new 32-byte block needs to be fetched, possibly delaying the execution of the rest of the loop.

I’m not sure if the above is a realistic explanation of what is going on, but there’s a very good chance that the issue is one way or another related to code alignment.
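The 32-byte theory can be sanity-checked with quick arithmetic on the addresses above. This snippet is mine, not from the original discussion; it uses only the low 16 bits of each address, since the run-to-run variation is a multiple of 64 KB and so cannot change the alignment:

```csharp
using System;

class AlignmentCheck
{
    static void Main()
    {
        // Low 16 bits of the observed loop-head addresses from the runs above.
        const int fastLoop = 0x2D6C;
        const int slowLoop = 0x27BC;

        Console.WriteLine(fastLoop % 16); // 12: neither loop is 16-byte aligned
        Console.WriteLine(slowLoop % 16); // 12
        Console.WriteLine(fastLoop % 32); // 12: fast loop starts mid 32-byte block
        Console.WriteLine(slowLoop % 32); // 28: slow loop starts 4 bytes before
                                          //     the next 32-byte fetch boundary
    }
}
```

So the slow loop head really does sit 4 bytes before a 32-byte boundary, while the fast one sits in the middle of its block.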

Further evidence that this is code-alignment related:

  • placing the empty for loop after the foreach loop has no speed-up effect
  • the empty for loop can be replaced by a new DescriptionAttribute(); statement with the same speed-up effect
  • moving the new DescriptionAttribute statement after the foreach loop again has no speed-up effect. This makes it much less likely that the issue is GC/memory related
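The variants in these bullets can be sketched as follows. The class name, field shapes, and sample data are assumptions reconstructed from the snippets in this thread (`_data` as a string[], `_select` as a Func<string, int>); both methods are semantically identical to the plain loop, which is why a position-dependent speed-up points at alignment rather than GC/memory behavior:

```csharp
using System;
using System.ComponentModel;

class ArrayBench
{
    // Assumed benchmark state, reconstructed from the snippets in this thread.
    readonly string[] _data = { "a", "bb", "ccc" };
    readonly Func<string, int> _select = x => x.Length;

    // Speed-up variant: a dead empty loop BEFORE the hot loop (the JIT does
    // not eliminate it) shifts the foreach's code address.
    public int[] WithEmptyLoopBefore()
    {
        var i = 0;
        var r = new int[_data.Length];
        var s = _select;
        for (int n = 0; n < 4; n++)
            ;
        foreach (string x in _data)
            r[i++] = s(x);
        return r;
    }

    // Same speed-up: a lone attribute allocation perturbs alignment the same
    // way. Moving either statement AFTER the foreach removes the effect.
    public int[] WithAttributeCtorBefore()
    {
        var i = 0;
        var r = new int[_data.Length];
        var s = _select;
        new DescriptionAttribute("shift");
        foreach (string x in _data)
            r[i++] = s(x);
        return r;
    }
}
```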

Hrm, I played a bit with this on my Haswell desktop and it does look like there’s something fishy going on, but I don’t know exactly what. The issue seems to be that the benchmark numbers are affected by the presence of attributes on your Array class.

With 2.1 and [Config(typeof(Config))] I get ~24us. If I add back the Description attribute (or any other attribute, it seems), the time drops to ~21us. If I add more attributes, the time goes back to 24us.

With 2.0 it seems to work the opposite way, which would explain why 2.1 seems faster in this case.

Right now I have no idea why those attributes affect the timings. One possibility is that BenchmarkDotNet queries them and the different number of attributes results in a different allocation pattern that has some impact on GC. But that’s pure guesswork.

As before, the issue can be reproduced by using a normal for loop instead of ToArray. In that case it’s easy to check the generated assembly code; it’s identical (including loop head alignment) for a given runtime version.

Until I get to my Surface, I ran the benchmark on an older CPU/OS and got a slightly larger difference between 2.0 and 2.1, large enough that it may matter:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=3312784 Hz, Resolution=301.8609 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |     Error |    StdDev |  Gen 0 | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
 ToArray | 22.57 us | 0.4560 us | 0.1184 us | 9.5215 |  39.13 KB |
 ToArray | 22.17 us | 1.0280 us | 0.2669 us | 9.5215 |  39.13 KB |
 ToArray | 22.34 us | 0.8832 us | 0.2294 us | 9.5215 |  39.13 KB |
 ToArray | 22.28 us | 0.4800 us | 0.1247 us | 9.5215 |  39.13 KB |
 ToArray | 22.36 us | 0.7602 us | 0.1975 us | 9.5215 |  39.13 KB |

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical and 4 physical cores
Frequency=3312784 Hz, Resolution=301.8609 ns, Timer=TSC
.NET Core SDK=2.1.300-rc1-008673
  [Host] : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=5
WarmupCount=5

  Method |     Mean |    Error  |    StdDev |  Gen 0 | Allocated |
-------- |---------:|----------:|----------:|-------:|----------:|
 ToArray | 25.11 us | 1.5240 us | 0.3957 us | 9.5215 |  39.13 KB |
 ToArray | 25.53 us | 2.1760 us | 0.5652 us | 9.5215 |  39.13 KB |
 ToArray | 24.97 us | 0.5880 us | 0.1527 us | 9.5215 |  39.13 KB |
 ToArray | 24.94 us | 0.4506 us | 0.1170 us | 9.5215 |  39.13 KB |
 ToArray | 24.77 us | 2.0160 us | 0.5235 us | 9.5215 |  39.13 KB |

5 repeated runs consistently show a ~3us difference.

The same difference can be reproduced without LINQ’s ToArray:

            var i = 0;
            var r = new int[_data.Length];
            var s = _select;
            foreach (var d in _data)
                r[i++] = s(d);
            return r;

It turns out that there is a small difference in the code generated by the JIT between 2.0 and 2.1:

; 2.0
00007FFC5DA90009 49 63 CF             movsxd      rcx,r15d  
00007FFC5DA9000C 48 8B 54 CB 10       mov         rdx,qword ptr [rbx+rcx*8+10h]  
    55:             {
    56:                 k[i++] = s(x);
00007FFC5DA90011 8D 4F 01             lea         ecx,[rdi+1]  
00007FFC5DA90014 44 8B E1             mov         r12d,ecx  
00007FFC5DA90017 48 8B C6             mov         rax,rsi  
00007FFC5DA9001A 48 8B 48 08          mov         rcx,qword ptr [rax+8]  
00007FFC5DA9001E FF 50 18             call        qword ptr [rax+18h]  
00007FFC5DA90021 41 3B 7E 08          cmp         edi,dword ptr [r14+8]  
00007FFC5DA90025 73 25                jae         00007FFC5DA9004C  
00007FFC5DA90027 48 63 D7             movsxd      rdx,edi  
00007FFC5DA9002A 41 89 44 96 10       mov         dword ptr [r14+rdx*4+10h],eax  
00007FFC5DA9002F 41 FF C7             inc         r15d  
    54:             foreach (string x in _data)
00007FFC5DA90032 41 3B EF             cmp         ebp,r15d  
00007FFC5DA90035 41 8B FC             mov         edi,r12d  
00007FFC5DA90038 7F CF                jg          00007FFC5DA90009 
; 2.1
00007FFC5DAF27BC 49 63 CE             movsxd      rcx,r14d  
00007FFC5DAF27BF 48 8B 54 CB 10       mov         rdx,qword ptr [rbx+rcx*8+10h]  
    55:             {
    56:                 k[i++] = s(x);
00007FFC5DAF27C4 8D 4F 01             lea         ecx,[rdi+1]  
00007FFC5DAF27C7 44 8B E1             mov         r12d,ecx  
00007FFC5DAF27CA 48 8B C6             mov         rax,rsi  
; the lea below wasn't here before
00007FFC5DAF27CD 48 8D 48 08          lea         rcx,[rax+8]  
00007FFC5DAF27D1 48 8B 09             mov         rcx,qword ptr [rcx]  
00007FFC5DAF27D4 FF 50 18             call        qword ptr [rax+18h]  
00007FFC5DAF27D7 3B 7D 08             cmp         edi,dword ptr [rbp+8]  
00007FFC5DAF27DA 73 24                jae         00007FFC5DAF2800  
00007FFC5DAF27DC 48 63 D7             movsxd      rdx,edi  
00007FFC5DAF27DF 89 44 95 10          mov         dword ptr [rbp+rdx*4+10h],eax  
00007FFC5DAF27E3 41 FF C6             inc         r14d  
    54:             foreach (string x in _data)
00007FFC5DAF27E6 45 3B FE             cmp         r15d,r14d  
00007FFC5DAF27E9 41 8B FC             mov         edi,r12d  
00007FFC5DAF27EC 7F CE                jg          00007FFC5DAF27BC 

I don’t know yet if this codegen difference has anything to do with the slowdown. It might, especially considering that it appears to affect only older CPUs.

Even if it is the reason for the 2.0-to-2.1 slowdown, it is unlikely to have anything to do with the artifacts folder or any other difference you have seen.

Anyway, it is not clear from your results: are you running your benchmarks on Windows Server 2016?

Nope, it’s the current Win10 insider preview.

That is, subsequent runs never deviate from this number, save for perhaps 0.1us to 0.25us at most.

Well, in 5 runs I got one result that was more than 1us away from the rest:

 ToArray | 24.26 us | 0.0264 us | 0.0069 us | 12.6343 |  39.13 KB |
 ToArray | 24.39 us | 1.107 us  | 0.2877 us | 12.6343 |  39.13 KB |
 ToArray | 25.34 us | 2.444 us  | 0.6349 us | 12.6343 |  39.13 KB |
 ToArray | 24.29 us | 0.2169 us | 0.0563 us | 12.6343 |  39.13 KB |
 ToArray | 24.25 us | 0.1349 us | 0.0350 us | 12.6343 |  39.13 KB |

I think you’re reading too much into this. A real difference may exist, but it’s too small and the numbers are too noisy to tell. And if you drag in different hardware, OS versions, and even virtual machines, you’ll soon end up nowhere. You may as well try to count the atoms in the universe.

Moving this here to start the investigation. @valenis @jaredpar @livarcocc. I don’t know if this is going to end up being a runtime, compiler, or SDK issue.