runtime: Function declaration causes completely unrelated code to perform much worse (2x slower)
Hello there. This is my first issue here. I looked for guidelines on submitting issues but couldn’t find any; the closest thing was the Contributing Guide, which isn’t about how to file an issue, so I’ll just use common sense here 😛.
I was benchmarking some code when I noticed some really strange performance deviations in my results. I managed to pinpoint some of the causes, but what’s really going on under the hood is still a mystery to me. In any case, I put together a simplified repro that you can access here. I couldn’t simplify the code further, but I believe it’s simple enough. Note that the “Break” preprocessor variable simply guards the declaration of a function that does nothing, yet for some weird reason it still causes the benchmark to run two times slower (in release builds), as you can see in the results (a rough sketch of the repro’s shape follows the tables):
without defining “Break”:
Method | Mean | Error | StdDev |
---|---|---|---|
First | 28.68 ms | 0.5697 ms | 0.6096 ms |
with “Break” defined (and, just in case, no debugger attached, though it makes no difference):
Method | Mean | Error | StdDev |
---|---|---|---|
First | 56.43 ms | 1.081 ms | 1.245 ms |
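For readers who don’t want to open the linked project, here is a minimal sketch of the shape, not the actual repro: the method names match the issue, but the loop body and iteration count are my placeholders.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Benchmarks
{
#if Break
    // Never called anywhere; merely declaring it makes First ~2x slower.
    public static void DoesNothing() { }
#endif

    [Benchmark]
    public long First()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<Benchmarks>();
}
```

Defining `Break` (e.g. `dotnet build -c Release -p:DefineConstants=Break`) toggles the slow behavior.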
I’ve tried to investigate the issue further, but not only did I fail to remove BenchmarkDotNet (BDN) and get the same results (using a simple Stopwatch), I also couldn’t pinpoint any difference in the IL or JIT output that could justify those results: both the IL and the jitted code for the “First” function (boi, I should’ve renamed it) are exactly the same (link), which blew my mind since the loop is inside that function. @saucecontrol suggested it could be alignment. I don’t know whether misalignment could cause a performance difference this large, but the code is aligned, and all `call` instructions in `Main` target 8-aligned addresses (though they’re calls into BDN).
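For anyone who wants to reproduce that comparison, BDN’s DisassemblyDiagnoser can dump the jitted assembly for each benchmark. A minimal sketch follows; the exact attribute parameters vary between BDN versions:

```csharp
using BenchmarkDotNet.Attributes;

// DisassemblyDiagnoser exports the jitted assembly of each benchmark, which
// lets the fast and slow builds be compared side by side. printSource needs
// full symbols (PDBs) — likely the `true` mentioned in a later comment.
[DisassemblyDiagnoser(printSource: true)]
public class DisasmBenchmarks
{
    [Benchmark]
    public long First()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }
}
```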
My first intuition here is that something is wrong with BDN, but I very much doubt it’s a bug there: it’s the most popular tool for microbenchmarking, and a difference this large wouldn’t have gone unnoticed this long. My guess is that there’s a bug somewhere that BDN somehow triggers and that isn’t trivial to reproduce externally. Where that somewhere is remains the real mystery: since the native machine code is exactly the same, I can only conclude that something really hairy is going on, like the code being invalidated and re-jitted mid-loop, but I honestly don’t think I’m even minimally qualified to make a reasonable guess here.
Thank you.
category:cq theme:alignment skill-level:expert cost:large
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 38 (16 by maintainers)
That would also change the alignment of some assembly stubs. Those are tiny and giving them a generous alignment would probably hurt (code loses locality for no reason).
FWIW, for .NET Native on UWP, we default to 16-byte alignment based on some measurements on microbenchmarks that the UTC backend team did in the past (the C++ compiler also defaults to this number).
We ended up having to offer an opt-out that switches alignment to 4 for customers with tight size budgets (like inbox Windows UWP apps).
I think we should just do the same thing for the CoreRT compiler - default to 16 and have an opt-out command-line option in ILC that makes it use the minimum allowed (pull requests welcome).
We already started aligning methods at 32B boundary in .NET 5 and implemented loop alignment in .NET 6. I am not able to reproduce the slowness.
I believe we can make a case for selectively increasing alignment demand to 32 bytes for some methods (say based on tiering feedback, or perhaps attributes). This is not the first time we’ve seen things like this (#16873).
Having the jit optimize alignment of internal branch targets (#11607) is a likely prerequisite. And internal alignment is tricky to do for 16 bytes, so will be even trickier to do for 32.
Am going to label this as future as I don’t see us making any CoreCLR changes here for 3.0.
@adamsitnik wow, thank you for the input! 😃 I’m relieved somebody else could repro this, tbh… it’s really strange that you can’t reproduce it consistently, though; here it happens every single time. As I said, I could repro the “phenomenon” 😏 without BDN, so there’s definitely no case for a bug report against it. Also, the minor difficulty I was having was because printing the source alongside the asm needs full symbols, and I passed the `true` that activates that by mistake… the tool is solid and very easy to use, you guys did a great job! I’ll take a look at that PR, @adamsitnik, thank you.

I think the new branch that doesn’t run BDN could be more helpful for debugging this, though… that script builds separate binaries for each function-declaration count, and since the code is simpler it should be easier to figure out what exactly is happening differently… I just don’t know what kind of tools would be appropriate for that… maybe something that lets you not only dump the whole jitted code but also step through it and whatnot… and perhaps similar tools for the IL, too.
OK guys, I think I’m getting somewhere with this… I’ve compiled variations with one up to 64 declared functions and managed to repro the slowdown with a simple Stopwatch. If anybody wants to give it a try, I made a handy script that should compile and run all the variations so you can see whether any of them causes the problem on your machine (for now just compare the numbers manually; the slow ones should be easy to spot… automation coming someday™). Later on I’ll run some more tests, but at least now we can rule out BDN as the source of the problem.
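A minimal Stopwatch harness along those lines (my sketch, not the actual script; the loop and run counts are placeholders):

```csharp
using System;
using System.Diagnostics;

public static class Program
{
    // Same loop shape as the BDN benchmark, timed with a plain Stopwatch.
    static long First()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }

    public static void Main()
    {
        First(); // warm-up, so JIT/tiering cost stays out of the timings

        for (int run = 0; run < 5; run++)
        {
            var sw = Stopwatch.StartNew();
            First();
            sw.Stop();
            Console.WriteLine($"run {run}: {sw.ElapsedMilliseconds} ms");
        }
    }
}
```

A build script would then compile this N times, each with a different number of do-nothing function declarations, and compare the printed times.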
It’s really starting to look like a weird alignment issue, although I haven’t checked the addresses. On this machine the slow builds are 26, 27, 28, 29, 30, 31, 46, 47, 62, and 63… there’s no clear pattern that I can see.
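One hedged way to actually check those addresses from managed code is to force a method to be jitted and inspect its entry point. Note that under tiered compilation GetFunctionPointer can return a precode stub rather than the jitted body, and this reports the method entry rather than the loop head, so treat it as a rough signal only:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

public static class AlignmentCheck
{
    public static void Report(Type type, string methodName)
    {
        MethodInfo method = type.GetMethod(methodName,
            BindingFlags.Public | BindingFlags.NonPublic |
            BindingFlags.Static | BindingFlags.Instance)!;

        // Force the method to be compiled so the pointer is meaningful.
        RuntimeHelpers.PrepareMethod(method.MethodHandle);
        IntPtr entry = method.MethodHandle.GetFunctionPointer();

        long addr = entry.ToInt64();
        Console.WriteLine($"{methodName}: 0x{addr:X} (mod 16 = {addr % 16}, mod 32 = {addr % 32})");
    }
}
```

Calling e.g. `AlignmentCheck.Report(typeof(Program), "First")` before the timing loop would show whether the slow builds correlate with differently aligned entry points.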
Boi I just wanted to make some games 😢
cc @adamsitnik
This is a case where it would be nice to have access to “CPI” results (cycles per instruction). If the root cause really is loop alignment, we’d expect the fast and slow versions to run the same number of instructions, and the time difference would then just be from internal CPU stalls.
BDN is already measuring cycles (or a reasonable proxy).
To measure instructions, BDN can make use of hardware counters, so perhaps we can try that out? If you add an `InstructionRetired` hardware counter, what does it tell you?

HW counter access might require running as admin, at least on Windows. And I think it won’t work if you are running in a VM.
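For reference, wiring that up in BDN looks roughly like this (a sketch; `HardwareCounters` uses ETW under the hood, hence the admin requirement on Windows, and it generally won’t work inside a VM):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

// InstructionRetired reports retired instructions per operation. If the fast
// and slow builds retire the same count, the 2x wall-clock gap must come from
// CPU stalls (higher cycles-per-instruction), which points at alignment.
[HardwareCounters(HardwareCounter.InstructionRetired, HardwareCounter.TotalCycles)]
public class CounterBenchmarks
{
    [Benchmark]
    public long First()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }
}
```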