runtime: CRT Pow function has bad performance on Windows
While benchmarking the AoS/SoA ray tracer (https://github.com/dotnet/coreclr/pull/18839), we found that the Vector3
benchmark (RayTracer) is much slower on Windows than on Linux.
| Execution time | Windows | Linux |
|---|---|---|
| Baseline (RayTracer) | 6.00s | 4.13s |
| PacketTracer | 1.20s | 1.35s |
| Performance gain (Baseline / PacketTracer) | 5.00x | 3.06x |
According to VTune analysis, this gap is caused by the CRT math library, which RayTracer reaches through its Math.Pow call
at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153
[Figure: VTune profiles for Windows and Linux]
On the left side (AoS means RayTracer), we can see that `ucrtbase.dll` on Windows consumes much more time and retires many more instructions than `libm-2.23.so` on Linux.
The data was collected on a Core i9 with VS2017, but a Core i7 with VS2015 shows the same performance gap.
About this issue
- State: open
- Created 6 years ago
- Reactions: 2
- Comments: 18 (17 by maintainers)
Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20
Namely, they still use the IBM Accurate Mathematical Library as their root source code; however, they now have new logic that additionally compiles that code with the `-mfma` and `-mavx2` flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).
Additionally, it looks like, since the calling conventions line up, calls on Linux generally end up going to `libm-2.27.so~__pow` directly, rather than having an intermediate call through `COMDouble::Pow`.
CC. @CarolEidt, @AndyAyersMS, @jkotas
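The "cached CPUID check, then jump to the appropriate code" pattern described above can be sketched roughly as follows. This only illustrates the general technique and is not glibc's actual mechanism, which may differ. All names here (`muladd_generic`, `muladd_fma_avx2`, `dispatched_muladd`, `selected_impl`) are hypothetical, and the two variants are identical stand-ins for the same source built with and without `-mfma -mavx2`. The CPU check uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so other compilers or architectures fall back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

typedef double (*muladd_fn)(double, double, double);

/* Baseline build of the kernel. */
static double muladd_generic(double a, double b, double c) {
    return a * b + c;
}

/* Stand-in for the same source compiled separately with -mfma -mavx2.
 * In this sketch the body is identical; only the selection logic matters. */
static double muladd_fma_avx2(double a, double b, double c) {
    return a * b + c;
}

/* Cached choice; NULL means "not resolved yet". Not thread-safe. */
static muladd_fn selected_impl = NULL;

/* Returns nonzero when the CPU reports both FMA and AVX2 support. */
static int cpu_has_fma_and_avx2(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") && __builtin_cpu_supports("avx2");
#else
    return 0; /* Assumption: unknown platform, use the generic path. */
#endif
}

/* Resolves the implementation on the first call, then reuses the cache. */
double dispatched_muladd(double a, double b, double c) {
    if (selected_impl == NULL) {
        selected_impl = cpu_has_fma_and_avx2() ? muladd_fma_avx2
                                               : muladd_generic;
    }
    assert(selected_impl != NULL);
    return selected_impl(a, b, c);
}
```

Only the first call pays for the feature check. Later calls cost one pointer load and an indirect call, which is why this kind of dispatch adds little overhead on hot paths such as `pow`.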