runtime: CRT Pow function has bad performance on Windows

During benchmarking AoS/SoA ray-tracer https://github.com/dotnet/coreclr/pull/18839, we found that the Vector3 benchmark (RayTracer) is much slower on Windows than Linux.

Execution time Windows Linux
Baseline (RayTracer ) 6.00s 4.13s
PacketTracer 1.20s 1.35s
Performance Gains 5.00x 3.06x

According to VTune analysis, this gap is caused by the CRT math library, which RayTracer uses Math.Pow at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153

Windows

image

Linux

image

On the left side (AoS means RayTracer), we can see ucrtbase.dll on Windows has much more time consuming and instruction retired than libm-2.23.so on Linux.

The data is collected on Core i9 + VS2017, but Core i7+ VS2015 has the same performance gap.

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 2
  • Comments: 18 (17 by maintainers)

Most upvoted comments

Looks like glibc recently (07 AUG 2017) made a few changes: https://sourceware.org/git/?p=glibc.git;a=commit;h=57a72fa3502673754d14707da02c7c44e83b8d20

Namely, they still use the IBM Accurate Mathematical Library as their root source code, however, they now have some new logic which additionally compiles that code with the -mfma and -mavx2 flags, which provides some automatic transformations/optimizations (it looks like they do a cached CPUID check at runtime and jump to the appropriate code).

Additionally, it looks like, since the calling conventions map up, they generally end up calling libm-2.27.so~__pow directly, rather than having an intermediate call through COMDouble::Pow.

CC. @CarolEidt, @AndyAyersMS, @jkotas