runtime: [Perf] Math functions are significantly slower on Ubuntu
The general performance of the `System.Math` and `System.MathF` functions on Ubuntu is poor compared to Windows. I don't have exact numbers for macOS right now, but the last time I ran them they were consistent with the Windows performance (https://github.com/dotnet/coreclr/pull/4847#issuecomment-220888873 – note that those numbers are just for the double-precision functions).
## Perf Numbers
All performance tests are implemented as follows:

- 100,000 iterations are executed
- The times of all iterations are summed to compute the Total Time
- The times of all iterations are averaged to compute the Average Time
- A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.
The improvement below is how much faster the Ubuntu implementation is than the Windows implementation (negative values mean Ubuntu is slower).
Hardware: Azure Standard D3 v2 (4 cores, 14 GB Memory) - Same as Jenkins
Function | Improvement | Execution Time - Windows | Execution Time - Ubuntu |
---|---|---|---|
absdouble | 13.890679% | 0.6268844s | 0.5398059s |
abssingle | 4.31980625% | 0.5741739s | 0.5493707s |
acosdouble | -4.67198461% | 7.7831699s | 8.1467984s |
acossingle | 20.5938899% | 5.9848033s | 4.7522995s |
asindouble | 25.3777524% | 10.4698488s | 7.8128365s |
asinsingle | 40.3112132% | 6.2957351s | 3.7578479s |
atandouble | -44.7552681% | 6.6574728s | 9.6370426s |
atansingle | -10.3842566% | 5.0162551s | 5.5371559s |
atan2double | 16.1878259% | 15.3990765s | 12.9063008s |
atan2single | 13.8083532% | 10.7394211s | 9.2564839s |
ceilingdouble | -26.5576214% | 1.4910876s | 1.887085s |
ceilingsingle | -21.4092256% | 1.3302228s | 1.6150132s |
cosdouble | -119.725433% | 5.5959633s | 12.2957546s |
cossingle | 13.955364% | 4.5950439s | 3.9537888s |
coshdouble | -5.46886513% | 9.9931702s | 10.5396832s |
coshsingle | 18.137377% | 8.5860905s | 7.0287989s |
expdouble | -68.7967436% | 5.1415544s | 8.6787764s |
expsingle | -15.2262683% | 3.7621641s | 4.3350013s |
floordouble | -23.3676423% | 1.4269253s | 1.7603641s |
floorsingle | -11.0591731% | 1.4640751s | 1.6259897s |
logdouble | -215.202268% | 4.5492266s | 14.3392654s |
logsingle | -47.2025199% | 3.7204357s | 5.4765751s |
log10double | -219.553632% | 5.0886356s | 16.2609199s |
log10single | -103.115881% | 4.0351799s | 8.1960912s |
powdouble | -49.8224443% | 26.5690144s | 39.8063468s |
powsingle | -330.357796% | 11.5863701s | 49.862847s |
rounddouble | 7.0386177% | 3.3553449s | 3.119175s |
roundsingle | 1.13396115% | 3.2015118s | 3.1652079s |
sindouble | -70.4327108% | 4.5421357s | 7.741285s |
sinsingle | 9.77591546% | 4.1445295s | 3.7393638s |
sinhdouble | -17.2300933% | 10.204286s | 11.962494s |
sinhsingle | -23.8942492% | 8.9245106s | 11.0569554s |
sqrtdouble | -57.2807184% | 2.5168265s | 3.9584828s |
sqrtsingle | -73.445483% | 1.591266s | 2.759979s |
tandouble | -141.578221% | 5.6910206s | 13.7482663s |
tansingle | -74.7429528% | 4.2112797s | 7.3589145s |
tanhdouble | -162.741805% | 4.83917s | 12.7145226s |
tanhsingle | -93.3864207% | 5.4909087s | 10.6186718s |
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 23 (18 by maintainers)
All performance tests are implemented as follows:

- 100,000 iterations are executed
- The times of all iterations are summed to compute the Total Time
- The times of all iterations are averaged to compute the Average Time
- A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds. The improvement below is how much faster the Ubuntu implementation is than the Windows implementation.

Hardware: Azure Standard D4s v3 (4 cores, 16 GB Memory)
Well, `Vector<T>` operations are translated to SSE instructions, but not all SSE instructions are exposed via `Vector<T>`. If you happen to need such an instruction, you're out of luck. And even if you can generate SSE instructions, that doesn't mean you can match the performance of hand-written assembly code. You may run into perf issues due to less-than-ideal register allocation and the lack of instruction scheduling.
That said, maybe it’s worth a try if someone has enough time to spend on this. Perhaps the perf issues aren’t significant.
Ultimately I think that the worst issue of this approach is that currently there’s no support for SIMD on ARM.