runtime: [Perf] Math functions are significantly slower on Ubuntu

The general performance of the System.Math and System.MathF functions on Ubuntu is significantly worse than on Windows.

I don’t have exact numbers for macOS right now, but the last time I ran them they were consistent with the Windows performance (https://github.com/dotnet/coreclr/pull/4847#issuecomment-220888873 – note that those numbers are just for the double-precision functions).

Perf Numbers

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are summed to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes a simple operation using the function under test 5,000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

The improvement below indicates how much faster the Ubuntu implementation is than the Windows implementation; negative values mean Ubuntu is slower.

Hardware: Azure Standard D3 v2 (4 cores, 14 GB Memory) - Same as Jenkins

Function Improvement (%) Execution Time - Windows (s) Execution Time - Ubuntu (s)
absdouble 13.890679% 0.6268844s 0.5398059s
abssingle 4.31980625% 0.5741739s 0.5493707s
acosdouble -4.67198461% 7.7831699s 8.1467984s
acossingle 20.5938899% 5.9848033s 4.7522995s
asindouble 25.3777524% 10.4698488s 7.8128365s
asinsingle 40.3112132% 6.2957351s 3.7578479s
atandouble -44.7552681% 6.6574728s 9.6370426s
atansingle -10.3842566% 5.0162551s 5.5371559s
atan2double 16.1878259% 15.3990765s 12.9063008s
atan2single 13.8083532% 10.7394211s 9.2564839s
ceilingdouble -26.5576214% 1.4910876s 1.887085s
ceilingsingle -21.4092256% 1.3302228s 1.6150132s
cosdouble -119.725433% 5.5959633s 12.2957546s
cossingle 13.955364% 4.5950439s 3.9537888s
coshdouble -5.46886513% 9.9931702s 10.5396832s
coshsingle 18.137377% 8.5860905s 7.0287989s
expdouble -68.7967436% 5.1415544s 8.6787764s
expsingle -15.2262683% 3.7621641s 4.3350013s
floordouble -23.3676423% 1.4269253s 1.7603641s
floorsingle -11.0591731% 1.4640751s 1.6259897s
logdouble -215.202268% 4.5492266s 14.3392654s
logsingle -47.2025199% 3.7204357s 5.4765751s
log10double -219.553632% 5.0886356s 16.2609199s
log10single -103.115881% 4.0351799s 8.1960912s
powdouble -49.8224443% 26.5690144s 39.8063468s
powsingle -330.357796% 11.5863701s 49.862847s
rounddouble 7.0386177% 3.3553449s 3.119175s
roundsingle 1.13396115% 3.2015118s 3.1652079s
sindouble -70.4327108% 4.5421357s 7.741285s
sinsingle 9.77591546% 4.1445295s 3.7393638s
sinhdouble -17.2300933% 10.204286s 11.962494s
sinhsingle -23.8942492% 8.9245106s 11.0569554s
sqrtdouble -57.2807184% 2.5168265s 3.9584828s
sqrtsingle -73.445483% 1.591266s 2.759979s
tandouble -141.578221% 5.6910206s 13.7482663s
tansingle -74.7429528% 4.2112797s 7.3589145s
tanhdouble -162.741805% 4.83917s 12.7145226s
tanhsingle -93.3864207% 5.4909087s 10.6186718s

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 23 (18 by maintainers)

Most upvoted comments

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are summed to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes a simple operation using the function under test 5,000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

The improvement below indicates how much faster the Ubuntu implementation is than the Windows implementation; negative values mean Ubuntu is slower.

Hardware: Azure Standard D4s v3 (4 cores, 16 GB Memory)

Function Improvement (%) Execution Time - Windows (s) Execution Time - Ubuntu (s)
absdouble 3.546181638 0.5606763 0.5407937
abssingle 8.480824498 0.5905947 0.5405074
acosdouble 23.23729332 8.8021891 6.7567986
acossingle 36.07494481 6.8471822 4.377065
asindouble 32.04241391 9.5848378 6.5136244
asinsingle 48.45349467 7.5545682 3.8941159
atandouble -27.47165151 6.9858301 8.904953
atansingle 15.17043945 5.8293196 4.9449862
atan2double 25.37952497 15.9935921 11.9344944
atan2single 12.95986626 12.0436945 10.4828478
ceilingdouble 8.114504593 0.8137071 0.7476788
ceilingsingle 1.43315031 0.7909917 0.7796556
cosdouble -27.04654657 6.9580288 8.8399353
cossingle 45.21672442 6.0522668 3.31563
coshdouble 22.86077168 11.1648917 8.6125113
coshsingle 41.8002351 10.5539471 6.1423724
expdouble -15.65069716 7.5727847 8.7579783
expsingle 31.85693379 5.0553795 3.4448906
floordouble 6.572571637 0.783357 0.7318703
floorsingle 7.733552501 0.8161773 0.7530578
logdouble -71.7799955 5.7604477 9.8952968
logsingle 39.33867461 5.1587676 3.1293768
log10double -115.8565365 5.986779 12.9228538
log10single -11.42509124 4.8094233 5.3589043
powdouble -12.82530235 25.8802195 29.1994359
powsingle 37.09252843 9.8087886 6.1704609
rounddouble 7.914597043 0.8141779 0.749739
roundsingle 7.193275175 0.7884934 0.7317749
sindouble -31.31642635 5.5328101 7.2654885
sinsingle 33.30900612 5.0377057 3.359696
sinhdouble 7.783566656 10.7251256 9.8903283
sinhsingle -8.468141754 9.1540461 9.9292237
sqrtdouble 3.148131859 1.4699289 1.4236536
sqrtsingle -7.038505206 0.772441 0.8268093
tandouble -84.81924424 6.703659 12.3896519
tansingle -43.26011744 4.9282832 7.0602643
tanhdouble -78.9103324 5.3309266 9.5375785
tanhsingle -56.22609881 6.0341919 9.4269826

Doesn’t the C# Vector<T> type use compiler intrinsics written in assembler?

Well, Vector<T> operations are translated to SSE instructions, but not all SSE instructions are exposed via Vector<T>. If you happen to need such an instruction then you’re out of luck…

And even if you can generate SSE instructions it doesn’t mean you can match the performance of hand written assembly code. You may run into perf issues due to less than ideal register allocation and lack of instruction scheduling.

That said, maybe it’s worth a try if someone has enough time to spend on this. Perhaps the perf issues aren’t significant.

Ultimately I think that the worst issue of this approach is that currently there’s no support for SIMD on ARM.