runtime: [Perf] Math functions are significantly slower on Ubuntu
The general performance of the `System.Math` and `System.MathF` functions on Ubuntu is poor compared to Windows. I don't have exact numbers for macOS right now, but the last time I ran them they were consistent with the Windows performance (https://github.com/dotnet/coreclr/pull/4847#issuecomment-220888873 – note that those numbers are just for the double-precision functions).
## Perf Numbers
All performance tests are implemented as follows:

- 100,000 iterations are executed
- The times of all iterations are summed to compute the Total Time
- The times of all iterations are averaged to compute the Average Time
- A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.
The improvement below is how much faster the Ubuntu implementation is than the Windows implementation (negative values mean Ubuntu is slower).
Hardware: Azure Standard D3 v2 (4 cores, 14 GB Memory) - Same as Jenkins
Function | Improvement | Execution Time - Windows | Execution Time - Ubuntu |
---|---|---|---|
absdouble | 13.890679% | 0.6268844s | 0.5398059s |
abssingle | 4.31980625% | 0.5741739s | 0.5493707s |
acosdouble | -4.67198461% | 7.7831699s | 8.1467984s |
acossingle | 20.5938899% | 5.9848033s | 4.7522995s |
asindouble | 25.3777524% | 10.4698488s | 7.8128365s |
asinsingle | 40.3112132% | 6.2957351s | 3.7578479s |
atandouble | -44.7552681% | 6.6574728s | 9.6370426s |
atansingle | -10.3842566% | 5.0162551s | 5.5371559s |
atan2double | 16.1878259% | 15.3990765s | 12.9063008s |
atan2single | 13.8083532% | 10.7394211s | 9.2564839s |
ceilingdouble | -26.5576214% | 1.4910876s | 1.887085s |
ceilingsingle | -21.4092256% | 1.3302228s | 1.6150132s |
cosdouble | -119.725433% | 5.5959633s | 12.2957546s |
cossingle | 13.955364% | 4.5950439s | 3.9537888s |
coshdouble | -5.46886513% | 9.9931702s | 10.5396832s |
coshsingle | 18.137377% | 8.5860905s | 7.0287989s |
expdouble | -68.7967436% | 5.1415544s | 8.6787764s |
expsingle | -15.2262683% | 3.7621641s | 4.3350013s |
floordouble | -23.3676423% | 1.4269253s | 1.7603641s |
floorsingle | -11.0591731% | 1.4640751s | 1.6259897s |
logdouble | -215.202268% | 4.5492266s | 14.3392654s |
logsingle | -47.2025199% | 3.7204357s | 5.4765751s |
log10double | -219.553632% | 5.0886356s | 16.2609199s |
log10single | -103.115881% | 4.0351799s | 8.1960912s |
powdouble | -49.8224443% | 26.5690144s | 39.8063468s |
powsingle | -330.357796% | 11.5863701s | 49.862847s |
rounddouble | 7.0386177% | 3.3553449s | 3.119175s |
roundsingle | 1.13396115% | 3.2015118s | 3.1652079s |
sindouble | -70.4327108% | 4.5421357s | 7.741285s |
sinsingle | 9.77591546% | 4.1445295s | 3.7393638s |
sinhdouble | -17.2300933% | 10.204286s | 11.962494s |
sinhsingle | -23.8942492% | 8.9245106s | 11.0569554s |
sqrtdouble | -57.2807184% | 2.5168265s | 3.9584828s |
sqrtsingle | -73.445483% | 1.591266s | 2.759979s |
tandouble | -141.578221% | 5.6910206s | 13.7482663s |
tansingle | -74.7429528% | 4.2112797s | 7.3589145s |
tanhdouble | -162.741805% | 4.83917s | 12.7145226s |
tanhsingle | -93.3864207% | 5.4909087s | 10.6186718s |
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 23 (18 by maintainers)
All performance tests are implemented as follows:

- 100,000 iterations are executed
- The times of all iterations are summed to compute the Total Time
- The times of all iterations are averaged to compute the Average Time
- A single iteration executes some simple operation, using the function under test, 5000 times

The execution time below is the Total Time for all 100,000 iterations, measured in seconds. The improvement below is how much faster the Ubuntu implementation is than the Windows implementation.

Hardware: Azure Standard D4s v3 (4 cores, 16 GB Memory)
Well, `Vector<T>` operations are translated to SSE instructions, but not all SSE instructions are exposed via `Vector<T>`. If you happen to need such an instruction, you're out of luck. And even if you can generate SSE instructions, that doesn't mean you can match the performance of hand-written assembly code. You may run into perf issues due to less-than-ideal register allocation and the lack of instruction scheduling.
That said, maybe it’s worth a try if someone has enough time to spend on this. Perhaps the perf issues aren’t significant.
Ultimately I think that the worst issue of this approach is that currently there’s no support for SIMD on ARM.