OpenBLAS: `LAPACKE_dsytrd` performance degradation with multiple threads on Windows
(This follows from https://github.com/JuliaLang/julia/issues/47211)
On Windows, `LAPACKE_dsytrd` is considerably slower in multi-threaded mode than in single-threaded mode. The test code is here. For a 50x50 matrix, the results on a Windows 10 machine with an i7-6700K (4 cores, 8 threads) are:
> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 50 1000
dsytrd:
Average of 1000 runs for a 50x50 matrix = 124.9 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 203.1 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 50 1000
dsytrd:
Average of 1000 runs for a 50x50 matrix = 1296.5 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 249.9 us
I’ve also included `LAPACKE_zhetrd` for comparison. With 4 threads, dsytrd (real symmetric) is even slower than zhetrd (complex Hermitian), which seems counterintuitive.
For larger matrices, dsytrd becomes faster than zhetrd, but multithreading still leads to a slowdown:
> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 10831.1 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 32492.4 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 30951.7 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 59309.2 us
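To put the Windows timings above in proportion, the slowdown can be expressed as the multi-threaded time divided by the single-threaded time. The following is a minimal sketch of that arithmetic; the helper names `slowdown_factor`, `report`, and `report_windows_timings` are my own and are not part of the test code.

```c
#include <stdio.h>

/* Ratio of multi-threaded to single-threaded run time.
 * A value above 1.0 means that adding threads made the call slower. */
double slowdown_factor(double single_us, double multi_us)
{
    return multi_us / single_us;
}

/* Print the factor for one routine and matrix size (illustrative helper). */
void report(const char *routine, int n, double single_us, double multi_us)
{
    printf("%s %dx%d: %.2fx\n", routine, n, n,
           slowdown_factor(single_us, multi_us));
}

/* Windows timings quoted above: 1 thread vs. 4 threads, in microseconds. */
void report_windows_timings(void)
{
    report("dsytrd", 50, 124.9, 1296.5);
    report("zhetrd", 50, 203.1, 249.9);
    report("dsytrd", 500, 10831.1, 30951.7);
    report("zhetrd", 500, 32492.4, 59309.2);
}
```

Calling `report_windows_timings()` from a `main` function gives factors of roughly 10.4 for dsytrd and 1.2 for zhetrd at 50x50, and roughly 2.9 and 1.8 at 500x500.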
I could reproduce these results on four different Windows 10 machines (brief system info here). I used OpenBLAS 0.3.21, which I compiled using GCC 7.2.0 and CMake (with the default options).
For comparison, on an Intel Mac (macOS 10.14.6, i5-5250U (2 cores, 4 threads)) there is only a small performance penalty (if any) when manipulating small matrices in parallel, while for larger matrices multithreading boosts performance:
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test 50 1000
dsytrd:
Average of 1000 runs for a 50x50 matrix = 229.2 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 315.5 us
$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test 50 1000
dsytrd:
Average of 1000 runs for a 50x50 matrix = 180.8 us
zhetrd:
Average of 1000 runs for a 50x50 matrix = 288.8 us
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 14451.2 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 42615.3 us
$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test 500 100
dsytrd:
Average of 100 runs for a 500x500 matrix = 19221.5 us
zhetrd:
Average of 100 runs for a 500x500 matrix = 55341.5 us
This uses OpenBLAS 0.3.21 compiled with clang from Apple LLVM 10.0.1 and GNU Fortran (GCC) 8.2.0 (for building LAPACK), using CMake with the default options.
About this issue
- State: closed
- Created 2 years ago
- Comments: 28 (9 by maintainers)
syr2 with small n is already transformed to a single-threaded axpy loop, but syr2k does indeed lack a lower threshold.
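The idea of a lower threshold can be pictured as a size check made before work is split across threads. The sketch below is purely illustrative and is not OpenBLAS code: the function name `pick_num_threads`, the parameter `min_n_for_threads`, and any cutoff value passed to it are assumptions made for the example.

```c
/* Hypothetical dispatch helper, NOT taken from OpenBLAS.
 * Returns the number of threads to use for a problem of size n.
 * Below the (assumed) cutoff min_n_for_threads the call stays
 * single-threaded so that thread start-up and synchronisation
 * cost is not paid for a small amount of work. */
int pick_num_threads(int n, int max_threads, int min_n_for_threads)
{
    if (max_threads < 1) {
        return 1;               /* guard against a nonsensical setting */
    }
    if (n < min_n_for_threads) {
        return 1;               /* small problem: run single-threaded */
    }
    return max_threads;         /* large enough: use all allowed threads */
}
```

With an arbitrary cutoff of 128, `pick_num_threads(50, 4, 128)` returns 1 and `pick_num_threads(500, 4, 128)` returns 4. Where such a cutoff should actually sit for syr2k would have to be determined by measurement.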