OpenBLAS: `LAPACKE_dsytrd` performance degradation with multiple threads on Windows

(This follows from https://github.com/JuliaLang/julia/issues/47211.) On Windows, `LAPACKE_dsytrd` is considerably slower with multiple threads than with a single thread. The test code is here. For a 50x50 matrix, the results on a Windows 10 machine with an i7-6700K (4 cores, 8 threads) are:

```
> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 124.9 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 203.1 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 1296.5 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 249.9 us
```

I've also included `LAPACKE_zhetrd` for comparison. With 4 threads, dsytrd is even slower than zhetrd, which seems counterintuitive. For larger matrices, dsytrd is faster than zhetrd, but multithreading still leads to a slowdown:

```
> set OPENBLAS_NUM_THREADS=1
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 10831.1 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 32492.4 us
> set OPENBLAS_NUM_THREADS=4
> lapack-test.exe 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 30951.7 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 59309.2 us
```
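
To put a number on the slowdown, the ratio of the 4-thread time to the 1-thread time can be computed from the Windows timings above. The following is a small illustrative sketch; the function names are made up for this example and are not part of the test program.

```c
#include <stdio.h>

/* Ratio of multi-threaded to single-threaded time.
 * A value above 1 means that threading made the call slower. */
double slowdown(double t_multi_us, double t_single_us)
{
    return t_multi_us / t_single_us;
}

/* Print one measurement pair together with its ratio. */
void report(const char *label, double t_single_us, double t_multi_us)
{
    printf("%-16s %9.1f us -> %9.1f us  (x%.2f)\n",
           label, t_single_us, t_multi_us,
           slowdown(t_multi_us, t_single_us));
}

/* Timings copied from the Windows runs above (1 thread, then 4 threads). */
void report_windows_runs(void)
{
    report("dsytrd 50x50",   124.9,   1296.5);
    report("zhetrd 50x50",   203.1,    249.9);
    report("dsytrd 500x500", 10831.1, 30951.7);
    report("zhetrd 500x500", 32492.4, 59309.2);
}
```

Calling `report_windows_runs()` from `main` prints the four ratios: roughly 10.4 for dsytrd at 50x50 but only about 1.2 for zhetrd at the same size, and roughly 2.9 and 1.8 respectively at 500x500.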

I could reproduce these results on four different Windows 10 machines (brief system info here). I used OpenBLAS 0.3.21, which I compiled with GCC 7.2.0 and CMake (with the default options).
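
The linked test code is not reproduced in this text. As a rough idea of how a per-call average like the ones above can be obtained, here is a minimal sketch of a timing loop. It is an assumption about the general approach, not the actual test program: `average_us`, `elapsed_us` and `counting_kernel` are hypothetical names, and the stand-in kernel only counts calls so that the example has no OpenBLAS dependency.

```c
#include <time.h>

/* Elapsed time between two timestamps, in microseconds. */
double elapsed_us(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e6
         + (double)(end.tv_nsec - start.tv_nsec) / 1e3;
}

/* Call kernel(ctx) `runs` times and return the mean wall-clock time
 * per call in microseconds. Uses C11 timespec_get, which reads the
 * system clock and is therefore not guaranteed to be monotonic. */
double average_us(void (*kernel)(void *), void *ctx, int runs)
{
    struct timespec start, end;
    timespec_get(&start, TIME_UTC);
    for (int i = 0; i < runs; ++i)
        kernel(ctx);
    timespec_get(&end, TIME_UTC);
    return elapsed_us(start, end) / runs;
}

/* Stand-in kernel that only counts its invocations. A real benchmark
 * would refill the matrix and call the LAPACKE routine here instead. */
void counting_kernel(void *ctx)
{
    ++*(int *)ctx;
}
```

In a real measurement the kernel would need to restore the input matrix before each call, since the reduction overwrites it, and that setup cost should be kept out of the timed region or accounted for.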

For comparison, on an Intel Mac (macOS 10.14.6, i5-5250U (2 cores, 4 threads)) there is only a small performance penalty (if any) when working on small matrices with multiple threads, while for larger matrices multithreading improves performance:

```
$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 229.2 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 315.5 us
$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test 50 1000
dsytrd:
 Average of 1000 runs for a 50x50 matrix = 180.8 us
zhetrd:
 Average of 1000 runs for a 50x50 matrix = 288.8 us

$ export OPENBLAS_NUM_THREADS=4
$ ./lapack-test 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 14451.2 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 42615.3 us
$ export OPENBLAS_NUM_THREADS=1
$ ./lapack-test 500 100
dsytrd:
 Average of 100 runs for a 500x500 matrix = 19221.5 us
zhetrd:
 Average of 100 runs for a 500x500 matrix = 55341.5 us
```

This uses OpenBLAS 0.3.21 built with CMake (with the default options), using clang from Apple LLVM 10.0.1 and GNU Fortran (GCC) 8.2.0 (for building LAPACK).

About this issue

State: closed
Created 2 years ago
Comments: 28 (9 by maintainers)

Most upvoted comments

syr2 with small n is already transformed to a single-threaded axpy loop, but syr2k does indeed lack a lower threshold.

martin-frbg on Oct 30, 2022
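
To illustrate what a "lower threshold" means in the comment above, here is a hypothetical sketch of a size-based dispatch for a rank-2k update on an n-by-n matrix with inner dimension k. Neither the function name, the work estimate, nor the cutoff value is taken from OpenBLAS; they are assumptions chosen only to show the idea of staying single-threaded when the problem is too small to amortize threading overhead.

```c
/* HYPOTHETICAL cutoff for illustration only; not a value used by OpenBLAS. */
#define SMALL_PROBLEM_WORK 100000.0

/* Decide how many threads to use for an n-by-n rank-2k update with
 * inner dimension k. Below the cutoff the call stays single-threaded. */
int pick_num_threads(int n, int k, int max_threads)
{
    /* Rough proxy for the amount of arithmetic in the update. */
    double work = (double)n * (double)n * (double)k;

    if (max_threads < 1)
        return 1;               /* guard against nonsensical settings */
    if (work < SMALL_PROBLEM_WORK)
        return 1;               /* too small: threading overhead dominates */
    return max_threads;
}
```

With this made-up cutoff, `pick_num_threads(50, 32, 4)` returns 1, while `pick_num_threads(500, 32, 4)` returns 4. A real threshold would have to be tuned per architecture and routine.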