OpenBLAS: Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8

Following the discussion about compiling HPL on the Power 8 platform, I tried several of my codes with OpenBLAS from the development branch on an IBM OpenPower 8 machine. The machine is an IBM Model 8335-GTB (2x10 cores, 8-way SMT, 160 virtual cores) with CentOS 7.3, gcc 5.4, glibc 2.17 and kernel 4.8.

I compiled the current OpenBLAS development version (ab2033f) using

make USE_OPENMP=1 USE_THREAD=1 

and linked my code (Fortran, without any OpenMP-parallelized parts) against it. The code solves a Sylvester equation AXB + CXD = F (A, B, C, D, F, X are 1024x1024 matrices) and relies on the following BLAS operations (determined using the profiling feature of OpenBLAS):

  • dswap
  • dscal
  • dnrm2
  • idamax
  • drot
  • dgemv
  • dger
  • dtrmv
  • dgemm
  • dlaswp (see the comments below)

The code computes X 25 times and checks the forward error each time. When I run it with OMP_NUM_THREADS=20 (to avoid over-subscription of the cores), more than half of the computations are wrong (ferr := || X_true - X_computed || / || X_true ||):

Run:   0 	 ferr = 6.12559e-12
Run:   1 	 ferr = 6.12559e-12
Run:   2 	 ferr = 0.0187795
Run:   3 	 ferr = 0.63829
Run:   4 	 ferr = 6.12559e-12
Run:   5 	 ferr = 0.0902158
Run:   6 	 ferr = 0.0154748
Run:   7 	 ferr = 4367.12
Run:   8 	 ferr = 4.22541
Run:   9 	 ferr = 6.12559e-12
Run:  10 	 ferr = 6.12559e-12
Run:  11 	 ferr = 0.00973169
Run:  12 	 ferr = 30.6905
Run:  13 	 ferr = 6.12559e-12
Run:  14 	 ferr = 12.3574
Run:  15 	 ferr = 334.143
Run:  16 	 ferr = 574.977
Run:  17 	 ferr = 0.0585436
Run:  18 	 ferr = 258.05
Run:  19 	 ferr = 401.401
Run:  20 	 ferr = 493.521
Run:  21 	 ferr = 0.112386
Run:  22 	 ferr = 0.0393082
Run:  23 	 ferr = 0.258171
Run:  24 	 ferr = 543.353

If I run the same code with OMP_NUM_THREADS=1, or with the reference BLAS implementation and OMP_NUM_THREADS unset (to make sure that nothing in my code depends on threading), I obtain a forward error ferr of approx. 10^-12 for all runs of the benchmark. I also checked what happens if I restrict the number of threads to 20 at compile time (make ... NUM_THREADS=20): this still yields wrong results, but more runs are correct. Disabling threading completely in OpenBLAS gives correct results as well.
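
One possible way to narrow this down without the full solver is to repeat a single operation many times and compare every result against a reference obtained single-threaded. The following is a minimal sketch of such a harness in C, not the code used in this report. All names (kernel_fn, count_bad_runs, flaky_kernel) are hypothetical, and flaky_kernel is only a stand-in that fails on purpose on every second call; in a real test it would be replaced by a wrapper around the BLAS routine under suspicion.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical callback type: fills out[0..n-1] with one result of the
 * computation under test. ctx carries any state the kernel needs. */
typedef void (*kernel_fn)(double *out, size_t n, void *ctx);

/* Largest absolute elementwise difference; a NaN counts as infinitely wrong. */
static double max_abs_diff(const double *a, const double *b, size_t n)
{
    double m = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = fabs(a[i] - b[i]);
        if (isnan(d))
            return INFINITY;
        if (d > m)
            m = d;
    }
    return m;
}

/* Runs the kernel `runs` times into `work` and counts how many results
 * deviate from the reference `ref` by more than `tol`. */
int count_bad_runs(kernel_fn kernel, void *ctx, const double *ref,
                   double *work, size_t n, int runs, double tol)
{
    int bad = 0;
    for (int r = 0; r < runs; ++r) {
        kernel(work, n, ctx);
        if (max_abs_diff(ref, work, n) > tol)
            ++bad;
    }
    return bad;
}

/* Stand-in kernel for demonstration only: writes out[i] = i and corrupts
 * out[0] on every odd-numbered call to mimic an intermittent failure.
 * ctx must point to an int call counter. */
void flaky_kernel(double *out, size_t n, void *ctx)
{
    int *calls = ctx;
    for (size_t i = 0; i < n; ++i)
        out[i] = (double)i;
    if (n > 0 && (*calls % 2) == 1)
        out[0] += 1.0;
    ++*calls;
}
```

With a deterministic kernel the count should be zero for any thread count. A nonzero count that appears only with more than one thread would point in the same direction as the observations above.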

From this observation I concluded that something with threading went wrong and there is a race condition.

Interestingly, the same error appears with the PGI for OpenPower compiler suite, which ships a BLAS implementation based on OpenBLAS (I think a slightly modified one, so that it can be compiled with the PGI compiler on the ppc64le architecture). This suggests that the bug is not in the GNU OpenMP implementation, because PGI uses its own separate one.

Unfortunately, I do not have a minimal failing example for this bug yet, because the code mentioned above is part of current research.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 34 (22 by maintainers)

Most upvoted comments

The stock KERNEL.POWER8 file has a number of commented-out entries for xSYMV functions near the end - maybe it would make sense to enable these (first?) to see if switching back to their generic implementations helps.