OpenBLAS: Slowdown when using openblas-pthreads alongside OpenMP-based parallel code

Hi,

I have code that mixes BLAS calls (gemm) with OpenMP-based parallel loops (they are not nested). When OpenBLAS is built with OpenMP everything is fine, but when OpenBLAS is built with pthreads there is a huge slowdown. Below is a reproducible example (sorry, it's Python/Cython):

%load_ext cython

%%cython -f --compile-args=-fopenmp --link-args=-fopenmp
# cython: profile=True, cdivision=True, boundscheck=False, wraparound=False

import time
import numpy as np
from cython.parallel import prange
 
def f2(double[:, ::1] A):
    cdef:
        double v = 0
        int m = A.shape[0]
        int n = A.shape[1]
        int i, j

    with nogil:
        for i in prange(m):      # OpenMP parallel for loop (*)
            for j in range(n):
                v += A[i, j]     # v is an OpenMP reduction variable

    return v
    
    
def f1(U, V):
    v = 0
    for n_iter in range(100):
        UV = U @ V.T             # BLAS call (gemm)
        v += f2(UV)              # function runs an OpenMP parallel for loop
    return v
    

U = np.random.randn(10000, 100)
V = np.random.randn(10, 100)
 
t = time.time()
v = f1(U, V)
print(time.time() - t)

On my laptop (2 physical cores), when I use a sequential loop in (*) it runs in 0.26s; with a parallel loop it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. The following conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython

However, with OpenBLAS built with OpenMP it runs in 0.26s both with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. The following conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9
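One way to check which threading layer the installed OpenBLAS uses, and to experiment with capping the BLAS pool around the gemm, is threadpoolctl. The snippet below is only a sketch and assumes the threadpoolctl package is installed (it is not part of the conda environments listed above):

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Each entry reports the library (openblas), its version and its
# threading_layer ("pthreads" or "openmp").
for info in threadpool_info():
    print(info)

U = np.random.randn(10000, 100)
V = np.random.randn(10, 100)

# Capping OpenBLAS to one thread around the gemm avoids the pthread pool
# spinning while the OpenMP loop runs afterwards (a mitigation, not a fix).
with threadpool_limits(limits=1, user_api="blas"):
    UV = U @ V.T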

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 27 (13 by maintainers)

Most upvoted comments

@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as 1 << n, where the default n is 26 (according to the comment in Makefile.rule, roughly 25ms at 3GHz).
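For anyone who wants to experiment with this timeout, the sketch below sets the environment variable before NumPy (and hence OpenBLAS) is loaded; it assumes, per the description above, that the variable takes the exponent n directly:

import os

# Must be set before OpenBLAS initializes, i.e. before importing numpy.
# The wait is 1 << n clock cycles; the default n is 26, so a smaller
# value shortens the busy-wait of the pthread pool.
os.environ["OPENBLAS_THREAD_TIMEOUT"] = "16"

import numpy as np  # OpenBLAS reads the variable when it is loaded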

"behaviour should not be affected"

The OMP placement policy actually sets CPU affinity for the OMP threads, so all the pthreads created afterwards cannot escape it. There is no hidden side-step API that nobody else uses.

That just forces each new, unnecessary OpenBLAS pthread swarm onto a single core, which is slightly better for data locality than being completely unbound, but still bad.
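On Linux this interaction can be observed directly: the main thread's affinity mask, which any OpenBLAS pthreads created later inherit, can be inspected before and after the first OpenMP region. The sketch below assumes f2 is the prange function from the reproducer above and that OMP_PROC_BIND/OMP_PLACES are set in the environment:

import os
import numpy as np

A = np.random.randn(1000, 100)

# Affinity mask of the main process (set of CPU ids), Linux only.
print("before OpenMP region:", os.sched_getaffinity(0))
f2(A)   # first call initializes the OpenMP runtime and applies placement
print("after OpenMP region: ", os.sched_getaffinity(0))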

OpenBLAS pthread behaviour should not be affected by what OpenMP does, since the OpenBLAS calls are made from the main thread and unrelated work happens in the OpenMP threads.

I see the same issue. 30x slowdown with libgomp and 2x slowdown with libomp.

It's not an oversubscription issue here. The two are not nested: I first call gemm, then call a function which executes a parallel loop (with no BLAS inside).