scipy: BUG: Hang on Windows in scikit-learn with 1.13rc1 and 1.14.dev (maybe due to OpenBLAS 0.3.26 update?)

Describe your issue.

This was originally seen in MNE-Python CI roughly a week ago (March 13) using the scipy developement wheel (scientific-python-nightly-wheel) and reported in scikit-learn see https://github.com/scikit-learn/scikit-learn/issues/28625 for more details.

This can also be reproduced with scipy 1.13rc1. Could it be due to updating OpenBLAS to 0.3.26?

The scikit-learn code does nested parallelism, OpenBLAS within OpenMP. Setting OMP_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 avoids the hang. A similar work-around can be used with threadpoolctl.

Insights or suggestions more than welcome!

Reproducing Code Example

from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl

# Commenting any of these two lines avoids the hang
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
# threadpoolctl.threadpool_limits(limits=1, user_api='openmp')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')

Error message

Hang, i.e. the script never completes. With one of the work-around it completes in less than one second.

SciPy/NumPy/Python version and system information

1.13.0rc1 1.26.4 sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26
  pybind11:
    detection method: config-tool
    include directory: unknown
    name: pybind11
    version: 2.11.1
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.9
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  pythran:
    include directory: C:\Users\runneradmin\AppData\Local\Temp\pip-build-env-t0bobn5y\overlay\Lib\site-packages/pythr
an
    version: 0.15.0
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
Python Information:
  path: C:\Users\runneradmin\AppData\Local\Temp\cibw-run-bxqwrdgx\cp312-win_amd64\build\venv\Scripts\python.exe
  version: '3.12'

About this issue

  • Original URL
  • State: open
  • Created 3 months ago
  • Comments: 53 (33 by maintainers)

Most upvoted comments

@lesteve @tylerjereddy @martin-frbg I have been able to reliably reproduce this issue, believe I have determined the cause and am testing the fix now. Once testing confirms the fix I will send over a PR. ETA is today.

The good news is that if this addresses the issue the change is very small, an unfortunate bug in task queue management that only shows up with nested parallelism.

@martin-frbg we should add a test case for this scenario - ideally one that fails now with the current code so that we can confirm the fix addresses that failing test case. I am not familiar with how to integrate a utest that uses OMP. Once I have a confirmed fix I will explore that but any pointers to accelerate me are appreciated.

Unfortunately i am traveling this weekend and will not be back to a computer till Tues. I will create an issue on the the repo with instructions and ping you there.

If you don’t mind, I’m going to leave this open until we take an action that fixes the problem in our wheels for the 1.13.0 release process. It is easy for me to forget about stuff otherwise, especially with the NumPy 2.0 support challenge coming up on top of it.

For completeness sake, below is a Cython snippet that is close to what scikit-learn is doing, i.e. computing squared norms for each row of a 2d array in parallel.

cdef sqeuclidean_row_norms(const float64_t[::1] X intp_t num_threads):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        float64_t * X_ptr = <float64_t *> &X[0, 0]
        intp_t idx = 0
        int inc = 1
        intp_t n = X.shape[0]
        intp_t d = X.shape[1]
        int id = <int> d
        intp_t thread_num;
        float64_t[::1] squared_row_norms = np.empty(n, dtype=np.float64)

    # with threadpool_limits(limits=1, user_api='blas'):
    # with threadpool_limits(limits=4, user_api='blas'):
    cdef int nb_threads = omp_get_max_threads()
    printf("number of threads: %d", nb_threads)
    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = ddot(&id, X_ptr + idx * d, &inc, X_ptr + idx * d, &inc)

    return squared_row_norms

I cannot repro the issue with or without OpenMP on Linux. (Edited to remove autocorrection nonsense).

My repro above is indeed a simple example, I assume scikit-learn does something else, not just repeated computations in a loop. I suppose this example can made closer to what the OP was doing, in an environment where the OP issue is reproducible.

ISTM it would be useful to see if the problem is with OpenBLAS itself or with Cython.

This will be quite useful (with the note that it’s mostly about taking SciPy out of the equation rather than Cython itself). The reproducer should use nested parallelism though with an explicit OpenMP call, otherwise nothing will be wrong.

Sure, here’s a C reproducer, might be easier on Windows:

$ cat ddot.c
#include<stdio.h>
#include<stdlib.h>
#include "cblas.h"

int main(){

int n = 14000;
int incx = 1;
double *x, ddt;

x = (double*)malloc(n*sizeof(double));

for (int i=0; i<n; i++){
     x[i] = 1.0;
}

ddt = cblas_ddot(n, x, incx, x, incx);
printf("dot(x, x) = %f\n", ddt);
free(x);
}

ISTM it would be useful to see if the problem is with OpenBLAS itself or with Cython.

right, nrm2 on x86_64 is not multithreaded

wait, e60fb0f is the merge commit for Mark’s 4359 (the windows thread server rewrite) not my 4425 - that’s still unfortunate but a lot more plausible

It does seem like https://github.com/OpenMathLib/OpenBLAS/pull/4359 is indeed the cause of this behaviour.

I managed to build OpenBLAS on Windows following https://github.com/OpenMathLib/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio (Native (MSVC) ABI way) and replace the dll.

For completeness the way I used to setup the build:

cmake .. \
    -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl \
    -DCMAKE_Fortran_COMPILER=flang -DCMAKE_MT=mt -DBUILD_WITHOUT_LAPACK=yes \
    -DNOFORTRAN=1 -DDYNAMIC_ARCH=OFF -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON

Maybe the recent changes in

I’d consider that unlikely especially if you’re not using an OpenMP build of OpenBLAS in the first place