scipy: BUG: Hang on Windows in scikit-learn with 1.13rc1 and 1.14.dev (maybe due to OpenBLAS 0.3.26 update?)
Describe your issue.
This was originally seen in MNE-Python CI roughly a week ago (March 13) using the scipy developement wheel (scientific-python-nightly-wheel) and reported in scikit-learn see https://github.com/scikit-learn/scikit-learn/issues/28625 for more details.
This can also be reproduced with scipy 1.13rc1. Could it be due to updating OpenBLAS to 0.3.26?
The scikit-learn code does nested parallelism, OpenBLAS within OpenMP. Setting OMP_NUM_THREADS=1
or OPENBLAS_NUM_THREADS=1
avoids the hang. A similar work-around can be used with threadpoolctl
.
Insights or suggestions more than welcome!
Reproducing Code Example
from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl
# Commenting any of these two lines avoids the hang
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
# threadpoolctl.threadpool_limits(limits=1, user_api='openmp')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')
Error message
Hang, i.e. the script never completes. With one of the work-around it completes in less than one second.
SciPy/NumPy/Python version and system information
1.13.0rc1 1.26.4 sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory: /c/opt/64/include
lib directory: /c/opt/64/lib
name: openblas
openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
pc file directory: c:/opt/64/lib/pkgconfig
version: 0.3.26
lapack:
detection method: pkgconfig
found: true
include directory: /c/opt/64/include
lib directory: /c/opt/64/lib
name: openblas
openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
pc file directory: c:/opt/64/lib/pkgconfig
version: 0.3.26
pybind11:
detection method: config-tool
include directory: unknown
name: pybind11
version: 2.11.1
Compilers:
c:
commands: cc
linker: ld.bfd
name: gcc
version: 10.3.0
c++:
commands: c++
linker: ld.bfd
name: gcc
version: 10.3.0
cython:
commands: cython
linker: cython
name: cython
version: 3.0.9
fortran:
commands: gfortran
linker: ld.bfd
name: gcc
version: 10.3.0
pythran:
include directory: C:\Users\runneradmin\AppData\Local\Temp\pip-build-env-t0bobn5y\overlay\Lib\site-packages/pythr
an
version: 0.15.0
Machine Information:
build:
cpu: x86_64
endian: little
family: x86_64
system: windows
cross-compiled: false
host:
cpu: x86_64
endian: little
family: x86_64
system: windows
Python Information:
path: C:\Users\runneradmin\AppData\Local\Temp\cibw-run-bxqwrdgx\cp312-win_amd64\build\venv\Scripts\python.exe
version: '3.12'
About this issue
- Original URL
- State: open
- Created 3 months ago
- Comments: 53 (33 by maintainers)
@lesteve @tylerjereddy @martin-frbg I have been able to reliably reproduce this issue, believe I have determined the cause and am testing the fix now. Once testing confirms the fix I will send over a PR. ETA is today.
The good news is that if this addresses the issue the change is very small, an unfortunate bug in task queue management that only shows up with nested parallelism.
@martin-frbg we should add a test case for this scenario - ideally one that fails now with the current code so that we can confirm the fix addresses that failing test case. I am not familiar with how to integrate a utest that uses OMP. Once I have a confirmed fix I will explore that but any pointers to accelerate me are appreciated.
Unfortunately i am traveling this weekend and will not be back to a computer till Tues. I will create an issue on the the repo with instructions and ping you there.
If you don’t mind, I’m going to leave this open until we take an action that fixes the problem in our wheels for the
1.13.0
release process. It is easy for me to forget about stuff otherwise, especially with the NumPy 2.0 support challenge coming up on top of it.For completeness sake, below is a Cython snippet that is close to what scikit-learn is doing, i.e. computing squared norms for each row of a 2d array in parallel.
I cannot repro the issue with or without OpenMP on Linux. (Edited to remove autocorrection nonsense).
My repro above is indeed a simple example, I assume scikit-learn does something else, not just repeated computations in a loop. I suppose this example can made closer to what the OP was doing, in an environment where the OP issue is reproducible.
This will be quite useful (with the note that it’s mostly about taking SciPy out of the equation rather than Cython itself). The reproducer should use nested parallelism though with an explicit OpenMP call, otherwise nothing will be wrong.
Sure, here’s a C reproducer, might be easier on Windows:
ISTM it would be useful to see if the problem is with OpenBLAS itself or with Cython.
right, nrm2 on x86_64 is not multithreaded
wait, e60fb0f is the merge commit for Mark’s 4359 (the windows thread server rewrite) not my 4425 - that’s still unfortunate but a lot more plausible
It does seem like https://github.com/OpenMathLib/OpenBLAS/pull/4359 is indeed the cause of this behaviour.
I managed to build OpenBLAS on Windows following https://github.com/OpenMathLib/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio (Native (MSVC) ABI way) and replace the dll.
develop
branch https://github.com/OpenMathLib/OpenBLAS/commit/5b09833b1c877ad1d395ee38791521ccd32386beFor completeness the way I used to setup the build:
I’d consider that unlikely especially if you’re not using an OpenMP build of OpenBLAS in the first place