scikit-learn: Tests test_theil_sen_parallel and test_multi_output_classification_partial_fit_parallelism hang on Windows

Description

Install scikit-learn 0.20.0 on Windows as follows:

conda create -n hang python=3.6.5
pip install scikit-learn==0.20.0
pip install pytest

Steps/Code to Reproduce

The following two individual test runs hang (they never finish and remain uninterruptible):

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism
pytest -v --pyargs sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel

Expected Results

Expected them to pass as they do on Linux, or to be skipped in the distribution.

Actual Results

Tests hang both in pip-installed scikit-learn and in scikit-learn installed via conda.

Reproduced on “Windows Server 2012 R2 Standard”.

Versions

>pip list
Package        Version
-------------- ---------
atomicwrites   1.2.1
attrs          18.2.0
certifi        2018.8.24
colorama       0.3.9
more-itertools 4.3.0
numpy          1.15.2
pip            10.0.1
pluggy         0.7.1
py             1.6.0
pytest         3.8.2
scikit-learn   0.20.0
scipy          1.1.0
setuptools     40.2.0
six            1.11.0
wheel          0.31.1
wincertstore   0.2

@ogrisel

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 35 (35 by maintainers)

Most upvoted comments

I volunteer to test fixes for free 😃

but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h.

Note that this problem only exists on Windows, so big Windows instances are more like $5 to $10 per hour.

But this should still be fixed in loky.

Actually no, you are right, 64 cores is enough. I misread the above conversation.

Ok so the problem is caused by loky's implementation of wait, which relies on winapi.WaitForMultipleObjects and therefore cannot wait on more than MAXIMUM_WAIT_OBJECTS=64 handles at a time.
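For readers unfamiliar with the limitation: the usual way around it is to split the handle set into batches of at most 64 and wait on each batch in turn. The sketch below illustrates that chunking idea in pure Python with a stand-in `wait` function; it is not loky's actual implementation, and `fake_wait` and its "even handles are ready" rule are purely illustrative.

```python
# Illustrative sketch of chunking a wait over more than 64 objects,
# since WaitForMultipleObjects is capped at MAXIMUM_WAIT_OBJECTS (64).
# NOT loky's real code; fake_wait is a toy stand-in for the OS call.

MAXIMUM_WAIT_OBJECTS = 64

def chunked_wait(handles, wait_chunk):
    """Poll `handles` in batches of at most 64 via `wait_chunk`,
    collecting every handle that is ready."""
    ready = []
    for start in range(0, len(handles), MAXIMUM_WAIT_OBJECTS):
        chunk = handles[start:start + MAXIMUM_WAIT_OBJECTS]
        ready.extend(wait_chunk(chunk))
    return ready

def fake_wait(chunk):
    # Toy stand-in: a "handle" (an int) is ready if it is even.
    assert len(chunk) <= MAXIMUM_WAIT_OBJECTS
    return [h for h in chunk if h % 2 == 0]

# 130 handles (matching the 130 idle processes reported below) are
# split into batches of 64, 64 and 2.
ready = chunked_wait(list(range(130)), fake_wait)
print(len(ready))  # → 65
```

With this scheme no single OS-level wait ever sees more than 64 handles, at the cost of looping over batches rather than blocking on the full set at once.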

I opened an issue in the loky tracker (tomMoral/loky#192).
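Until that fix lands, one hedged mitigation is simply to keep the worker count below the 64-handle limit instead of using `n_jobs=-1`. The cap of 60 below is an arbitrary safety margin chosen for illustration, not a value from this thread; `n_jobs` itself is a real parameter of the affected estimators such as `TheilSenRegressor`.

```python
# Hedged workaround sketch: cap the number of workers below the Windows
# MAXIMUM_WAIT_OBJECTS limit of 64. SAFE_MAX_WORKERS = 60 is an
# arbitrary margin chosen here, not a value recommended in this thread.
import multiprocessing

SAFE_MAX_WORKERS = 60  # stay safely under the 64-handle limit

n_jobs = min(multiprocessing.cpu_count(), SAFE_MAX_WORKERS)
print(n_jobs)
# e.g. TheilSenRegressor(n_jobs=n_jobs) instead of n_jobs=-1
```

This does not fix the underlying loky bug, but it avoids ever handing more than 64 process handles to a single wait call on a many-core Windows machine.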

But it is still strange that you report 32 cores on your machine but get cpu_count() == 128. There seems to be another issue here. Could you please report the results of the following commands?

echo %NUMBER_OF_PROCESSORS%
python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())"
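As a small complement to the two commands above, the snippet below gathers the same numbers in one place so the reported core count can be cross-checked against what Python sees. It is a diagnostic sketch only; `NUMBER_OF_PROCESSORS` is a Windows environment variable and will print `None` on other platforms.

```python
# Diagnostic sketch: cross-check the CPU counts Python reports against
# the NUMBER_OF_PROCESSORS environment variable Windows sets.
import multiprocessing as mp
import os

print("os.cpu_count():", os.cpu_count())
print("mp.cpu_count():", mp.cpu_count())
# Windows-only variable; None on Linux/macOS.
print("NUMBER_OF_PROCESSORS:", os.environ.get("NUMBER_OF_PROCESSORS"))
```

On CPython, `multiprocessing.cpu_count()` delegates to `os.cpu_count()`, so a discrepancy between those two and the environment variable would point at the machine or container configuration rather than Python itself.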

I can still see the hang with a local build from current master (4e8194909ce0f8879ebe8073ae5d37e3fdc8a00f).

Upon execution of

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism

there are 130 idle Python processes shown in Task Manager and the execution never terminates.

Pressing Ctrl+C terminates all 130 processes, which is an improvement over an earlier version of joblib.

Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed?

Thanks!

@ogrisel @jnothman Please allow me a moment to build it and run the test. I will report back in about one hour (expected build + test run time on our server).

@tomMoral Yes, my bad. I have updated the earlier comment to include that line.