scikit-learn: Tests test_theil_sen_parallel and test_multi_output_classification_partial_fit_parallelism hang on Windows
Description
Install scikit-learn 0.20.0 on Windows as follows:
conda create -n hang python=3.6.5
pip install scikit-learn==0.20.0
pip install pytest
Steps/Code to Reproduce
The following two individual test runs hang (they never finish and cannot be interrupted):
pytest -v --pyarg sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism
pytest -v --pyarg sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel
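Because the hanging runs cannot be interrupted, it can help to launch each test in a child process with a timeout, so a hang is reported instead of blocking the terminal. The following is a minimal sketch, assuming pytest is installed for the same interpreter. The names run_with_timeout and check_tests and the 600-second default are illustrative choices, not part of the original report. On Windows, killing the pytest process on timeout may still leave its worker processes behind.

```python
import subprocess
import sys

# Test IDs taken from the reproduction steps above.
TEST_IDS = [
    "sklearn.tests.test_multioutput::"
    "test_multi_output_classification_partial_fit_parallelism",
    "sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel",
]


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process; return 'passed', 'failed' or 'timeout'."""
    try:
        # subprocess.run kills the child and raises once the timeout expires.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if completed.returncode == 0 else "failed"


def check_tests(test_ids, timeout_s=600):
    """Run each test ID through pytest separately and collect the outcomes."""
    results = {}
    for test_id in test_ids:
        cmd = [sys.executable, "-m", "pytest", "-v", "--pyargs", test_id]
        results[test_id] = run_with_timeout(cmd, timeout_s)
    return results
```

Calling check_tests(TEST_IDS) returns a dictionary mapping each test ID to its outcome; a value of "timeout" would correspond to the hang described here.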
Expected Results
Expected them to pass, as they do on Linux, or to be skipped in the distribution.
Actual Results
The tests hang both in pip-installed scikit-learn and in scikit-learn installed via conda itself.
Reproduced on “Windows Server 2012 R2 Standard”.
Versions
>pip list
Package        Version
-------------- ---------
atomicwrites   1.2.1
attrs          18.2.0
certifi        2018.8.24
colorama       0.3.9
more-itertools 4.3.0
numpy          1.15.2
pip            10.0.1
pluggy         0.7.1
py             1.6.0
pytest         3.8.2
scikit-learn   0.20.0
scipy          1.1.0
setuptools     40.2.0
six            1.11.0
wheel          0.31.1
wincertstore   0.2
About this issue
- State: closed
- Created 6 years ago
- Comments: 35 (35 by maintainers)
Commits related to this issue
- MAINT: n_jobs=-1 replaced with n_jobs=4 in tests This change is to work around the hang https://github.com/scikit-learn/scikit-learn/issues/12263 afflicting Windows on machines with > 62 hyperthre... — committed to oleksandr-pavlyk/scikit-learn by oleksandr-pavlyk 5 years ago
- MAINT: n_jobs=-1 replaced with n_jobs=4 in tests (#13644) This change is to work around the hang https://github.com/scikit-learn/scikit-learn/issues/12263 afflicting Windows on machines with >... — committed to scikit-learn/scikit-learn by oleksandr-pavlyk 5 years ago
- MAINT: n_jobs=-1 replaced with n_jobs=4 in tests (#13644) This change is to work around the hang https://github.com/scikit-learn/scikit-learn/issues/12263 afflicting Windows on machines with >... — committed to jeremiedbb/scikit-learn by oleksandr-pavlyk 5 years ago
- MAINT: n_jobs=-1 replaced with n_jobs=4 in tests (#13644) This change is to work around the hang https://github.com/scikit-learn/scikit-learn/issues/12263 afflicting Windows on machines with >... — committed to xhluca/scikit-learn by oleksandr-pavlyk 5 years ago
- MAINT: n_jobs=-1 replaced with n_jobs=4 in tests (#13644) This change is to work around the hang https://github.com/scikit-learn/scikit-learn/issues/12263 afflicting Windows on machines with >... — committed to koenvandevelde/scikit-learn by oleksandr-pavlyk 5 years ago
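The commits above hard-code n_jobs=4 in the affected tests. The same idea can be sketched as a small helper that resolves a joblib-style n_jobs value and caps it. This is only an illustration: resolve_n_jobs and its cap parameter are hypothetical names, not part of scikit-learn or joblib, and the default cap of 4 simply mirrors the value used in the commits.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None, cap=4):
    """Turn a joblib-style n_jobs value into a worker count of at most cap."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_cpus is None:
        # os.cpu_count() can return None; fall back to a single worker.
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        # joblib convention: -1 means all CPUs, -2 all but one, and so on.
        n_jobs = max(n_cpus + 1 + n_jobs, 1)
    # Never exceed the cap, however many CPUs the machine reports.
    return min(n_jobs, cap)
```

With this helper, a request for n_jobs=-1 on a machine reporting 128 logical CPUs resolves to 4 workers, while smaller explicit values pass through unchanged.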
I volunteer to test fixes for free 😃
Note that this problem only exists on Windows, so big instances cost more like $5 to $10 per hour.
But this should still be fixed in loky.
Actually no, you are right, 64 cores is enough. I misread the above conversation.
Ok, so the problem is caused by loky's implementation of wait, which relies on winapi.WaitForMultipleObjects, which cannot wait for more than MAXIMUM_WAIT_OBJECTS=64 objects.
I opened an issue in the loky tracker (tomMoral/loky#192).
But it is still strange that you report 32 cores in your machine but get cpu_count() == 128. There seems to be another issue here. Could you please report the results of the following commands?
I can still see the hang with a local build from current master (4e8194909ce0f8879ebe8073ae5d37e3fdc8a00f).
Upon execution of
There are 130 idle Python processes shown in task manager and the execution never terminates.
Pressing Ctrl+C terminates all 130 processes, which is an improvement over an earlier version of joblib.
Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed?
Thanks!
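To illustrate the 64-object limit mentioned in this thread: one common way to deal with a wait primitive that accepts at most 64 handles is to split the handles into chunks and check each chunk in turn. The sketch below is platform-independent and does not show how loky addressed the problem. The is_ready callable is a hypothetical stand-in for a non-blocking, platform-specific wait call, and chunked and poll_ready are illustrative names.

```python
MAXIMUM_WAIT_OBJECTS = 64  # Windows limit quoted in the thread


def chunked(handles, size=MAXIMUM_WAIT_OBJECTS):
    """Yield successive slices of at most size handles."""
    for start in range(0, len(handles), size):
        yield handles[start:start + size]


def poll_ready(handles, is_ready, size=MAXIMUM_WAIT_OBJECTS):
    """Return the handles reported ready, checking one chunk at a time.

    is_ready is a hypothetical callable: it receives a list of at most
    size handles and returns the ones that are ready, without blocking.
    """
    ready = []
    for chunk in chunked(handles, size):
        ready.extend(is_ready(chunk))
    return ready
```

For the 130 worker processes observed above, chunked would produce three groups of 64, 64 and 2 handles, each small enough for a single wait call.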
@ogrisel @jnothman Please allow me a moment to build it and run the test. I will report back in about one hour (expected build + test run time on our server).
@tomMoral Yes, my bad. I have updated the earlier comment to include the said line.