scikit-learn: Tests test_theil_sen_parallel and test_multi_output_classification_partial_fit_parallelism hang on Windows

Description

Install scikit-learn 0.20.0 on Windows as follows:

conda create -n hang python=3.6.5
pip install scikit-learn==0.20.0
pip install pytest

Steps/Code to Reproduce

The following two individual test runs hang (they never finish and remain uninterruptible):

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism
pytest -v --pyargs sklearn.linear_model.tests.test_theil_sen::test_theil_sen_parallel

Expected Results

Expected them to pass as they do on Linux, or to be skipped in the distribution.

Actual Results

Tests hang both in pip-installed scikit-learn and in scikit-learn installed via conda.

Reproduced on “Windows Server 2012 R2 Standard”.

Versions

>pip list
Package        Version
-------------- ---------
atomicwrites   1.2.1
attrs          18.2.0
certifi        2018.8.24
colorama       0.3.9
more-itertools 4.3.0
numpy          1.15.2
pip            10.0.1
pluggy         0.7.1
py             1.6.0
pytest         3.8.2
scikit-learn   0.20.0
scipy          1.1.0
setuptools     40.2.0
six            1.11.0
wheel          0.31.1
wincertstore   0.2

@ogrisel

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 35 (35 by maintainers)

Most upvoted comments

I volunteer to test fixes for free 😃

but more cores would still trigger the bug, even with python fixed, right? EC2 has c5.18xlarge with 72 cores for $0.70/h.

Note that this problem only exists on Windows, so big Windows instances are more like $5 to $10 per hour.

But this should still be fixed in loky.

Actually no, you are right, 64 cores is enough. I misread the above conversation.

Ok so the problem is caused by loky's implementation of wait, which relies on winapi.WaitForMultipleObjects and therefore cannot wait on more than MAXIMUM_WAIT_OBJECTS=64 handles at a time.
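For readers unfamiliar with the limitation: the usual way around it is to split the handle set into batches of at most 64 and wait on each batch in turn. The sketch below illustrates that chunking idea in pure Python with a stand-in `wait` function; it is not loky's actual implementation, and `fake_wait` and its "even handles are ready" rule are purely illustrative.

```python
# Illustrative sketch of chunking a wait over more than 64 objects,
# since WaitForMultipleObjects is capped at MAXIMUM_WAIT_OBJECTS (64).
# NOT loky's real code; fake_wait is a toy stand-in for the OS call.

MAXIMUM_WAIT_OBJECTS = 64

def chunked_wait(handles, wait_chunk):
    """Poll `handles` in batches of at most 64 via `wait_chunk`,
    collecting every handle that is ready."""
    ready = []
    for start in range(0, len(handles), MAXIMUM_WAIT_OBJECTS):
        chunk = handles[start:start + MAXIMUM_WAIT_OBJECTS]
        ready.extend(wait_chunk(chunk))
    return ready

def fake_wait(chunk):
    # Toy stand-in: a "handle" (an int) is ready if it is even.
    assert len(chunk) <= MAXIMUM_WAIT_OBJECTS
    return [h for h in chunk if h % 2 == 0]

# 130 handles (matching the 130 idle processes reported below) are
# split into batches of 64, 64 and 2.
ready = chunked_wait(list(range(130)), fake_wait)
print(len(ready))  # → 65
```

With this scheme no single OS-level wait ever sees more than 64 handles, at the cost of looping over batches rather than blocking on the full set at once.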

I opened an issue in the loky tracker (tomMoral/loky#192).
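Until that fix lands, one hedged mitigation is simply to keep the worker count below the 64-handle limit instead of using `n_jobs=-1`. The cap of 60 below is an arbitrary safety margin chosen for illustration, not a value from this thread; `n_jobs` itself is a real parameter of the affected estimators such as `TheilSenRegressor`.

```python
# Hedged workaround sketch: cap the number of workers below the Windows
# MAXIMUM_WAIT_OBJECTS limit of 64. SAFE_MAX_WORKERS = 60 is an
# arbitrary margin chosen here, not a value recommended in this thread.
import multiprocessing

SAFE_MAX_WORKERS = 60  # stay safely under the 64-handle limit

n_jobs = min(multiprocessing.cpu_count(), SAFE_MAX_WORKERS)
print(n_jobs)
# e.g. TheilSenRegressor(n_jobs=n_jobs) instead of n_jobs=-1
```

This does not fix the underlying loky bug, but it avoids ever handing more than 64 process handles to a single wait call on a many-core Windows machine.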

But it is still strange that you report 32 cores on your machine but get cpu_count() == 128. There seems to be another issue here. Could you please report the results of the following commands?

echo %NUMBER_OF_PROCESSORS%
python -c "import multiprocessing as mp; print('mp:', mp.cpu_count())"
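As a small complement to the two commands above, the snippet below gathers the same numbers in one place so the reported core count can be cross-checked against what Python sees. It is a diagnostic sketch only; `NUMBER_OF_PROCESSORS` is a Windows environment variable and will print `None` on other platforms.

```python
# Diagnostic sketch: cross-check the CPU counts Python reports against
# the NUMBER_OF_PROCESSORS environment variable Windows sets.
import multiprocessing as mp
import os

print("os.cpu_count():", os.cpu_count())
print("mp.cpu_count():", mp.cpu_count())
# Windows-only variable; None on Linux/macOS.
print("NUMBER_OF_PROCESSORS:", os.environ.get("NUMBER_OF_PROCESSORS"))
```

On CPython, `multiprocessing.cpu_count()` delegates to `os.cpu_count()`, so a discrepancy between those two and the environment variable would point at the machine or container configuration rather than Python itself.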

I can still see the hang with a local build from current master (4e8194909ce0f8879ebe8073ae5d37e3fdc8a00f).

Upon execution of

pytest -v --pyargs sklearn.tests.test_multioutput::test_multi_output_classification_partial_fit_parallelism

there are 130 idle Python processes shown in Task Manager and the execution never terminates.

Pressing Ctrl+C terminates all 130 processes, which is an improvement over an earlier version of joblib.

Is there a way for me to access prebuilt wheels, or other binaries that you used to check whether the problem is fixed?

Thanks!

@ogrisel @jnothman Please allow me a moment to build it and run the test. I will report back in about one hour (expected build + test run time on our server).

@tomMoral Yes, my bad. I have updated the earlier comment to include that line.