scikit-learn: Some processes not working under clustering.MeanShift

Hi, I’m using the parallel version of clustering.MeanShift (which, as it happens, I wrote myself). I’ve now noticed that most of the processes are “sleeping” and only a few actually do any work. Even more oddly, this doesn’t always happen:

  • the problem is worse on some machines than on others
  • the problem doesn’t seem to appear when working with 2 dimensions instead of 4 (see code below).
  • changing the code to use multiprocessing instead of joblib makes it work

I have no idea where to start…
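For context, the multiprocessing variant mentioned in the last bullet essentially maps the per-seed mean-shift update over a process pool. Here is a minimal, self-contained sketch of that idea (the function name `_shift_one_seed` and the toy two-blob data are illustrative, not the actual scikit-learn internals; the "fork" start method matches the Linux box from this report):

```python
import numpy as np
from functools import partial
from multiprocessing import get_context


def _shift_one_seed(seed, points, bandwidth, max_iter=50):
    """Move one seed uphill to its local density peak (flat-kernel mean shift)."""
    mean = seed
    for _ in range(max_iter):
        # Points inside the bandwidth ball around the current mean.
        in_ball = np.linalg.norm(points - mean, axis=1) < bandwidth
        new_mean = points[in_ball].mean(axis=0)
        if np.linalg.norm(new_mean - mean) < 1e-3 * bandwidth:
            break
        mean = new_mean
    return mean


rng = np.random.RandomState(0)
# Toy data: two tight, well-separated blobs around (0, 0) and (1, 1).
points = np.vstack([rng.randn(50, 2) * 0.05,
                    rng.randn(50, 2) * 0.05 + 1.0])
seeds = points[::10]

# One task per seed, handed to the pool directly instead of via joblib.
with get_context("fork").Pool(4) as pool:
    modes = np.array(pool.map(
        partial(_shift_one_seed, points=points, bandwidth=0.3), seeds))
print(np.round(modes, 1))
```

Each seed converges to the mean of its own blob, so every dispatched task is a full hill-climb rather than a tiny batched chunk, which is why the workers stay busy here.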

Reproduce

When running the code

from sklearn.cluster import MeanShift
import numpy as np

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)

a call to htop shows:

(htop screenshot from 2016-06-27: only a few of the 20 worker processes are running, the rest are sleeping)

Versions

Linux-2.6.32-573.3.1.el6.x86_64-x86_64-with-redhat-6.6-Carbon
Python 3.4.2 (default, Feb 4 2015, 08:24:27) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
NumPy 1.11.1
SciPy 0.17.1
Scikit-Learn 0.17.1

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 30 (16 by maintainers)

Most upvoted comments

So it seems the automatic batching of tasks is not well suited to some machines; I am not yet sure why.

A workaround that works for me is to set joblib.parallel.MIN_IDEAL_BATCH_DURATION to a higher value. It would be great if you could test whether this snippet works for you:


import numpy as np

from sklearn.cluster import MeanShift
from sklearn.externals.joblib import parallel

# Force larger task batches: with a higher minimum ideal batch duration,
# joblib dispatches bigger chunks of work per process instead of starving
# most workers with tiny, quickly exhausted batches.
parallel.MIN_IDEAL_BATCH_DURATION = 1.
parallel.MAX_IDEAL_BATCH_DURATION = parallel.MIN_IDEAL_BATCH_DURATION * 10

ndim = 4
points = np.random.random([100000, ndim])

MS = MeanShift(n_jobs=20, bandwidth=0.1)
print("Starting.")
MS.fit(points)
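As a side note for readers on newer versions: sklearn.externals.joblib was later removed from scikit-learn, and joblib's Parallel accepts a batch_size argument directly, so patching module globals is no longer needed. A minimal sketch of that knob in isolation (plain joblib, not the MeanShift code path):

```python
from joblib import Parallel, delayed


def square(i):
    return i * i


# batch_size fixes how many tasks are dispatched to a worker at once,
# instead of letting joblib auto-tune the batch size from task duration.
results = Parallel(n_jobs=2, batch_size=64)(
    delayed(square)(i) for i in range(256))
print(results[:4])  # → [0, 1, 4, 9]
```

Parallel preserves the order of the input iterable, so the results come back as if the loop had run sequentially.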