scikit-learn: Euclidean pairwise_distances slower for n_jobs > 1

As a follow-up to issue #8213, it looks like using n_jobs > 1 in Euclidean pairwise_distances makes computations slower instead of speeding them up.

Steps to reproduce

from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)
n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
                         n_train, n_test, n_dim))

    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)

    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')

        %timeit pairwise_distances(X_train, X_test, 'euclidean',
                                   n_jobs=n_jobs, squared=True)

which, on a 2-core CPU, returns:

# n_train=1000, n_test=100000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.95 s per loop

# n_train=10000, n_test=10000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.74 s per loop

# n_train=100000, n_test=1000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 2 s per loop
n_jobs= 2  => 1 loop, best of 3: 5.6 s per loop

While for small datasets it would make sense that parallel processing does not improve performance due to the multiprocessing overhead, these are by no means small datasets. The compute time also does not decrease when using e.g. n_jobs=4 on a 4-core CPU.

This also holds for other numbers of dimensions, e.g. n_dim=10:

# n_train=1000, n_test=100000, n_dim=10

n_jobs= 1  => 1 loop, best of 3: 873 ms per loop
n_jobs= 2  => 1 loop, best of 3: 4.25 s per loop

and n_dim=1000:

# n_train=1000, n_test=100000, n_dim=1000

n_jobs= 1  => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2  => 1 loop, best of 3: 8.56 s per loop

Running benchmarks/bench_plot_parallel_pairwise.py also yields similar results [benchmark plot attached].

This might affect a number of estimators / metrics where pairwise_distances is used.

Versions

Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1

I also get similar results with scikit-learn 0.17.1

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 22 (22 by maintainers)

Most upvoted comments

Actually, for cosine similarities and Euclidean distances there is a level-3 GEMM involved, so at least this step should benefit from multiple cores without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM-based distances.
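
For context, the sketch below illustrates the kind of decomposition that makes the Euclidean case GEMM-bound. It is a simplified illustration, not scikit-learn's actual implementation: squared distances are expanded as ||x||² − 2·x·y + ||y||², and the np.dot(X, Y.T) term is the level-3 BLAS call that MKL/OpenBLAS can already multithread.

import numpy as np

def squared_euclidean_via_gemm(X, Y):
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]  # squared row norms of X
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]  # squared row norms of Y
    D = -2 * np.dot(X, Y.T)                          # the level-3 GEMM step
    D += XX
    D += YY
    np.maximum(D, 0, out=D)  # clip tiny negatives caused by rounding
    return D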

Parallelism is always fraught with this kind of issue. There are multiple factors in whether it will succeed, and very often it won’t. […] n_jobs may be more useful when the calculation is slow, i.e. non-euclidean, or over sparse matrices or whatnot.
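
As a rough way to probe that last point (this snippet and the slow_metric callable are my own illustration, not something from the thread): with a Python-level callable metric the per-pair cost dominates, so process-based parallelism has a better chance of paying off than with the BLAS-backed Euclidean path.

from sklearn.metrics import pairwise_distances
import numpy as np

X = np.random.rand(2000, 50)

def slow_metric(a, b):
    # deliberately computed per pair in Python, no BLAS involved
    return np.sqrt(((a - b) ** 2).sum())

for n_jobs in [1, 2]:
    print('n_jobs=', n_jobs, ' => ', end='')
    %timeit pairwise_distances(X, metric=slow_metric, n_jobs=n_jobs)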

Yes, I know. I tried a quick parallelization of pairwise_distances(..., n_jobs=1),

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.externals.joblib import Parallel, delayed

def _naive_parallel_pairwise_distances(X, Y, metric, batch_size=100,
                                       n_jobs=1, squared=True):
    # split X into row batches and compute each batch against the full Y
    queue = []
    n_samples = X.shape[0]
    for k in range(n_samples // batch_size + int(n_samples % batch_size != 0)):
        mslice = slice(k * batch_size, min((k + 1) * batch_size, n_samples))
        X_sl = X[mslice, :]
        queue.append(delayed(pairwise_distances)(X_sl, Y, metric,
                                                 n_jobs=1, squared=squared))
    res = Parallel(n_jobs=n_jobs)(queue)
    # the batches are slices along the rows of X, so stack along axis 0
    return np.vstack(res)

and this yields comparable performance, so it must be, as you say, that n_jobs is better suited to non-Euclidean distances and other metrics that take longer to compute.
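
For reference, a hypothetical usage of the helper above, reusing the X_train / X_test arrays from the reproduction snippet (the batch_size value here is arbitrary):

D = _naive_parallel_pairwise_distances(X_train, X_test, 'euclidean',
                                       batch_size=1000, n_jobs=2)
assert D.shape == (X_train.shape[0], X_test.shape[0])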

Again, you’ve not varied n_features, which might be an important factor.

I have tried n_features in [10, 200, 1000] in my original post above (sorry, I called it n_dim); in all cases n_jobs > 1 makes things slower, including for large arrays X of shape (100000, 1000) and Y of shape (1000, 1000), which are reaching the limit of what I can compute with 8 GB RAM.

The question is: if n_jobs > 1 always makes pairwise Euclidean distances slower, no matter n_samples or n_features (which is what all the benchmarks suggest so far), then shouldn't some warning be printed to the user when n_jobs > 1 is used? Particularly since the Euclidean distance is the default (and probably the most frequently used) distance metric.
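
To make the suggestion concrete, here is a minimal sketch of what such a warning could look like. The helper name, placement, and wording are hypothetical, not existing scikit-learn behaviour:

import warnings

def _maybe_warn_about_n_jobs(metric, n_jobs):
    # hypothetical helper: warn when process-based parallelism is requested
    # for a GEMM-backed metric that BLAS already multithreads internally
    if n_jobs is not None and n_jobs != 1 and metric in ('euclidean', 'cosine'):
        warnings.warn("n_jobs={} requested for metric='{}', but this metric "
                      "is usually faster with n_jobs=1 (BLAS handles the "
                      "threading internally).".format(n_jobs, metric),
                      UserWarning)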

Also, when benchmarks/bench_plot_parallel_pairwise.py was added by @mblondel in https://github.com/scikit-learn/scikit-learn/commit/3b8f54e58667be80af6c099bfda40c7fe6579712, I imagine it aimed to demonstrate that using n_jobs > 1 could speed up Euclidean / RBF distance calculations (I couldn't find any plots / PR from that time). When I run this benchmark now, it illustrates that using n_jobs > 1 makes things slower (cf. images in the original post above), which doesn't really make sense to show in a benchmark…