scikit-learn: Euclidean pairwise_distances slower for n_jobs > 1
As a follow-up to issue #8213 , it looks like using n_jobs > 1 in Euclidean pairwise_distances makes computations slower instead of speeding them up.
Steps to reproduce
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)

n_dim = 200
for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
        n_train, n_test, n_dim))
    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)
    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')
        %timeit pairwise_distances(X_train, X_test, 'euclidean', n_jobs=n_jobs, squared=True)
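Note that %timeit is an IPython magic and only works in a notebook or IPython shell. A rough plain-Python equivalent of the same measurement (with smaller array sizes here, purely for brevity) might look like:

```python
# Plain-script sketch of the %timeit cell above, using timeit.repeat.
# Sizes are reduced compared to the benchmark in the report.
import timeit

import numpy as np
from sklearn.metrics import pairwise_distances

np.random.seed(99999)
X_train = np.random.rand(1000, 200)
X_test = np.random.rand(5000, 200)

for n_jobs in [1, 2]:
    # best-of-3 wall time for a single call, mirroring %timeit's "best of 3"
    best = min(timeit.repeat(
        lambda: pairwise_distances(X_train, X_test, 'euclidean',
                                   n_jobs=n_jobs, squared=True),
        number=1, repeat=3))
    print('n_jobs={} => {:.3f} s per loop'.format(n_jobs, best))
```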
which on a 2-core CPU returns:
# n_train=1000, n_test=100000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.95 s per loop
# n_train=10000, n_test=10000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.74 s per loop
# n_train=100000, n_test=1000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 2 s per loop
n_jobs= 2 => 1 loop, best of 3: 5.6 s per loop
While for small datasets it would make sense that parallel processing does not improve performance due to multiprocessing overhead, these are by no means small datasets. The compute time also does not decrease when using e.g. n_jobs=4 on a 4-core CPU.
This also holds for other numbers of dimensions. For n_dim=10:
# n_train=1000, n_test=100000, n_dim=10
n_jobs= 1 => 1 loop, best of 3: 873 ms per loop
n_jobs= 2 => 1 loop, best of 3: 4.25 s per loop
and for n_dim=1000:
# n_train=1000, n_test=100000, n_dim=1000
n_jobs= 1 => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.56 s per loop
Running benchmarks/bench_plot_parallel_pairwise.py also yields similar results:

[benchmark plots omitted: they show n_jobs > 1 consistently slower]
This might affect a number of estimators / metrics where pairwise_distances is used.
Versions
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
I also get similar results with scikit-learn 0.17.1
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 22 (22 by maintainers)
Actually for cosine similarities and euclidean distances there is a level 3 GEMM involved. So at least this step should benefit from multicore without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM-based distances.
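For reference, the GEMM-based trick this comment refers to can be sketched as follows. This is a minimal illustration of the algebraic identity ||x − y||² = ||x||² − 2 x·y + ||y||², not scikit-learn's actual implementation:

```python
# Squared Euclidean distances via one level-3 GEMM (X @ Y.T).
# The matrix product dominates the cost, and BLAS (MKL/OpenBLAS)
# can multithread it internally without any help from n_jobs.
import numpy as np

def squared_euclidean(X, Y):
    XX = (X * X).sum(axis=1)[:, np.newaxis]   # ||x||^2 as a column vector
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]   # ||y||^2 as a row vector
    D = XX - 2.0 * (X @ Y.T) + YY             # GEMM is the dominant term
    np.maximum(D, 0, out=D)                   # clip tiny negatives from rounding
    return D

X = np.random.rand(5, 3)
Y = np.random.rand(4, 3)
ref = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(squared_euclidean(X, Y), ref)
```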
Yes, I know. I tried a quick parallelization of pairwise_distances(..., n_jobs=1), and this yields comparable performance, so it must be as you say that n_jobs would better improve non-Euclidean distances etc. that take longer to compute. I have tried n_features in [10, 200, 1000] in my original post above (sorry, I called it n_dim); in all cases n_jobs > 1 makes things slower, including for large arrays X [100000, 1000], Y [1000, 1000], which reach the limit of what I can compute with 8 GB RAM.

The question is: if n_jobs always makes pairwise Euclidean distances slower, no matter n_samples or n_features (as all the benchmarks so far suggest), then should some warning be printed to the user when n_jobs > 1 is used? Particularly since Euclidean distance is the default (and probably the most frequently used) distance metric.

Also, when benchmarks/bench_plot_parallel_pairwise.py was added by @mblondel in https://github.com/scikit-learn/scikit-learn/commit/3b8f54e58667be80af6c099bfda40c7fe6579712 , I imagine it aimed to demonstrate that using n_jobs > 1 could speed up Euclidean / RBF distance calculations (I couldn't find any plots / PR from that time). When I run this benchmark now, it illustrates that using n_jobs > 1 makes things slower (cf. images in the original post above), which doesn't really make sense to show in a benchmark…
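For concreteness, the kind of coarse chunk-based parallelism under discussion can be sketched with joblib as below. This is a simplified illustration, not scikit-learn's internal code path: Y is split into row blocks, each worker computes the distances from X to its block, and the blocks are stacked back together. With process-based workers, X and each Y block must be serialized and copied to the workers, which is one source of the overhead observed in the benchmarks above:

```python
# Hypothetical sketch of coarse-grained parallel pairwise distances:
# split Y into n_jobs row blocks, compute each block in a worker,
# and horizontally stack the resulting distance sub-matrices.
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances

def chunked_pairwise(X, Y, n_jobs=2):
    chunks = np.array_split(Y, n_jobs)          # row blocks of Y
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(pairwise_distances)(X, c, 'euclidean') for c in chunks)
    return np.hstack(blocks)                    # (n_X, n_Y) distance matrix

X = np.random.rand(50, 8)
Y = np.random.rand(40, 8)
assert np.allclose(chunked_pairwise(X, Y),
                   pairwise_distances(X, Y, 'euclidean'))
```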