scikit-learn: Euclidean pairwise_distances slower for n_jobs > 1
In a follow-up to issue #8213, it looks like using n_jobs > 1
in Euclidean pairwise_distances
makes computations slower instead of speeding them up.
Steps to reproduce
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)

n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
        n_train, n_test, n_dim))
    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)
    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')
        %timeit pairwise_distances(X_train, X_test, 'euclidean',
                                   n_jobs=n_jobs, squared=True)
which, on a 2-core CPU, returns:
# n_train=1000, n_test=100000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.95 s per loop
# n_train=10000, n_test=10000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.74 s per loop
# n_train=100000, n_test=1000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 2 s per loop
n_jobs= 2 => 1 loop, best of 3: 5.6 s per loop
While it would make sense for small datasets that parallel processing does not improve performance, due to the multiprocessing overhead, this is by no means a small dataset. The compute time also does not decrease when using e.g. n_jobs=4 on a 4-core CPU.
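For context, the parallel path roughly does the following (a simplified paraphrase of sklearn.metrics.pairwise._parallel_pairwise in this version, not the exact code): Y is split into n_jobs blocks and each block is shipped to a joblib worker along with a full copy of X, so the serialization of X alone can dominate for arrays of this size.

# Simplified paraphrase of the parallel code path (names and imports are
# mine; in 0.18 joblib is vendored as sklearn.externals.joblib): each
# worker gets all of X plus one slice of Y, so X is pickled once per task.
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices

def parallel_pairwise_sketch(X, Y, n_jobs=2):
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(euclidean_distances)(X, Y[s])
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(blocks)  # reassemble the (n_X, n_Y) distance matrix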
This also holds for other numbers of dimensions. With n_dim=10:
# n_train=1000, n_test=100000, n_dim=10
n_jobs= 1 => 1 loop, best of 3: 873 ms per loop
n_jobs= 2 => 1 loop, best of 3: 4.25 s per loop
and with n_dim=1000:
# n_train=1000, n_test=100000, n_dim=1000
n_jobs= 1 => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.56 s per loop
Running benchmarks/bench_plot_parallel_pairwise.py
also yields similar results:

[plots from bench_plot_parallel_pairwise.py omitted; they show n_jobs > 1 slower than n_jobs=1]
This might affect a number of estimators / metrics where pairwise_distances
is used.
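For instance (a hypothetical illustration with arbitrary sizes), brute-force nearest neighbors delegates its distance computations to pairwise_distances, so its n_jobs parameter can hit the same slow path:

# Hypothetical illustration (arbitrary sizes): with algorithm='brute',
# kneighbors computes its distance matrix via pairwise_distances, so
# n_jobs=2 can end up slower than n_jobs=1 for the euclidean metric too.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(20000, 200)
nn = NearestNeighbors(n_neighbors=5, algorithm='brute',
                      metric='euclidean', n_jobs=2).fit(X)
distances, indices = nn.kneighbors(X[:1000])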
Versions
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
I also get similar results with scikit-learn 0.17.1.
Actually, for cosine similarities and euclidean distances there is a level-3 GEMM involved, so at least this step should benefit from multiple cores without running into the memory-bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to handle the parallelism manually at a coarser level for those GEMM-based distances.
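Concretely, the GEMM in question comes from the standard expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2. A minimal sketch of that trick (the same idea euclidean_distances is built on, not its exact code):

# Minimal sketch of the GEMM-based squared euclidean distance (the
# standard expansion, not scikit-learn's exact implementation):
#   ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
import numpy as np

def squared_euclidean_sketch(X, Y):
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]  # squared row norms of X
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]  # squared row norms of Y
    # The dominant cost is this level-3 GEMM, which a multithreaded BLAS
    # (MKL/OpenBLAS) already parallelizes across cores on its own.
    D = XX + YY - 2.0 * np.dot(X, Y.T)
    np.maximum(D, 0, out=D)  # clip small negatives from rounding
    return D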
Yes, I know. I tried a quick parallelization of pairwise_distances(..., n_jobs=1) myself, and this yields comparable performance, so it must be, as you say, that n_jobs would better help non-Euclidean distances etc. that take longer to compute.
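For reference, a minimal sketch of that kind of manual chunked parallelization (the chunking scheme and backend here are my choices, not necessarily what was actually tried):

# Minimal sketch of a manual parallelization over row blocks of X, with
# each worker calling pairwise_distances(..., n_jobs=1). As reported
# above, this performed comparably, i.e., gave no real speed-up either.
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices

def manual_parallel_pairwise(X, Y, n_jobs=2):
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(pairwise_distances)(X[s], Y, 'euclidean', n_jobs=1)
        for s in gen_even_slices(X.shape[0], n_jobs))
    return np.vstack(blocks)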
I have tried n_features in [10, 200, 1000] in my original post above (sorry, I called it n_dim); in all cases n_jobs > 1 makes things slower, including for large arrays X [100000, 1000], Y [1000, 1000], which reach the limit of what I can compute with 8 GB RAM.

The question is: if n_jobs > 1 always makes pairwise euclidean distances slower, no matter n_samples or n_features (as all the benchmarks so far suggest), then should a warning perhaps be printed to the user when n_jobs > 1 is used? Particularly since euclidean distance is the default (and probably the most frequently used) distance metric.

Also, when benchmarks/bench_plot_parallel_pairwise.py was added by @mblondel in https://github.com/scikit-learn/scikit-learn/commit/3b8f54e58667be80af6c099bfda40c7fe6579712, I imagine it aimed to demonstrate that using n_jobs > 1 could speed up Euclidean / RBF distance calculations (I couldn't find any plots / PRs from that time). When I run this benchmark now, it illustrates that using n_jobs > 1 makes things slower (cf. the plots referenced in the original post above), which doesn't really make sense to show in a benchmark…