scikit-learn: HashingVectorizer slow in version 0.18 on Windows
Description
After upgrading to scikit-learn 0.18, HashingVectorizer is about 10 times slower.
Before:
scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.594092130661011 seconds, resulting shape (11314, 1048576)
After upgrade:
scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 43.587692737579346 seconds, resulting shape (11314, 1048576)
Steps/Code to Reproduce
import time, sklearn, platform, numpy
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

# Load the 20newsgroups training set (11314 documents).
data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
print('scikit-learn {}. Numpy {}. Python {} {}'.format(
    sklearn.__version__, numpy.version.full_version,
    platform.python_version(), platform.machine()))

# Time HashingVectorizer on the full corpus.
vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in", time.time() - start,
      "seconds, resulting shape", data.shape)
Expected Results
Vectorization should take about 5 seconds, as it did in version 0.17.
Actual Results
Vectorization takes about 43 seconds in version 0.18.
Versions
scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 26 (21 by maintainers)
Agreed! Since what I tried originally did not make much sense (going back to 2e4aafd, which was still ~30s and where the only difference compared to 0.17 was the removal of an email address in a comment), I ran a brute-force git bisect, and it looks like the culprit is the removal of the generated .c files in 986fb60947f91306d9dcf8d09a321ba52f4417af. I am now trying to debug the problem further to see whether this can be attributed to a cython regression somehow…
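For anyone who wants to reproduce this kind of bisect, it can be automated with git bisect run and a small timing script along the lines of the sketch below. The 10-second threshold, the file name bisect_hashing.py and the in-place rebuild step are illustrative assumptions, not the exact procedure used here:

# bisect_hashing.py -- helper for `git bisect run`: exit 0 if fast (good), 1 if slow (bad).
# Sketch only; the 10 s threshold and the rebuild step are assumptions, not the original script.
import subprocess
import sys
import time

# Rebuild the C extensions in place so the checkout under test is what actually runs.
subprocess.check_call([sys.executable, "setup.py", "build_ext", "-i"])

# Import only after the rebuild so the freshly built extension modules are used.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

docs = fetch_20newsgroups(subset="train", shuffle=True, random_state=42).data
start = time.time()
HashingVectorizer().fit_transform(docs)
elapsed = time.time() - start
print("HashingVectorizer.fit_transform took {:.1f}s".format(elapsed))
sys.exit(0 if elapsed < 10 else 1)

With a good and a bad commit marked, git bisect run python bisect_hashing.py then walks the history automatically.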
So the result of the investigation is that the regression seems to have been introduced by cython 0.21.2.
The differences between sklearn/feature_extraction/_hashing.c generated by cython 0.21.1 and 0.22.2:
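As a side note, the Cython version that produced a given generated file can be read from its header comment; a tiny sketch, with the path assuming an in-place scikit-learn checkout:

# Print the version banner at the top of the generated C file; Cython writes a
# first line of the form: /* Generated by Cython 0.21.1 */
with open("sklearn/feature_extraction/_hashing.c") as f:
    print(f.readline().strip())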
More details, in case someone wants to reproduce: I first ran a clean build on master with cython 0.24 and then hacked build_tools/cythonize.py so that the .pyx files don't get cythonized again and the existing .c files are used instead. I can then cythonize sklearn/feature_extraction/_hashing.pyx manually with the cython version I want and run:
python setup.py build_ext -i
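For reference, the manual regeneration step can also be driven from Python via Cython's build API; a minimal sketch, assuming the desired cython version is installed in the current environment and the script is run from the scikit-learn source root (not the exact commands used above):

# regenerate_hashing.py -- regenerate _hashing.c with whichever cython is installed.
import Cython
from Cython.Build import cythonize

print("Regenerating with Cython", Cython.__version__)
# Writes sklearn/feature_extraction/_hashing.c next to the .pyx file.
cythonize(["sklearn/feature_extraction/_hashing.pyx"])
# Afterwards, rebuild in place with: python setup.py build_ext -i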
Great. Excellent work by the whole team!
Nice work on finding the issue, @lesteve!
It would have been great to have some kind of automatic benchmarks (e.g. with airspeed velocity) to detect such major performance regressions, but I guess that would require a lot of additional work and wouldn't solve all the problems (variations in dependency versions / OS, regressions in the dependencies, scaling to large datasets, etc.)…
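For illustration, an airspeed velocity benchmark covering this case could look roughly like the sketch below; the file name and suite layout are hypothetical, and asv times any method whose name starts with time_:

# benchmarks/bench_text_vectorizers.py -- hypothetical asv benchmark sketch.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer


class HashingVectorizerSuite:
    def setup(self):
        # Runs before the timed method; fetch the 20newsgroups training set.
        self.docs = fetch_20newsgroups(subset="train", shuffle=True,
                                       random_state=42).data

    def time_fit_transform(self):
        # asv reports the runtime of this call across revisions.
        HashingVectorizer().fit_transform(self.docs)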