scikit-learn: HashingVectorizer slow in version 0.18 on Windows

Description

After upgrading to scikit-learn 0.18, HashingVectorizer is about 10 times slower.

Before:

scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in  4.594092130661011  seconds, resulting shape  (11314, 1048576)

After upgrade:

scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in  43.587692737579346  seconds, resulting shape  (11314, 1048576)

Steps/Code to Reproduce

import time, sklearn, platform, numpy
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

print('scikit-learn {}. Numpy {}. Python {} {}'.format(
    sklearn.__version__, numpy.version.full_version,
    platform.python_version(), platform.machine()))

vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in ", time.time() - start,
    ' seconds, resulting shape ', data.shape)

Expected Results

vectorization should take about 5 seconds, as in version 0.17

Actual Results

vectorization takes about 43 seconds in version 0.18

Versions

scikit-learn 0.18. Numpy 1.11.2. Python 3.5.2 AMD64

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 26 (21 by maintainers)

Most upvoted comments

Very strange

Agreed! Since what I tried originally did not make much sense (going back to 2e4aafd, which was still ~30s and where the only difference compared to 0.17 was the removal of an email address in a comment), I ran a brute-force git bisect and it looks like the culprit is the removal of the generated .c files in 986fb60947f91306d9dcf8d09a321ba52f4417af. I am trying to debug the problem further to see whether it can be attributed to a Cython regression somehow…
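
For reference, a bisect like this can be driven by a small timing script and git bisect run. The sketch below is only an illustration of the idea: the rebuild step, the 25-second threshold, and the file name bisect_check.py are my assumptions, not what was actually used.

# bisect_check.py -- usable as: git bisect run python bisect_check.py
# Exit code 0 means "good" (fast) and 1 means "bad" (slow) for git bisect.
import subprocess, sys, time

# Rebuild the compiled extensions for the currently checked-out commit.
subprocess.check_call([sys.executable, "setup.py", "build_ext", "--inplace"])

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

docs = fetch_20newsgroups(subset="train").data
start = time.time()
HashingVectorizer().fit_transform(docs)
elapsed = time.time() - start

# 25 s is an arbitrary cut-off between the ~5 s (fast) and ~43 s (slow) timings above.
sys.exit(0 if elapsed < 25 else 1)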

So the result of the investigation is that the regression seems to have been introduced by cython 0.21.2.

The differences between sklearn/feature_extraction/_hashing.c as generated by Cython 0.21.1 and 0.21.2:

diff --git a/home/lesteve/VirtualBox VMs/Windows 10/shared/_hashing.c0.21.1 b/home/lesteve/VirtualBox VMs/Windows 10/shared/_hashing.c0.21.2
index 3f908f6..6f9279e 100644
--- a/home/lesteve/VirtualBox VMs/Windows 10/shared/_hashing.c0.21.1
+++ b/home/lesteve/VirtualBox VMs/Windows 10/shared/_hashing.c0.21.2
@@ -1,4 +1,4 @@
-/* Generated by Cython 0.21.1 */
+/* Generated by Cython 0.21.2 */

 #define PY_SSIZE_T_CLEAN
 #ifndef CYTHON_USE_PYLONG_INTERNALS
@@ -19,7 +19,7 @@
 #elif PY_VERSION_HEX < 0x02060000 || (0x03000000 <= PY_VERSION_HEX && PY_VERSION_HEX < 0x03020000)
     #error Cython requires Python 2.6+ or Python 3.2+.
 #else
-#define CYTHON_ABI "0_21_1"
+#define CYTHON_ABI "0_21_2"
 #include <stddef.h>
 #ifndef offsetof
 #define offsetof(type, member) ( (size_t) & ((type*)0) -> member )
@@ -933,13 +933,15 @@ static CYTHON_INLINE int resize(arrayobject *self, Py_ssize_t n) {
 static CYTHON_INLINE int resize_smart(arrayobject *self, Py_ssize_t n) {
     void *items = (void*) self->data.ob_item;
     Py_ssize_t newsize;
-    if (n < self->allocated) {
-        if (n*4 > self->allocated) {
-            self->ob_size = n;
-            return 0;
-        }
+    if (n < self->ob_size) {
+        self->ob_size = n;
+        return 0;
+    }
+    newsize = n + (n / 2) + 1;
+    if (newsize <= self->allocated) {
+        PyErr_NoMemory();
+        return -1;
     }
-    newsize = n  * 3 / 2 + 1;
     PyMem_Resize(items, char, (size_t)(newsize * self->ob_descr->itemsize));
     if (items == NULL) {
         PyErr_NoMemory();
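
My reading of the diff (not a verified analysis): the 0.21.1 code only calls PyMem_Resize when the requested size no longer fits in the already-allocated buffer, whereas the 0.21.2 code ends up reallocating on essentially every growing request, which would explain why the slowdown is so visible on Windows, where realloc tends to be comparatively expensive. A toy Python model of the difference in reallocation counts (illustration only; the function names are made up):

# Toy model, not the actual C code: count how often each strategy reallocates
# while an array grows one element at a time.
def reallocs_capacity_aware(n_items):
    """0.21.1-style: reallocate only when the request exceeds current capacity."""
    allocated = 0
    reallocs = 0
    for n in range(1, n_items + 1):
        if n > allocated:
            allocated = n * 3 // 2 + 1  # geometric growth
            reallocs += 1
    return reallocs

def reallocs_every_growth(n_items):
    """0.21.2-style (as read from the diff): every growing request reallocates."""
    reallocs = 0
    for n in range(1, n_items + 1):
        reallocs += 1
    return reallocs

print(reallocs_capacity_aware(1_000_000))  # a few dozen reallocations
print(reallocs_every_growth(1_000_000))    # 1000000 reallocations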

More details, in case someone wants to reproduce: I first ran a clean build on master with Cython 0.24 and then hacked build_tools/cythonize.py like this:

diff --git a/build_tools/cythonize.py b/build_tools/cythonize.py
index b01da58..326d57c 100755
--- a/build_tools/cythonize.py
+++ b/build_tools/cythonize.py
@@ -151,6 +151,9 @@ def cythonize_if_unchanged(path, cython_file, gen_file, hashes):
     if current_hash == hashes.get(clean_path(full_cython_path)):
         print('%s has not changed' % full_cython_path)
         return
+    else:
+        print('Ignoring hack!!!!')
+        return

     print('Processing %s' % full_cython_path)
     cythonize(full_cython_path, full_gen_file_path)

This way the .pyx files don't get cythonized again and the checked-in .c files are used instead. I can then cythonize sklearn/feature_extraction/_hashing.pyx manually with the Cython version I want and run python setup.py build_ext -i.
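
The manual cythonization step can be a one-liner along these lines (a sketch; it assumes the desired Cython version is installed in the current environment and that it is run from the repository root):

# Regenerate sklearn/feature_extraction/_hashing.c with the installed Cython,
# then rebuild with: python setup.py build_ext -i
from Cython.Build import cythonize

cythonize("sklearn/feature_extraction/_hashing.pyx")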

Great. Excellent work to all the team!

Nice work on finding the issue @lesteve!

It would have been great to have some kind of automatic benchmarks (e.g. with airspeed velocity) to detect such major performance regressions, but I guess it would require a lot of additional work and wouldn't solve all the problems (variations in dependency versions/OS, regressions in the dependencies, scaling to large datasets, etc.)…
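
For what it is worth, an asv benchmark covering this case could look roughly like the sketch below. The module and class names are hypothetical; only asv's setup()/time_* conventions and the scikit-learn calls from the report above are real.

# benchmarks/bench_text_vectorizers.py -- hypothetical asv benchmark module.
# asv runs setup() before timing and measures every method whose name starts with time_.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer


class HashingVectorizerSuite:
    # Generous per-run timeout, since the regressed case takes ~45 s.
    timeout = 120

    def setup(self):
        self.docs = fetch_20newsgroups(subset='train').data

    def time_fit_transform(self):
        HashingVectorizer().fit_transform(self.docs)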