scikit-learn: OrdinalEncoder becomes slow in presence of numerous `nan` values

Describe the bug

I want to use OrdinalEncoder on a feature with ~10 categories where more than 99% of the values are NaN. Execution time is very slow: ~4 minutes for 1e5 rows. But strangely enough, if the feature is not sparse, fitting takes ~1 second.

Worth mentioning: if you use category_encoders.ordinal.OrdinalEncoder instead, the problem with the sparse categorical feature does not occur.

Steps/Code to Reproduce

import datetime as dt
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# from category_encoders.ordinal import OrdinalEncoder

def xgb_cat():
    print('xgb cat')
    num_rows = int(1e5)
    # num_cols = 200

    x = pd.DataFrame()

    encoder = OrdinalEncoder()

    x['cat'] = [f'cat_{v}' for v in np.random.randint(10, size=num_rows)]
    start_ = dt.datetime.now()
    print('\ncat without nan')
    print('start fit')
    encoder.fit(x)
    end_ = dt.datetime.now()
    print(f"execution time: {end_ - start_}")
    # execution time: 1s

    x['cat'] = np.nan
    x.loc[0:99, 'cat'] = [f'cat_{v}' for v in np.random.randint(10, size=100)]

    start_ = dt.datetime.now()
    print('\ncat with >99% nan')
    print('start fit')
    encoder.fit(x)
    end_ = dt.datetime.now()
    print(f"execution time: {end_ - start_}")
    # execution time: 4min
xgb_cat()

Expected Results

Execution time for the sparse column is also ~1 s.

Actual Results

Execution time for the sparse column is ~4 min.

Versions

System:
    python: 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]
executable: C:\--\python.exe
   machine: Windows-10-10.0.19043-SP0
Python dependencies:
      sklearn: 1.1.3
          pip: 22.3.1
   setuptools: 65.4.1
        numpy: 1.23.4
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.1
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\--\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None
    num_threads: 16
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\--\Lib\site-packages\numpy\.libs\libopenblas.FB5AE2TYXYH2IJRDKGDGQ3XBKLKTF43H.gfortran-win_amd64.dll
        version: 0.3.20
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 16
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\--\Lib\site-packages\scipy.libs\libopenblas-57db09cfe174768fb409a6bb5a530d4c.dll
        version: 0.3.18
threading_layer: pthreads
   architecture: Prescott
    num_threads: 16

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

I think we can consider it a performance bug of Python itself and advise impacted users to upgrade their version of Python, so that we can focus our efforts on improving scikit-learn itself.

So, just as a recap:


`main` branch:

Case without missing values: 2.01 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Case with 99% missing values: 2min 45s ± 3.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR:

Case without missing values: 17.8 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Case with 99% missing values: 30.5 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With Python > 3.10 and `main`:

Case without missing values: 189 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Case with 99% missing values: 3.85 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python 3.10+ is so fast that I don't think we should change the code for it. Maybe we should implement the fix only for older versions of Python, because the slowdown there is quite dramatic.

I would need input from @betatim, @adrinjalali, and @thomasjpfan.
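
For reference, here is a minimal sketch (not the maintainers' actual benchmark script, so the numbers above were obtained differently) of how such timings could be reproduced with `timeit`:

import timeit

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

num_rows = 100_000

# Dense case: ~10 string categories, no missing values.
dense = pd.DataFrame({"cat": [f"cat_{v}" for v in np.random.randint(10, size=num_rows)]})

# Sparse case: the same categories in the first 100 rows, NaN everywhere else.
sparse = dense.copy()
sparse["cat"] = np.nan
sparse.loc[0:99, "cat"] = [f"cat_{v}" for v in np.random.randint(10, size=100)]

for name, frame in [("without missing values", dense), ("with >99% missing values", sparse)]:
    elapsed = timeit.timeit(lambda: OrdinalEncoder().fit(frame), number=1)
    print(f"Case {name}: {elapsed:.3f} s")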

Interesting finding: the slowdown only occurs with Python < 3.10. I ran a new benchmark and could not reproduce it in my freshly installed environment. The reason is:

Hashes of NaN values of both float type and decimal.Decimal type now depend on object identity. Formerly, they always hashed to 0 even though NaN values are not equal to one another. This caused potentially quadratic runtime behavior due to excessive hash collisions when creating dictionaries and sets containing multiple NaNs. (Contributed by Raymond Hettinger in bpo-43475.)

cf. https://docs.python.org/3/whatsnew/3.10.html
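
To make that changelog entry concrete, here is a small sketch (not from the issue) showing why building a set or dict from many distinct NaN objects degrades on Python < 3.10:

import sys
import timeit

# Distinct float NaN objects: they are all unequal to each other (NaN != NaN).
nans = [float("nan") for _ in range(20_000)]

print(sys.version)
# Before Python 3.10 every NaN hashes to 0; from 3.10 on, the hash is derived
# from the object's identity, so distinct NaN objects get distinct hashes.
print({hash(x) for x in nans[:5]})

# On 3.10+ this is roughly O(n); on older interpreters it degrades towards
# O(n^2) because every insertion collides with all previously stored NaNs
# (bpo-43475), so this line can take noticeably long there.
print(timeit.timeit(lambda: set(nans), number=1), "seconds")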

I have to benchmark again with the same setup on Python 3.10 to check the performance of the current implementation.

At first, `a` and `b` point to the same `np.nan` singleton. So identity (and thus membership) checks work in a list or in a NumPy array of dtype object, since you only store references to the singleton.

Once you call `np.array`, the values are stored in a contiguous array of floating-point numbers, so you no longer refer to the singleton and the identity does not hold. Converting back to a list, each element points to a new object created from the NumPy array, each with its own address, so the identity does not hold there either.

This is quite logical but not intuitive at first 😃
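
A small sketch (not from the issue) of the identity behaviour described above:

import numpy as np

a = np.nan
b = np.nan
print(a is b)                     # True: both names refer to the same np.nan singleton

obj = np.array([np.nan, np.nan], dtype=object)
print(obj[0] is np.nan)           # True: the object array stores references to the singleton

arr = np.array([np.nan, np.nan])  # float64 array: values are copied into a C buffer
print(arr[0] is np.nan)           # False: indexing creates a new scalar object
print(arr.tolist()[0] is np.nan)  # False: tolist() also creates new Python float objects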