scikit-learn: OrdinalEncoder becomes slow in presence of numerous `nan` values
Describe the bug
I want to use `OrdinalEncoder` on a feature with ~10 categories where >99% of the values are NaN. Execution time is very slow: ~4 min for 1e5 rows. Strangely enough, if the feature has no missing values, fitting takes ~1 s.
Worth mentioning: if you use `category_encoders.ordinal.OrdinalEncoder` instead, the problem with the mostly-NaN categorical feature does not occur.
Steps/Code to Reproduce
```python
import datetime as dt

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# from category_encoders.ordinal import OrdinalEncoder  # no slowdown with this one


def xgb_cat():
    print('xgb cat')
    num_rows = int(1e5)
    x = pd.DataFrame()
    encoder = OrdinalEncoder()

    # Dense case: ~10 categories, no missing values.
    x['cat'] = [f'cat_{v}' for v in np.random.randint(10, size=num_rows)]
    start_ = dt.datetime.now()
    print('\ncat without nan')
    print('start fit')
    encoder.fit(x)
    end_ = dt.datetime.now()
    print(f"execution time: {end_ - start_}")
    # execution time: ~1s

    # Sparse case: same ~10 categories, but >99% of the values are NaN.
    x['cat'] = np.nan
    x.loc[0:99, 'cat'] = [f'cat_{v}' for v in np.random.randint(10, size=100)]
    start_ = dt.datetime.now()
    print('\ncat with >99% nan')
    print('start fit')
    encoder.fit(x)
    end_ = dt.datetime.now()
    print(f"execution time: {end_ - start_}")
    # execution time: ~4min


xgb_cat()
```
Expected Results
Execution time for the mostly-NaN column is also ~1 s.
Actual Results
Execution time for the mostly-NaN column is ~4 min.
Versions
```
System:
    python: 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]
executable: C:\--\python.exe
   machine: Windows-10-10.0.19043-SP0

Python dependencies:
      sklearn: 1.1.3
          pip: 22.3.1
   setuptools: 65.4.1
        numpy: 1.23.4
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.1
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\--\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\--\Lib\site-packages\numpy\.libs\libopenblas.FB5AE2TYXYH2IJRDKGDGQ3XBKLKTF43H.gfortran-win_amd64.dll
        version: 0.3.20
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 16

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\--\Lib\site-packages\scipy.libs\libopenblas-57db09cfe174768fb409a6bb5a530d4c.dll
        version: 0.3.18
threading_layer: pthreads
   architecture: Prescott
    num_threads: 16
```
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (16 by maintainers)
I think we can consider this a performance bug of Python itself and advise impacted users to upgrade their Python version, so that we can focus our efforts on improving scikit-learn itself.
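For impacted users who cannot upgrade Python, one possible workaround (my own sketch, not an official recommendation; the `'__missing__'` sentinel name is arbitrary) is to make sure no NaN objects ever reach the hash-based code paths:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Unofficial workaround sketch: fill NaN with a sentinel string so the
# encoder only ever hashes strings, which is fast on any Python version.
num_rows = 100_000
x = pd.DataFrame({'cat': np.full(num_rows, np.nan, dtype=object)})
x.loc[0:99, 'cat'] = [f'cat_{v}' for v in np.random.randint(10, size=100)]

x_filled = x.fillna('__missing__')       # '__missing__' is an arbitrary sentinel
encoder = OrdinalEncoder()
codes = encoder.fit_transform(x_filled)  # fast: no NaN objects are hashed
# If NaN must remain distinguishable, remap the sentinel's code afterwards.
```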
So just as a recap:
Python 3.10+ is fast here, so I don't think we should change the code for it. Maybe we should implement a fix only for older versions of Python, because the slowdown there is quite dramatic.
I would need input from @betatim, @adrinjalali and @thomasjpfan.
Interesting finding: the slowdown only occurs on Python < 3.10. I ran a new benchmark and could not reproduce it in my newly installed environment. The reason is:

> Hashes of NaN values of both float type and decimal.Decimal type now depend on object identity. Formerly, they always hashed to 0 even though NaN values are not equal to one another. This caused potentially quadratic runtime behavior due to excessive hash collisions when creating dictionaries and sets containing multiple NaNs. (Contributed by Raymond Hettinger in bpo-43475.)

cf. https://docs.python.org/3/whatsnew/3.10.html

I will have to benchmark again with the same setup on Python 3.10 to check the performance of the current implementation.
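A minimal sketch of the quadratic behavior described in that changelog entry (illustrative, not from the issue): on Python < 3.10 this can take minutes, while on 3.10+ it finishes almost instantly.

```python
import time

n = 100_000
distinct_nans = [float("nan") for _ in range(n)]  # n distinct NaN objects

start = time.perf_counter()
s = set(distinct_nans)
# Pre-3.10: every NaN hashes to 0, so each insertion collides with all
# previously inserted NaNs (NaN != NaN) -> O(n^2) probes.
# 3.10+: hash(nan) depends on id(nan) -> close to O(n).
print(f"built a set of {len(s)} NaNs in {time.perf_counter() - start:.2f}s")
```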
At first, `a` and `b` point to the same `np.nan` singleton, so identity and equality checks will work in a list or a NumPy array of dtype `object`, since you only store references to the singleton. Once you call `np.array`, the values are stored in a contiguous array of floating points, and you no longer refer to the singleton, so identity does not hold. Converting back to a list, each element points to a new object created from the NumPy array, with a different address, so identity does not hold there either. This is quite logical but not intuitive at first 😃
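A small sketch of that round trip (the variable names `a`, `b`, `arr`, `back` are illustrative):

```python
import numpy as np

a = np.nan
b = np.nan
print(a is b)              # True: both names reference the same float object

arr = np.array([a, b])     # float64 array stores raw values, not references
back = arr.tolist()        # tolist() creates brand-new Python float objects
print(back[0] is np.nan)   # False: identity to the singleton is lost
print(back[0] is back[1])  # False: two distinct NaN objects
print(back[0] == back[1])  # False: NaN != NaN per IEEE 754
```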