scikit-learn: OrdinalEncoder fit fails to encode big integers.
Describe the bug
In scikit-learn 0.24, several methods affecting OrdinalEncoder were added, including _unique_python in sklearn/utils/_encode. This method calls _extract_missing, which uses np.isnan(X) (via is_scalar_nan(X)): https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/__init__.py#L985 .
When the given array X consists of very large ints, e.g. 44253463435747313673, np.isnan() throws:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This error is caught by a try/except statement, which re-raises it with a much less obvious message:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int']
My input array contained only big integer values, but the message suggests that it contains mixed types.
I don't know whether this is expected behavior, but the message could be improved.
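The underlying failure can be reproduced without the encoder. The following is a minimal sketch, assuming only NumPy is installed; the variable names `big`, `arr` and `raised` are illustrative and do not come from scikit-learn. It shows that an integer too large for any native NumPy integer dtype ends up as `object`, and that `np.isnan` rejects it:

```python
import numpy as np

# Larger than the uint64 maximum (2**64 - 1), so no native NumPy
# integer dtype can hold it.
big = 44253463435747313673

# NumPy falls back to the generic object dtype for such values.
arr = np.array([big])
print(arr.dtype)

# np.isnan has no loop for object inputs, so it raises TypeError.
try:
    np.isnan(big)
    raised = False
except TypeError as exc:
    raised = True
    print(exc)
```

This suggests the confusing "Got ['int']" message is a side effect of the NaN check, not of the input mixing types.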
Steps/Code to Reproduce
Example:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
OE = OrdinalEncoder()
X = np.array([44253463435747313673, 9867966753463435747313673, 44253462342215747313673, 442534634357764313673]).reshape(-1, 1)
OE.fit(X)
Expected Results
Better error message or handling of big ints in OrdinalEncoder, as it worked in sklearn 0.23.2.
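To illustrate why handling big ints is well defined, here is a pure-Python sketch of ordinal encoding. It is not scikit-learn's implementation; `ordinal_encode` is a hypothetical helper written for this example. Python ints of any size can be compared and sorted, so each value can be mapped to its rank among the sorted unique values:

```python
def ordinal_encode(values):
    """Map each value to its index among the sorted unique values.

    Hypothetical helper for illustration only, not part of scikit-learn.
    """
    categories = sorted(set(values))
    index = {value: i for i, value in enumerate(categories)}
    return categories, [index[value] for value in values]


# The same values as in the reproduction snippet above.
X = [
    44253463435747313673,
    9867966753463435747313673,
    44253462342215747313673,
    442534634357764313673,
]
categories, codes = ordinal_encode(X)
print(codes)  # [0, 3, 2, 1]
```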
Actual Results
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
values = array([140114708448418632577632402066430035116,
140114708448418632577632402066430035116,
170172835760119...0666130,
170172835760119224333519554008280666130,
140114708448418632577632402066430035116], dtype=object)
    def _unique_python(values, *, return_inverse):
        # Only used in `_uniques`, see docstring there for details
        try:
            uniques_set = set(values)
            uniques_set, missing_values = _extract_missing(uniques_set)
            uniques = sorted(uniques_set)
            uniques.extend(missing_values.to_list())
            uniques = np.array(uniques, dtype=values.dtype)
        except TypeError:
            types = sorted(t.__qualname__
                           for t in set(type(v) for v in values))
>           raise TypeError("Encoders require their input to be uniformly "
                            f"strings or numbers. Got {types}")
E           TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int']
../../../../../../../opt/anaconda3/envs/sklearn24/lib/python3.7/site-packages/sklearn/utils/_encode.py:138: TypeError
Versions
Python dependencies:
pip: 21.0.1
setuptools: 49.6.0.post20200814
sklearn: 0.24.2
numpy: 1.19.2
scipy: 1.5.0
Cython: 0.29.21
pandas: 1.2.1
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.1.0
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (9 by maintainers)
@ogrisel That's correct: those values come from hashing and are used for ML, which is why they are encoded.
Since this is a regression, we probably need our _unique_python to handle types other than str then.
Yes, we will most probably refactor part of the code there.
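One way such a refactoring could avoid the TypeError is to skip the NaN test for integers altogether, since an integer can never be NaN. The sketch below is an assumption about what a fix might look like, not the maintainers' actual change; `is_scalar_nan_safe` is a hypothetical name.

```python
import math
import numbers


def is_scalar_nan_safe(x):
    """Return True only for real, non-integral scalars that are NaN.

    Hypothetical helper: integers are excluded before math.isnan is
    called, so arbitrarily large Python ints never reach a float
    conversion or a NumPy ufunc.
    """
    return (
        isinstance(x, numbers.Real)
        and not isinstance(x, numbers.Integral)
        and math.isnan(x)
    )


print(is_scalar_nan_safe(float("nan")))          # True
print(is_scalar_nan_safe(44253463435747313673))  # False
print(is_scalar_nan_safe("nan"))                 # False
```

Checking `numbers.Integral` first means the size of the integer no longer matters, and non-numeric values such as strings or None short-circuit on the `numbers.Real` check.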