scikit-learn: OrdinalEncoder fit fails to encode big integers.

Describe the bug

scikit-learn 0.24 added several helpers that affect OrdinalEncoder, including _unique_python in sklearn/utils/_encode.py. This function calls _extract_missing, which uses np.isnan(X) (via is_scalar_nan(X)): https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/__init__.py#L985 . When the input array X consists of very large integers, e.g. 44253463435747313673, np.isnan() raises:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This exception is caught by a try/except block and re-raised with a misleading message:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int']

My input array contained only big integer values, but the message suggests that it contains mixed types.

I don't know whether this is expected behavior, but the error message could be improved.
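To illustrate the failure mode described above, here is a minimal sketch of a scalar NaN check that does not go through np.isnan and therefore does not choke on integers too large for a fixed-width NumPy dtype. The helper name is_nan_scalar is hypothetical and is not scikit-learn's own function; it only shows one way such a check could be written with the standard library.

```python
import math
import numbers


def is_nan_scalar(x):
    """Return True only for real, non-integer scalars that are NaN.

    Hypothetical helper for illustration; not scikit-learn's is_scalar_nan.
    """
    # Integers can never be NaN, so return early instead of calling
    # math.isnan, which would first convert the int to a float.
    if isinstance(x, numbers.Integral):
        return False
    # Strings, None and other non-numeric objects fail the Real check.
    return isinstance(x, numbers.Real) and math.isnan(x)


big = 44253463435747313673  # exceeds 2**64, so NumPy holds it as a Python object
print(is_nan_scalar(big))           # False
print(is_nan_scalar(float("nan")))  # True
print(is_nan_scalar("abc"))         # False
```

Returning early for integers also avoids converting arbitrarily large ints to float, which can itself fail for extreme values.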

Steps/Code to Reproduce

Example:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np
OE = OrdinalEncoder()
X = np.array([44253463435747313673, 9867966753463435747313673, 44253462342215747313673, 442534634357764313673]).reshape(-1, 1)
OE.fit(X)

Expected Results

Either a clearer error message or support for big ints in OrdinalEncoder, which worked in sklearn 0.23.2.
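For reference, the encoding one would expect for the array in the reproduction can be sketched in plain Python: each value is mapped to its rank among the sorted unique values. The function ordinal_encode below is a hypothetical illustration, not scikit-learn's implementation, and it returns integer codes rather than the float array OrdinalEncoder produces.

```python
def ordinal_encode(values):
    """Map each value to its rank among the sorted unique values.

    Hypothetical illustration; Python ints have arbitrary precision,
    so sorting and hashing work regardless of magnitude.
    """
    categories = sorted(set(values))
    code_of = {value: code for code, value in enumerate(categories)}
    return categories, [code_of[v] for v in values]


values = [
    44253463435747313673,
    9867966753463435747313673,
    44253462342215747313673,
    442534634357764313673,
]
categories, codes = ordinal_encode(values)
print(codes)  # [0, 3, 2, 1]
```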

Actual Results

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

values = array([140114708448418632577632402066430035116,
       140114708448418632577632402066430035116,
       170172835760119...0666130,
       170172835760119224333519554008280666130,
       140114708448418632577632402066430035116], dtype=object)

    def _unique_python(values, *, return_inverse):
        # Only used in `_uniques`, see docstring there for details
        try:
            uniques_set = set(values)
            uniques_set, missing_values = _extract_missing(uniques_set)
    
            uniques = sorted(uniques_set)
            uniques.extend(missing_values.to_list())
            uniques = np.array(uniques, dtype=values.dtype)
        except TypeError:
            types = sorted(t.__qualname__
                           for t in set(type(v) for v in values))
>           raise TypeError("Encoders require their input to be uniformly "
                            f"strings or numbers. Got {types}")
E           TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int']

../../../../../../../opt/anaconda3/envs/sklearn24/lib/python3.7/site-packages/sklearn/utils/_encode.py:138: TypeError

Versions

Python dependencies:
pip: 21.0.1
setuptools: 49.6.0.post20200814
sklearn: 0.24.2
numpy: 1.19.2
scipy: 1.5.0
Cython: 0.29.21
pandas: 1.2.1
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.1.0

About this issue

State: closed
Created 3 years ago
Comments: 16 (9 by maintainers)

Most upvoted comments

@ogrisel That's correct: those values come from hashing and are used for ML, which is why they are encoded.

szymonkucharczyk on Aug 10, 2021

Since this is a regression, we probably need our _unique_python to handle other types than str then.

However, it is a regression, as it worked in sklearn 0.23.2 and does not work with 0.24. It also fails regardless of the numpy version.

Yes, we will most probably refactor part of the code there.

glemaitre on Aug 10, 2021