scikit-learn: Classifiers may not work with arrays defining __array_function__

Description

With NEP-18, numpy functions that previously converted an array-like to an ndarray may no longer do the (implicit) conversion. dask.array recently implemented __array_function__ so np.unique(dask.array.Array) now returns a dask.array.Array.

Some more details are in https://github.com/dask/dask-ml/issues/541.
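
For reference, here is a minimal sketch of how NEP-18 dispatch works, using a hypothetical DuckArray class (illustration only, not dask's actual implementation): once a type defines __array_function__, NumPy hands the call to that method instead of coercing the input with np.asarray.

import numpy as np

class DuckArray:
    # Hypothetical minimal NEP-18 duck array, for illustration only.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this instead of coercing us to an ndarray:
        # unwrap the arguments, apply the real function, re-wrap the result.
        unwrapped = [a.data if isinstance(a, DuckArray) else a for a in args]
        return DuckArray(func(*unwrapped, **kwargs))

result = np.unique(DuckArray([1, 1, 2]))
print(type(result))  # <class '__main__.DuckArray'>, not numpy.ndarray

Before NEP-18, functions like np.unique coerced such inputs via np.asarray instead, which is the behavior scikit-learn was implicitly relying on.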

Steps/Code to Reproduce

import dask.array as da
import dask_ml.datasets
import sklearn.linear_model

# X and y are dask arrays, not numpy ndarrays
X, y = dask_ml.datasets.make_classification(chunks=50)

clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)

Expected Results

No error; the same output as clf.fit(X.compute(), y.compute()), or as running with the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION='0' set.
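
Note that the opt-out only takes effect if the variable is set before NumPy is imported, since NumPy 1.17 reads it at import time. A minimal sketch:

import os

# NumPy 1.17 reads this flag at import time; '0' disables NEP-18
# dispatch, so np.unique coerces array-likes with np.asarray again.
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "0"

import numpy as np  # must come after the environment variable is set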

Actual Results

That raises

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)

~/Envs/dask-dev/lib/python3.7/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1536
   1537         multi_class = _check_multi_class(self.multi_class, solver,
-> 1538                                          len(self.classes_))
   1539
   1540         if solver == 'liblinear':

TypeError: 'float' object cannot be interpreted as an integer

This happens because self.classes_ = np.unique(y) is now a Dask Array with unknown length:

In [2]: np.unique(da.arange(12))
Out[2]: dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>

since Dask is lazy and doesn’t know the unique elements until compute time.
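
The failure can be reproduced without scikit-learn at all; a minimal sketch (the exact exception raised by len may vary by Dask version):

import dask.array as da
import numpy as np

u = np.unique(da.arange(12) % 3)  # output size depends on the values
print(u.shape)  # (nan,) -- unknown until compute time
len(u)          # fails, e.g. TypeError: 'float' object cannot be interpreted as an integer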

Versions

System:
    python: 3.7.3 (default, Apr  5 2019, 14:56:38)  [Clang 10.0.1 (clang-1001.0.46.3)]
executable: /Users/taugspurger/Envs/dask-dev/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python deps:
       pip: 19.2.1
setuptools: 41.0.1
   sklearn: 0.21.3
     numpy: 1.18.0.dev0+5e7e74b
     scipy: 1.2.0
    Cython: 0.29.9
    pandas: 0.25.0+169.g5de4e55d6

I think reproducing this needs NumPy>=1.17 and Dask>=2.0.0.


Possible solution: explicitly convert array-likes to concrete ndarrays where necessary (though determining where is a bit hard). For example, https://github.com/scikit-learn/scikit-learn/blob/148491867920cc2af0e7e5700a0299be4a5d1c9f/sklearn/linear_model/logistic.py#L1517 would become self.classes_ = np.asarray(np.unique(y)). That may not be ideal for other libraries implementing __array_function__ (like pydata/sparse).
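
As a sketch, the fix would look like the following (the helper name is mine, not scikit-learn's; note that for a dask array, np.asarray triggers __array__ and therefore an eager compute):

import numpy as np

def _unique_classes(y):
    # Sketch of the proposed fix: under NEP-18, np.unique(y) may return
    # a duck array, so coerce it to a concrete ndarray before len() or
    # integer indexing is used. For a dask array this computes eagerly.
    return np.asarray(np.unique(y))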

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Most upvoted comments

NumPy itself doesn’t really impose any restrictions on what you can do with __array_function__. I think it would be perfectly reasonable to error when length is NaN.

I would definitely coerce everything to NumPy arrays in check_array. As a first pass that’s definitely the right thing to do. I’m a little surprised that wasn’t happening already. Duck array support is something you want to add intentionally, not accidentally.
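
For reference, a minimal illustration of the coercion check_array performs for list-likes, which the comment suggests applying uniformly to duck arrays as well:

import numpy as np
from sklearn.utils import check_array

X = check_array([[1.0], [2.0]])  # list-like in, concrete ndarray out
print(type(X))                   # <class 'numpy.ndarray'>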

And I feel like len returning NaN implies that dask doesn't actually implement the protocol fully, right?

Do you mean the __array_function__ protocol, or Python’s data model? Most of the time Dask arrays have a known shape, so len will return an int. But when you do an operation where the size of the output depends on the values of the data (like np.unique), the length is unknown.
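
To illustrate the distinction; newer Dask releases also provide compute_chunk_sizes() to materialize unknown chunk sizes (a sketch, assuming a Dask version that has that method):

import dask.array as da
import numpy as np

x = da.arange(12, chunks=4)
print(len(x))  # 12 -- the shape is known without computing anything

u = np.unique(x % 3)         # data-dependent output: shape is (nan,)
u = u.compute_chunk_sizes()  # compute just the chunk sizes (newer Dask)
print(len(u))                # 3 -- the length is now known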