scikit-learn: Classifiers may not work with arrays defining __array_function__
Description
With NEP 18, NumPy functions that previously converted an array-like to an ndarray may no longer do that (implicit) conversion. dask.array recently implemented __array_function__, so np.unique(dask.array.Array) now returns a dask.array.Array.
Some more details are in https://github.com/dask/dask-ml/issues/541
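The dispatch mechanism can be illustrated without Dask. Below is a minimal duck array implementing the NEP 18 __array_function__ protocol (DuckArray is a hypothetical stand-in for dask.array.Array, not Dask's actual implementation):

```python
import numpy as np

class DuckArray:
    """Minimal NEP 18 duck array: NumPy functions return DuckArray, not ndarray."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap DuckArray arguments, call the NumPy function on the
        # underlying data, and re-wrap the result in a DuckArray.
        unwrapped = [a.data if isinstance(a, DuckArray) else a for a in args]
        return DuckArray(func(*unwrapped, **kwargs))

    def __array__(self, dtype=None):
        # np.asarray() falls back to this, producing a concrete ndarray.
        return np.asarray(self.data, dtype=dtype)

x = DuckArray([3, 1, 3, 2])
u = np.unique(x)        # dispatches to DuckArray.__array_function__
print(type(u))          # DuckArray, not ndarray
print(np.asarray(u))    # explicit coercion yields a concrete ndarray
```

This is the behavior change the report describes: before NumPy 1.17 (or with the protocol disabled), np.unique would have coerced its argument to an ndarray instead of dispatching.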
Steps/Code to Reproduce
import dask.array as da
import dask_ml.datasets
import sklearn.linear_model

# X and y are chunked (lazy) dask arrays, not ndarrays
X, y = dask_ml.datasets.make_classification(chunks=50)
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)  # raises TypeError, see below
Expected Results
No error, with the same output as clf.fit(X.compute(), y.compute()), or as obtained by setting the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION='0'.
Actual Results
This raises:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-b0953fbb1d6e> in <module>
----> 1 clf.fit(X, y)
~/Envs/dask-dev/lib/python3.7/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
1536
1537 multi_class = _check_multi_class(self.multi_class, solver,
-> 1538 len(self.classes_))
1539
1540 if solver == 'liblinear':
TypeError: 'float' object cannot be interpreted as an integer
This is because self.classes_ = np.unique(y) is a Dask Array with unknown length:
In [2]: np.unique(da.arange(12))
Out[2]: dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,), chunktype=numpy.ndarray>
since Dask is lazy and doesn't know the unique elements until compute time.
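The exact failure mode can be reproduced without Dask: Python's len() requires __len__ to return an int, and a NaN shape entry is a float. A minimal sketch (LazyArray is a hypothetical stand-in, not Dask's implementation):

```python
class LazyArray:
    """Toy lazy array: __len__ returns shape[0], as array libraries typically do."""

    def __init__(self, shape):
        self.shape = shape

    def __len__(self):
        # CPython requires __len__ to return an int; nan is a float.
        return self.shape[0]

known = LazyArray((12,))
print(len(known))  # known shape: len() works and returns 12

unknown = LazyArray((float("nan"),))
try:
    len(unknown)
except TypeError as e:
    print(e)  # 'float' object cannot be interpreted as an integer
```

This is the same TypeError that len(self.classes_) produces in the traceback above.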
Versions
System:
python: 3.7.3 (default, Apr 5 2019, 14:56:38) [Clang 10.0.1 (clang-1001.0.46.3)]
executable: /Users/taugspurger/Envs/dask-dev/bin/python
machine: Darwin-18.6.0-x86_64-i386-64bit
Python deps:
pip: 19.2.1
setuptools: 41.0.1
sklearn: 0.21.3
numpy: 1.18.0.dev0+5e7e74b
scipy: 1.2.0
Cython: 0.29.9
pandas: 0.25.0+169.g5de4e55d6
I think this needs NumPy>=1.17 and Dask>=2.0.0.
Possible solution: Explicitly convert array-likes to concrete ndarrays where necessary (though determining where is a bit hard). For example, https://github.com/scikit-learn/scikit-learn/blob/148491867920cc2af0e7e5700a0299be4a5d1c9f/sklearn/linear_model/logistic.py#L1517 would become self.classes_ = np.asarray(np.unique(y)). That may not be ideal for other libraries implementing __array_function__ (like pydata/sparse, where coercion to a dense ndarray may be undesirable).
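A minimal sketch of that suggestion (unique_classes is a hypothetical helper, not scikit-learn code; shown here with a plain ndarray since Dask may not be installed, though with a Dask array np.asarray would trigger computation via __array__):

```python
import numpy as np

def unique_classes(y):
    """Return the sorted unique labels of y as a concrete ndarray.

    Under NEP 18, np.unique(y) may return a duck array (e.g. a Dask
    Array with unknown length); np.asarray forces materialization.
    """
    return np.asarray(np.unique(y))

classes = unique_classes(np.array([0, 1, 1, 0, 2]))
print(classes)       # [0 1 2]
print(len(classes))  # 3 -- len() works on the concrete ndarray
```

For a plain ndarray the np.asarray call is a no-op; it only matters when the input dispatches np.unique to a duck type.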
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (18 by maintainers)
NumPy itself doesn't really impose any restrictions on what you can do with __array_function__. I think it would be perfectly reasonable to error when length is NaN.

I would definitely coerce everything to NumPy arrays in check_array. As a first pass that's definitely the right thing to do. I'm a little surprised that wasn't happening already. Duck array support is something you want to add intentionally, not accidentally.

Do you mean the __array_function__ protocol, or Python's data model? Most of the time Dask arrays have a known shape, so len will return an int. But when you do an operation where the size of the output depends on the values of the data (like np.unique), the length is unknown.