dask-ml: train_test_split() raises "ValueError: high is out of bounds for int32" when drawing random seeds
I get this error when executing the following code:
import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))
Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0
X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X
dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
276 train_size=train_size, blockwise=blockwise,
277 random_state=random_state)
--> 278 train_idx, test_idx = next(splitter.split(*arrays))
279
280 train_test_pairs = ((_blockwise_slice(arr, train_idx),
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
137 for i in range(self.n_splits):
138 if self.blockwise:
--> 139 yield self._split_blockwise(X)
140 else:
141 yield self._split(X)
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
144 chunks = X.chunks[0]
145 rng = check_random_state(self.random_state)
--> 146 seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
147
148 train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,
mtrand.pyx in mtrand.RandomState.randint()
ValueError: high is out of bounds for int32
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (8 by maintainers)
Thanks for the report. I assume your Python is 32-bit? We don’t do any testing with 32-bit builds.
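For anyone wanting to answer the bitness question on their own machine, here is a small sketch. It reports the interpreter's pointer size and the width of the C long type as NumPy sees it. Before NumPy 2.0, `RandomState.randint` defaults to the C long dtype, which is 32 bits on Windows even under a 64-bit Python, so a 64-bit interpreter does not by itself rule this error out. The variable names are illustrative only.

```python
import struct
import sys

import numpy as np

# Pointer size tells us whether the interpreter itself is a 32- or 64-bit build.
pointer_bits = struct.calcsize("P") * 8

# Width of the C long type as seen by NumPy ("l" is the C long type code).
# On Windows this is 32 bits even when Python is a 64-bit build.
c_long_bits = np.dtype("l").itemsize * 8

print("platform:", sys.platform)
print("Python build: {}-bit".format(pointer_bits))
print("NumPy C long: {}-bit".format(c_long_bits))
```

If the last line reports 32, an upper bound of 2**32 - 1 does not fit into that type.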
Anyway, the bug is that
seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
should be
Any interest in making a PR to fix it? Otherwise I’ll get to it later today or tomorrow.
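As a rough illustration of the failure and one possible way around it (not necessarily the change that ended up in dask-ml), the sketch below passes an explicit `dtype` to `RandomState.randint` so that the upper bound fits on every platform. The helper names `draw_chunk_seeds` and `fails_with_int32` are hypothetical and do not exist in dask-ml.

```python
import numpy as np


def draw_chunk_seeds(random_state, n_chunks):
    """Hypothetical helper: draw one seed per chunk using an explicit 64-bit dtype."""
    rng = np.random.RandomState(random_state)
    # With dtype=np.int64 the bound 2**32 - 1 is representable on every platform.
    return rng.randint(0, 2**32 - 1, size=n_chunks, dtype=np.int64)


def fails_with_int32(random_state, n_chunks):
    """Reproduce the reported error by forcing a 32-bit integer dtype."""
    rng = np.random.RandomState(random_state)
    try:
        rng.randint(0, 2**32 - 1, size=n_chunks, dtype=np.int32)
    except ValueError:
        # Same kind of error as in the traceback: the bound exceeds the int32 range.
        return True
    return False


seeds = draw_chunk_seeds(0, 5)
print(seeds.dtype, seeds.shape)
print("int32 draw fails:", fails_with_int32(0, 5))
```

Forcing `dtype=np.int32` reproduces the error on any platform, which makes the problem testable even without a Windows or 32-bit build.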