scikit-learn: extend StratifiedKFold to float for regression
It is important to stratify the samples according to y for cross-validation in regression models; otherwise, you might get totally different ranges of y in the training and validation sets. However, the current StratifiedKFold does not allow float targets:
```
$ x = sklearn.cross_validation.StratifiedKFold(np.random.random(9), 2)
/anaconda/envs/py3/lib/python3.4/site-packages/sklearn/cross_validation.py:417: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=2.
  % (min_labels, self.n_folds)), Warning)
$ list(x)
[(array([], dtype=int64), array([0, 1, 2, 3, 4, 5, 6, 7, 8])),
 (array([0, 1, 2, 3, 4, 5, 6, 7, 8]), array([], dtype=int64))]
```
In case I am missing something: is there a reason why StratifiedKFold does not work properly for float targets?
About this issue
- Original URL
- State: open
- Created 9 years ago
- Reactions: 3
- Comments: 52 (24 by maintainers)
I wanted this feature for a project and came up with a straightforward solution that stratifies with KBinsDiscretizer. The test is far from comprehensive, but it seems to work. I'm not sure what the protocol is for helping with issues like this, but I hope this is useful. Edit: updated the code to use KBinsDiscretizer.
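The commenter's original code was not preserved here, but the idea can be sketched as follows (this is an illustrative reconstruction, not their exact code): bin the continuous target with KBinsDiscretizer, then hand the bin labels to StratifiedKFold so each fold covers the full range of y.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = rng.random_sample(100)  # continuous regression target

# Discretize y into quantile bins; the ordinal bin index acts as a class label.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y_binned):
    # Each fold now spans roughly the same range of y.
    print(y[train_idx].min(), y[train_idx].max())
```

The quantile strategy keeps the bins roughly equally populated, which avoids the "least populated class" warning shown above.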
If it helps, a simple way to implement this, as suggested above by @amueller, would be to define a new cross-validator class that inherits from StratifiedKFold and overrides the split method. Something like this:
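The code that followed this comment was not preserved, so here is a minimal sketch of the suggestion (class name and bin-count heuristic are illustrative assumptions): a cross-validator that inherits from StratifiedKFold and overrides split() to bin the continuous target before stratifying.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

class StratifiedKFoldReg(StratifiedKFold):
    """StratifiedKFold for continuous targets: stratify on quantile bins of y."""

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        # Heuristic: a few bins per fold, capped at the sample count.
        n_bins = min(self.n_splits * 2, len(y))
        # Interior quantile edges; digitize maps each y to its bin index.
        edges = np.percentile(y, np.linspace(0, 100, n_bins + 1)[1:-1])
        y_binned = np.digitize(y, edges)
        return super().split(X, y_binned, groups)

# Usage: pass the continuous target directly, as with any cross-validator.
X = np.random.rand(30, 2)
y = np.random.rand(30)
cv = StratifiedKFoldReg(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    pass
```

Note that duplicate target values can collapse quantile edges into identical bins, which is presumably the precision issue mentioned in the P.S. below.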
I have also created a Colab notebook where one can see how it works: Stratified KFold in regression setups demo
P.S. I edited this message because I realized the previous version could fail when there are repeated observations, due to the limited precision of the target variable.
Totally different ranges are unlikely if you shuffle your data and the dataset is not very small, by the way.