scikit-learn: extend StratifiedKFold to float for regression

It is important to stratify the samples according to y for cross-validation in regression models; otherwise, you might get totally different ranges of y in the training and validation sets. However, the current StratifiedKFold doesn't allow float targets:


>>> import numpy as np
>>> import sklearn.cross_validation
>>> x = sklearn.cross_validation.StratifiedKFold(np.random.random(9), 2)
/anaconda/envs/py3/lib/python3.4/site-packages/sklearn/cross_validation.py:417: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=2.
  % (min_labels, self.n_folds)), Warning)

>>> list(x)
[(array([], dtype=int64), array([0, 1, 2, 3, 4, 5, 6, 7, 8])),
 (array([0, 1, 2, 3, 4, 5, 6, 7, 8]), array([], dtype=int64))]

In case I'm missing something: is there any reason why StratifiedKFold does not work properly for float targets?

About this issue

  • Original URL
  • State: open
  • Created 9 years ago
  • Reactions: 3
  • Comments: 52 (24 by maintainers)

Most upvoted comments

I wanted this feature for a project and came up with this straightforward solution for stratifying with KBinsDiscretizer. The test is far from comprehensive, but at least it seems to be working. I'm not sure what the protocol is for attempting to help with issues like this, but I hope this is helpful. Edit: updated the code to use KBinsDiscretizer.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

class regressor_stratified_cv:
    """Repeated stratified k-fold for regression: discretizes y with
    KBinsDiscretizer and stratifies on the resulting bin labels."""

    def __init__(self, n_splits=10, n_repeats=2, group_count=10,
                 random_state=0, strategy='quantile'):
        self.group_count = group_count
        self.strategy = strategy
        self.cvkwargs = dict(n_splits=n_splits, n_repeats=n_repeats,
                             random_state=random_state)
        self.cv = RepeatedStratifiedKFold(**self.cvkwargs)
        self.discretizer = KBinsDiscretizer(n_bins=self.group_count,
                                            encode='ordinal',
                                            strategy=self.strategy)

    def split(self, X, y, groups=None):
        # bin the continuous target into ordinal labels and stratify on them
        kgroups = self.discretizer.fit_transform(y[:, None])[:, 0]
        return self.cv.split(X, kgroups, groups)

    def get_n_splits(self, X, y, groups=None):
        return self.cv.get_n_splits(X, y, groups)
    
    
if __name__ == "__main__":
    n_splits = 5
    n_repeats = 5
    group_count = 5
    cv = regressor_stratified_cv(n_splits=n_splits, n_repeats=n_repeats,
                                 group_count=group_count, random_state=0,
                                 strategy='uniform')
    n = 1000000
    y = np.linspace(-n // 2, n // 2, n + 1)
    n = y.size
    np.random.shuffle(y)
    X = y.copy()[:, None]  # make 2d

    i = 0; j = 0; splist = []; test_idx_list = []
    for train, test in cv.split(X, y):
        if i == 0: print(f'cv results for *test* set {j} ')
        # range of y over the training indices of this fold
        range_i = np.ptp(y[train])
        splist.append(range_i)
        test_idx_list.append(train)
        print(f'range for rep:{j}, fold:{i}, {range_i}')
        i += 1
        if i == n_splits:
            test_unique_count = np.size(np.unique(np.concatenate(test_idx_list)))
            print(f'range of ranges, {np.ptp(np.array(splist))}')
            print(f'unique elements:{test_unique_count} for n:{n}', '\n')
            splist = []; test_idx_list = []
            i = 0; j += 1

which prints out

range for rep:0, fold:0, 1000000.0
range for rep:0, fold:1, 999997.0
range for rep:0, fold:2, 1000000.0
range for rep:0, fold:3, 1000000.0
range for rep:0, fold:4, 1000000.0
range of ranges, 3.0
unique elements:1000001 for n:1000001 

cv results for *test* set 1 
range for rep:1, fold:0, 999999.0
range for rep:1, fold:1, 999999.0
range for rep:1, fold:2, 1000000.0
range for rep:1, fold:3, 1000000.0
range for rep:1, fold:4, 1000000.0
range of ranges, 1.0
unique elements:1000001 for n:1000001 

cv results for *test* set 2 
range for rep:2, fold:0, 1000000.0
range for rep:2, fold:1, 1000000.0
range for rep:2, fold:2, 999999.0
range for rep:2, fold:3, 1000000.0
range for rep:2, fold:4, 999999.0
range of ranges, 1.0
unique elements:1000001 for n:1000001 

cv results for *test* set 3 
range for rep:3, fold:0, 1000000.0
range for rep:3, fold:1, 999999.0
range for rep:3, fold:2, 999999.0
range for rep:3, fold:3, 1000000.0
range for rep:3, fold:4, 1000000.0
range of ranges, 1.0
unique elements:1000001 for n:1000001 

cv results for *test* set 4 
range for rep:4, fold:0, 1000000.0
range for rep:4, fold:1, 1000000.0
range for rep:4, fold:2, 1000000.0
range for rep:4, fold:3, 1000000.0
range for rep:4, fold:4, 999997.0
range of ranges, 3.0
unique elements:1000001 for n:1000001 
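
For completeness, here is a minimal usage sketch (not from the original comment): because the class above implements split and get_n_splits, it can be passed directly as cv to scikit-learn helpers such as cross_val_score, which forwards y to split(). make_regression and Ridge are just placeholder choices here.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

cv = regressor_stratified_cv(n_splits=5, n_repeats=2, group_count=10,
                             random_state=0, strategy='quantile')

# cross_val_score passes y through to cv.split, where it gets binned
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring='r2')
print(scores.mean(), scores.std())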

If it helps, a simple way to implement this, as suggested above by @amueller, would be to define a new cross-validator class that inherits from StratifiedKFold and overrides the split method. Something like this:

import numpy as np
from sklearn.model_selection import StratifiedKFold

class StratifiedKFoldReg(StratifiedKFold):
    """Cross-validator for regression setups.

    Generates cross-validation partitions such that each partition
    resembles the original distribution of the target variable, by
    sorting the target and discretizing it into bins of quasi-equal size.
    """

    def split(self, X, y, groups=None):

        n_samples = len(y)

        # Number of labels to discretize the target variable into,
        # one label per bin of n_splits points
        n_labels = int(np.floor(n_samples / self.n_splits))

        # Assign a label to each bin of n_splits points
        y_labels_sorted = np.concatenate(
            [np.repeat(ii, self.n_splits) for ii in range(n_labels)])

        # Number of points that fall outside the equally-sized bins
        mod = np.mod(n_samples, self.n_splits)

        # Find the index of the first occurrence of each unique label
        _, labels_idx = np.unique(y_labels_sorted, return_index=True)

        # Randomly pick the labels to which the leftover points are assigned
        rand_label_ix = np.random.choice(labels_idx, mod, replace=False)

        # Insert the leftover points at the beginning of the chosen bins
        y_labels_sorted = np.insert(y_labels_sorted, rand_label_ix,
                                    y_labels_sorted[rand_label_ix])

        # Map each element of y to its label in the sorted array of labels
        map_labels_y = dict()
        for ix, label in zip(np.argsort(y), y_labels_sorted):
            map_labels_y[ix] = label

        # Put the labels back in the original order of y
        y_labels = np.array([map_labels_y[ii] for ii in range(n_samples)])

        return super().split(X, y_labels, groups)

I have also created a Colab notebook where one can see how it works: Stratified KFold in regression setups demo

P.S. This is an edited message because I realized that the previous version could fail when there are repeated observations due to the lack of precision of the target variable.
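
As a quick sanity check (a sketch, not part of the original comment), the class behaves like any other cross-validator, and each test fold should span roughly the full range of y; make_regression and Ridge are arbitrary placeholders:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=103, n_features=4, noise=5.0, random_state=0)

cv = StratifiedKFoldReg(n_splits=5, shuffle=True, random_state=0)
for fold, (train, test) in enumerate(cv.split(X, y)):
    print(f'fold {fold}: y_test range [{y[test].min():.1f}, {y[test].max():.1f}]')

# it also plugs into the usual helpers
scores = cross_val_score(Ridge(), X, y, cv=cv)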

Totally different ranges are unlikely if you shuffle your data and the dataset is not very small, by the way.
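
To illustrate the point (again a sketch, not from the thread): with a sorted, skewed target, plain KFold produces folds that cover very different slices of y, whereas KFold(shuffle=True) already gives comparable per-fold ranges.

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
y = np.sort(rng.lognormal(size=1000))  # skewed target, sorted = worst case for unshuffled folds
X = rng.randn(1000, 3)

for shuffle in (False, True):
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    ranges = [np.ptp(y[test]) for _, test in kf.split(X)]
    print('shuffle =', shuffle, '-> per-fold y ranges:', np.round(ranges, 2))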