scipy: Kolmogorov Smirnov 2 samples returning max distance location [Enhancement]

2 sample Kolmogorov Smirnov test (KS test) is about finding the maximal vertical distance between two empirical cumulative distributions; a large max distance is unlikely under the null assumption that the two empirical distributions are sampled from the same theoretical curve. This is implemented in scipy.stats.ks_twosamp.

The function calculates but does not return the location of the cumulative at which the max distance happens, but that would be extremely useful for a number of machine learning algorithms, e.g. to get an initial guess for a classifier between the two kinds of data.

I propose to add an optional argument to the function to return the location of the max distance between the distributions in terms of indices of the data and also value of the quantitative variable.

Reproducing code example:

Here a rough initial implementation for example purposes. The final few lines are the only ones that differ from the current implementation.

def max_distance_cumulative(data1, data2):
    """
    Computes the maximal vertical distance between cumulative distributions
    (this is the statistic for KS tests). Code mostly copied from
    scipy.stats.ks_twosamp

    Parameters
    ----------
    data1 : array_like
        First data set
    data2 : array_like
        Second data set
    Returns
    -------
    d : float
        Max distance, i.e. value of the Kolmogorov Smirnov test. Sign is + if
        the cumulative of data1 < the one of data2 at that location, else -.
    x : float
        Value of x where maximal distance d is reached.
    """
    from numpy import ma

    (data1, data2) = (ma.asarray(data1), ma.asarray(data2))
    (n1, n2) = (data1.count(), data2.count())
    mix = ma.concatenate((data1.compressed(), data2.compressed()))
    mixsort = mix.argsort(kind='mergesort')
    csum = np.where(mixsort < n1, 1./n1, -1./n2).cumsum()

    # Check for ties
    if len(np.unique(mix)) < (n1+n2):
        ind = np.r_[np.diff(mix[mixsort]).nonzero()[0], -1]
        csum = csum[ind]
        mixsort = mixsort[ind]

    csumabs = ma.abs(csum)
    i = csumabs.argmax()

    d = csum[i]
    # mixsort[i] contains the index of mix with the max distance
    x = mix[mixsort[i]]

    return (d, x)

I’m happy to make a PR if there is interest for this.

About this issue

Original URL
State: closed
Created 6 years ago
Reactions: 1
Comments: 17 (15 by maintainers)

Most upvoted comments

given that there is interest in this functionality: instead of changing ks_2samp, what about adding a new function, e.g. ks_dist that returns the distance and the location?

chrisb83 on Jul 9, 2020

I’m still happy to make a PR for this if there is hope. And since we are on the topic, it would seem reasonable to add and option to ks_2samp that returns the KS statistic with a sign instead of the absolute value. Would be SO USEFUL in many occasions.

iosonofabio on Jun 15, 2020

@iosonofabio Thanks for your code. It is exactly what I was looking for. However, I have a question about the code. I am feeding two CDF as numpy to the code, but I keep getting an error of " ‘numpy.ndarray’ object has no attribute ‘count’". I have tried data1[‘Values’].count() as well. However, I still give the same error. I wonder if you or anyone else who had the same issue can guide me on how to get rid of this error.

Thank you so much.

HomaiRS on Jun 10, 2019