scipy: BUG: power_divergence is raising on master when it does not in 1.6 or earlier

This issue is about a bug in the unreleased SciPy 1.7.0.dev0+9b9f2e8. It shows up in the statsmodels pip-pre run.

Reproducing code example:

    from numpy import array
    from scipy import stats

    f_obs = array([44, 74, 48, 24, 8, 2])
    f_exp = array([43.93623589, 71.60431015, 52.20297275,
                   23.22015199, 7.16316534, 1.59334208])
    chi2 = stats.chisquare(f_obs, f_exp)

Error message:

>               raise ValueError(msg)
E  ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
E  0.0014010692380163618
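
For reference, a minimal sketch of where the reported figure comes from, plus one possible workaround. It assumes that rescaling f_exp to the observed total is acceptable for the test at hand; that is an assumption of this sketch, not something the report establishes. The names rel_diff and f_exp_scaled are illustrative only.

```python
import numpy as np
from scipy import stats

f_obs = np.array([44, 74, 48, 24, 8, 2])
f_exp = np.array([43.93623589, 71.60431015, 52.20297275,
                  23.22015199, 7.16316534, 1.59334208])

# The observed total is 200, the expected total is slightly smaller,
# and the relative difference is far above the 1e-08 tolerance.
rel_diff = abs(f_obs.sum() - f_exp.sum()) / min(f_obs.sum(), f_exp.sum())
print(f_obs.sum(), f_exp.sum(), rel_diff)

# Assumption: it is acceptable to rescale the expected frequencies
# so that both totals agree before calling chisquare.
f_exp_scaled = f_exp * f_obs.sum() / f_exp.sum()
chi2, pval = stats.chisquare(f_obs, f_exp_scaled)
print(chi2, pval)
```

Whether rescaling is statistically appropriate depends on why the totals differ in the first place, so treat it as a stopgap rather than a general fix.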

Scipy/Numpy/Python version information:

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 34 (20 by maintainers)

Most upvoted comments

Your implementation is just incorrect. Even the book that you cite deals with the truncation. I also think you are missing a (1-exp) term. See the section on the Gap test in Knuth’s Seminumerical Algorithms, probably the most authoritative source, for the correct formulae.

A guess based on a quick code check:

The skidmarks.gap_test function truncates the array of expected observations. So, AFAICS, egaps will not add up to one:

    egaps = [l * (exp ** ii) for ii in range(1, len(ogaps) + 1)]
    chi, pval = chisquare(np.array(ogaps), np.array(egaps))

I had a similar problem in statsmodels. I truncated the expected counts, e.g. for Poisson; the discrepancy was negligible before the change in SciPy, but after it the check raised in some cases. My fix, AFAIR, was to add the missing probability to the last count, which also makes sense for the chi-square test when the truncated amount is non-negligible.
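
A minimal sketch of that kind of fix for a truncated Poisson, to make the idea concrete. This is not the actual statsmodels code; the mean mu, the number of cells n_cells, and the simulated sample are all hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = 2.0       # hypothetical Poisson mean
n_cells = 6    # cells for counts 0..4, the last cell means ">= 5"
sample = rng.poisson(mu, size=500)

# Observed counts, with every value >= n_cells - 1 pooled into the last cell.
f_obs = np.bincount(np.minimum(sample, n_cells - 1), minlength=n_cells)

# Truncated pmf: on its own it sums to slightly less than 1.
probs = stats.poisson.pmf(np.arange(n_cells), mu)

# Add the missing tail probability to the last cell so probs sums to 1.
probs[-1] += 1.0 - probs.sum()

# Expected counts now have the same total as the observed counts.
f_exp = probs * f_obs.sum()
chi2, pval = stats.chisquare(f_obs, f_exp)
print(chi2, pval)
```

The last cell then represents the whole tail rather than a single value, so the observed counts have to be pooled the same way, as done with np.minimum above.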

Also, the chi-square test requires the expected count in each cell to be large enough, so relying on tiny cells is not good for p-values either.
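
One way to avoid tiny cells is to pool them, sketched below. The helper pool_small_tail is hypothetical (it is not part of SciPy or statsmodels), it only merges cells from the tail, and the threshold of 5 is the common rule of thumb rather than anything stated in this thread.

```python
import numpy as np


def pool_small_tail(f_obs, f_exp, min_expected=5.0):
    """Merge trailing cells into their left neighbour until the last
    expected count reaches min_expected (hypothetical helper)."""
    f_obs = [float(x) for x in f_obs]
    f_exp = [float(x) for x in f_exp]
    while len(f_exp) > 1 and f_exp[-1] < min_expected:
        # Remove the last cell, then add it to the new last cell.
        last_exp = f_exp.pop()
        last_obs = f_obs.pop()
        f_exp[-1] += last_exp
        f_obs[-1] += last_obs
    return np.array(f_obs), np.array(f_exp)


# Data from the report at the top of this issue.
f_obs = [44, 74, 48, 24, 8, 2]
f_exp = [43.93623589, 71.60431015, 52.20297275,
         23.22015199, 7.16316534, 1.59334208]

pooled_obs, pooled_exp = pool_small_tail(f_obs, f_exp)
print(pooled_obs)
print(pooled_exp)
```

Pooling leaves both totals unchanged, so it does not by itself resolve a mismatch between the observed and expected sums.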

Folks, for instance, with stats.chisquare([214598, 241153], [227877, 227877]) both vectors sum up to 455754, but I get this error:

For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
6.582541782683966e-06

Please find a way, as that’s a bug, not a feature!

Actually, the first sums to 455751 and the second to 455754, hence the error.
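
A quick check of those sums, as a sketch. Dividing the difference by the smaller total is an assumption about how the figure is computed, but it does reproduce the number in the error message.

```python
import numpy as np

f_obs = np.array([214598, 241153])
f_exp = np.array([227877, 227877])

total_obs = int(f_obs.sum())   # 455751
total_exp = int(f_exp.sum())   # 455754
print(total_obs, total_exp)

# Assumption: the difference is taken relative to the smaller total.
rel_diff = abs(total_obs - total_exp) / min(total_obs, total_exp)
print(rel_diff)
```

A difference of 3 out of roughly 455,000 is about 6.6e-06, well above the 1e-08 tolerance.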