scipy: BUG: power_divergence is raising on master when it does not in 1.6 or earlier
My issue is about a bug appearing in the unreleased SciPy 1.7.0.dev0+9b9f2e8. This is appearing in the statsmodels pip-pre run.
Reproducing code example:
from numpy import array
from scipy import stats
f_obs = array([44, 74, 48, 24, 8, 2])
f_exp = array([43.93623589, 71.60431015, 52.20297275, 23.22015199, 7.16316534,
               1.59334208])
chi2 = stats.chisquare(f_obs, f_exp)
Error message:
> raise ValueError(msg)
E ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected
E frequencies to a relative tolerance of 1e-08, but the percent differences are:
E 0.0014010692380163618
Scipy/Numpy/Python version information:
About this issue
- State: closed
- Created 3 years ago
- Comments: 34 (20 by maintainers)
Your implementation is just incorrect. Even in the book that you cite, it deals with the truncation. I also think you are missing a (1-exp) term. See the section on the Gap test in Knuth’s Seminumerical Algorithms, probably the most authoritative source, for the correct formulae.
A guess based on a quick code check:
The skidmarks.gap_test function truncates the array of expected observations. So, AFAICS, egaps should not add up to one.
In statsmodels I had a similar problem. I truncated the expected counts, e.g. for Poisson; the truncation error was negligible before the change in scipy, but after the change it raised in some cases. My fix, AFAIR, was to add the missing probability to the last count, which also makes sense for the chisquare test even when the truncation is non-negligible.
Also, the chisquare test requires that the expected count in each cell is large enough, so relying on tiny cells is not good for p-values either.
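The fix described above (adding the missing probability to the last count) might look roughly like the sketch below. It is not the statsmodels code: the sample, the rate 2.0 and the cutoff `k_max` are made-up values for illustration, and the rate is treated as known rather than estimated, so no `ddof` correction is applied.

```python
import numpy as np
from scipy import stats

# Illustrative data only: a Poisson sample with a known rate (assumption).
rng = np.random.default_rng(0)
sample = rng.poisson(lam=2.0, size=200)

k_max = 6  # hypothetical cutoff; the last cell means "k_max or more"

# Observed counts, with everything >= k_max lumped into the last cell.
f_obs = np.bincount(np.minimum(sample, k_max), minlength=k_max + 1)

# Truncated probabilities for 0..k_max; these sum to slightly less than 1.
probs = stats.poisson.pmf(np.arange(k_max + 1), mu=2.0)

# Fold the truncated tail P(X > k_max) into the last cell so that the
# probabilities sum to 1 up to floating-point rounding.
probs[-1] += stats.poisson.sf(k_max, mu=2.0)

f_exp = probs * f_obs.sum()

# The totals now agree, so the sum check in chisquare should pass.
result = stats.chisquare(f_obs, f_exp)
print(result.statistic, result.pvalue)
```

With a small cutoff the last expected count can still be tiny, so the point about cell sizes applies here as well; merging more of the tail into one cell is one way to address it.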
Actually the first sums to 455751, the other sums to 455754, hence the error.
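To see how far apart the totals are before calling chisquare, a small check along these lines can help. `relative_sum_gap` is a hypothetical helper, not a SciPy function, and dividing by the smaller total is an assumption about how the check is computed (it does reproduce the figure in the error message above for the arrays from the report). The rescaling at the end is only one possible workaround, and whether it is statistically appropriate depends on why the totals differ.

```python
import numpy as np

def relative_sum_gap(f_obs, f_exp):
    """Hypothetical helper: relative gap between observed and expected totals."""
    obs_total = float(np.sum(f_obs))
    exp_total = float(np.sum(f_exp))
    # Assumption: the gap is measured relative to the smaller total.
    return abs(obs_total - exp_total) / min(obs_total, exp_total)

# Totals quoted in the comment above: the gap is roughly 6.6e-6,
# well above the 1e-08 tolerance named in the error message.
gap_totals = abs(455751 - 455754) / min(455751, 455754)

# Arrays from the original report.
f_obs = np.array([44, 74, 48, 24, 8, 2])
f_exp = np.array([43.93623589, 71.60431015, 52.20297275,
                  23.22015199, 7.16316534, 1.59334208])
gap = relative_sum_gap(f_obs, f_exp)
print(gap_totals, gap)

# One possible workaround: rescale the expected counts so the totals match.
f_exp_rescaled = f_exp * f_obs.sum() / f_exp.sum()
print(relative_sum_gap(f_obs, f_exp_rescaled))
```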