scipy: Why doesn't “beta.fit” come out right?Please take a look at it for me

distribution_type = getattr(stats, 'beta')  # 共轭分布为beta分布,参考《Introduction to Empirical Bayes》
observed = pd.read_excel('retention.xlsx')
print(observed['and_retention'].values)
a, b = distribution_type.fit(observed['and_retention'].values)[:2]
print(a, b, sep='\n')
ax = plt.subplot(111)
ax.plot(np.linspace(0, 1, 100), stats.beta.pdf(np.linspace(0, 1, 100), a, b))
ax.hist(observed['and_retention'].values, alpha=0.75, color='green', bins=104)
plt.show()

“print” out:

[0.294 0.2955 0.235 0.2536 0.2423 0.2844 0.2099 0.2355 0.2946 0.3388 0.2202 0.2523 0.2209 0.2707 0.1885 0.2414 0.2846 0.328 0.2265 0.2563 0.2345 0.2845 0.1787 0.2392 0.2777 0.3076 0.2108 0.2477 0.234 0.2696 0.1839 0.2344 0.2872 0.3224 0.2152 0.2593 0.2295 0.2702 0.1876 0.2331 0.2809 0.3316 0.2099 0.2814 0.2174 0.2516 0.2029 0.2282 0.2697 0.3424 0.2259 0.2626 0.2187 0.2502 0.2161 0.2194 0.2628 0.3296 0.2323 0.2557 0.2215 0.2383 0.2166 0.2315 0.2757 0.3163 0.2311 0.2479 0.2199 0.2418 0.1938 0.2394 0.2718 0.3297 0.2346 0.2523 0.2262 0.2481 0.2118 0.241 0.271 0.3525 0.2323 0.2513 0.2313 0.2476 0.232 0.2295 0.2645 0.3386 0.2334 0.2631 0.226 0.2603 0.2334 0.2375 0.2744 0.3491 0.2052 0.2473 0.228 0.2448 0.2189 0.2149] 6.056697373013153 409078.57804704335

The α and β is out of whack(α=6.056697373013153,β=409078.57804704335) The fitting image is also unreasonable. The fitted beta distribution is almost the same as the straight line

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

The beta distribution has four parameters: alpha, beta, location and scale. In the line

a, b = distribution_type.fit(observed['and_retention'].values)[:2]

you discarded the location and scale parameters. When you computed the PDF with the expression

stats.beta.pdf(np.linspace(0, 1, 100), a, b)

you did not specify the location and scale parameters, so the default values of 0 and 1 (respectively) were used. So it is no surprise that the plot of the PDF does not match the histogram.

Try changing the fit call to

a, b, loc, scale = distribution_type.fit(observed['and_retention'].values)

and then change the computation of the PDF to use

stats.beta.pdf(np.linspace(0, 1, 100), a, b, loc, scale)

A separate problem is that, in order for the histogram’s vertical scale to match the PDF, you need to use the argument density=True:

ax.hist(observed, alpha=0.75, color='green', bins=104, density=True)

I suspect you’ll want to follow the alternative suggestion that I made in my second comment: use the standard beta distribution with location 0 and scale 1. Here’s a modified version of your script that does that:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


observed = [0.294, 0.2955, 0.235, 0.2536, 0.2423, 0.2844, 0.2099, 0.2355, 0.2946, 0.3388,
            0.2202, 0.2523, 0.2209, 0.2707, 0.1885, 0.2414, 0.2846, 0.328, 0.2265, 0.2563,
            0.2345, 0.2845, 0.1787, 0.2392, 0.2777, 0.3076, 0.2108, 0.2477, 0.234, 0.2696,
            0.1839, 0.2344, 0.2872, 0.3224, 0.2152, 0.2593, 0.2295, 0.2702, 0.1876, 0.2331,
            0.2809, 0.3316, 0.2099, 0.2814, 0.2174, 0.2516, 0.2029, 0.2282, 0.2697, 0.3424,
            0.2259, 0.2626, 0.2187, 0.2502, 0.2161, 0.2194, 0.2628, 0.3296, 0.2323, 0.2557,
            0.2215, 0.2383, 0.2166, 0.2315, 0.2757, 0.3163, 0.2311, 0.2479, 0.2199, 0.2418,
            0.1938, 0.2394, 0.2718, 0.3297, 0.2346, 0.2523, 0.2262, 0.2481, 0.2118, 0.241,
            0.271, 0.3525, 0.2323, 0.2513, 0.2313, 0.2476, 0.232, 0.2295, 0.2645, 0.3386,
            0.2334, 0.2631, 0.226, 0.2603, 0.2334, 0.2375, 0.2744, 0.3491, 0.2052, 0.2473,
            0.228, 0.2448, 0.2189, 0.2149]

a, b = stats.beta.fit(observed, floc=0, fscale=1)[:2]

print("a =", a, "  b =", b)

ax = plt.subplot(111)
ax.plot(np.linspace(0, 1, 100), stats.beta.pdf(np.linspace(0, 1, 100), a, b), 'k')
ax.hist(observed, alpha=0.75, color='green', bins=104, density=True)
plt.show()

The script prints

a = 33.26401059422594   b = 99.0180817184922

and generates the following plot:

figure_1