pandas: ENH: Standard Error of the Mean (sem) aggregation method

A very common operation when working with data is computing its error range. In scientific research, reporting error ranges is essentially required.

There are two main ways to do this: the standard deviation and the standard error of the mean (SEM). Pandas has an optimized std aggregation method for both DataFrame and groupby objects. However, it does not have an optimized standard error method, so users who want to compute error ranges have to rely on the unoptimized scipy method.

Since computing error ranges is such a common operation, I think it would be very useful if there were an optimized sem method, just as there is for std.
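For concreteness, here is a minimal sketch of what such a method could compute (the standalone sem helper below is hypothetical, not an existing pandas API); it is just the NaN-aware sample standard deviation divided by the square root of the number of non-NaN observations:

import numpy as np
import pandas as pd

def sem(obj, ddof=1):
    # Hypothetical helper, not a pandas method: standard error of the mean.
    # obj.std() skips NaNs and obj.count() counts only non-NaN values,
    # so missing data is handled the same way pandas' std handles it.
    return obj.std(ddof=ddof) / np.sqrt(obj.count())

df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
sem(df)  # one SEM per column, returned as a Series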

About this issue

  • State: closed
  • Created 10 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

I have also been at three different institutions, and they all used SEM as well. And I have seen it in hundreds of papers, presentations, and posters.

Every science institution I've ever worked in (just 3, really, so not a whole lot of weight there) has used SEM at some point (even if just to get a rough idea of error ranges). I see your point about different definitions; maybe other folks want to chime in.

@jreback I don't think this is code bloat relative to the alternative:

You can't really use scipy.stats.sem because it doesn't handle NaNs:

In [19]: from scipy.stats import sem

In [20]: df = DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])

In [21]: df
Out[21]:
        a       b       c
0  1.1658  0.2184 -2.0823
1  0.5625 -0.5034  0.7028
2 -0.8424  0.1333 -1.1065
3  0.9335 -0.6088  1.4308
4 -0.1027 -0.1888 -0.5816
5 -0.5202  0.3210 -0.9942
6 -0.8666  0.8711 -0.5691
7 -0.7701 -2.1855 -0.4302
8  1.0664 -1.2672  0.7117
9 -0.7530 -0.8466  0.0194

[10 rows x 3 columns]

In [22]: sem(df[df > 0])
Out[22]: array([ nan,  nan,  nan])

Okay, so let’s try it with scipy.stats.mstats.sem:

In [26]: from scipy.stats.mstats import sem as sem

In [27]: sem(df[df > 0])
Out[27]:
masked_array(data = [-- -- --],
             mask = [ True  True  True],
       fill_value = 1e+20)

That’s hardly what I would expect here, and masked arrays are almost as fun as recarrays. I’m +1 on reopening this.

Here’s what it would take to get the desired result from scipy:

In [32]: Series(sem(np.ma.masked_invalid(df[df > 0])), index=df.columns)
Out[32]:
a    0.1321
b    0.1662
c    0.2881
dtype: float64

In [33]: df[df > 0].std() / np.sqrt(df[df > 0].count())
Out[33]:
a    0.1321
b    0.1662
c    0.2881
dtype: float64

http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.sem.html
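For what it's worth, later SciPy releases (0.17+, if I remember correctly) added a nan_policy argument to stats.sem that handles the NaN case directly; the 0.13 docs linked above predate it:

from scipy import stats

# nan_policy='omit' drops NaNs before computing the SEM
# (only available in newer SciPy releases, not in 0.13)
stats.sem(df[df > 0], nan_policy='omit')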

@toddrjen What do you mean by an optimized method? std is optimized, so you don't have to rely on an 'unoptimized' scipy.stats method; you can just do: df.std() / np.sqrt(len(df))

And by the way, scipy.stats.sem is not that 'unoptimized'. In fact, it is even faster, since it does not do, e.g., the extra NaN-checking that pandas does:

In [2]: s = pd.Series(np.random.randn(1000))

In [7]: from scipy import stats

In [8]: stats.sem(s.values)
Out[8]: 0.031635197968083853

In [9]: s.std() / np.sqrt(len(s))
Out[9]: 0.031635197968083832

In [11]: %timeit stats.sem(s.values)
10000 loops, best of 3: 46.2 µs per loop

In [12]: %timeit s.std() / np.sqrt(len(s))
10000 loops, best of 3: 85.7 µs per loop

But of course, the question still remains: do we provide a shortcut to this functionality in the form of a sem method, or do we just expect our users to divide the std themselves?
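To make the trade-off concrete, here is the manual spelling users would have to write today next to the proposed shortcut (the sem calls are the proposed API, not something pandas has yet):

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['x', 'x', 'y', 'y'], 'val': [1.0, 2.0, 3.0, 5.0]})

# manual spelling today, for a whole frame and per group
df[['val']].std() / np.sqrt(df[['val']].count())
g = df.groupby('key')['val']
g.std() / np.sqrt(g.count())

# proposed shortcuts (hypothetical API):
# df[['val']].sem()
# g.sem()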