pandas: DataFrame.interpolate() extrapolates over trailing missing data

See also the discussion at StackOverflow.

Linear interpolation on a series with missing data at the end of the array will overwrite trailing missing values with the last non-missing value. In effect, the function extrapolates rather than strictly interpolating.

Example:

import pandas as pd
import numpy as np

a = pd.Series([np.nan, 1, np.nan, 3, 4, np.nan])
a.interpolate()

Yields (note the extrapolated 4):

0   NaN
1     1
2     2
3     3
4     4
5     4
dtype: float64

not

0   NaN
1     1
2     2
3     3
4     4
5     NaN
dtype: float64

I believe the fix is something along the lines of changing lines 1545:1546 in core/common.py from

result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid])

to

result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid], np.nan, np.nan)
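To illustrate what the extra two arguments do, here is a minimal standalone sketch using only np.interp. The arrays xp, fp and x are made-up illustration data, not the variables from core/common.py. By default np.interp clamps to the first or last known value for points outside the known range; passing left=np.nan and right=np.nan makes it return NaN there instead.

```python
import numpy as np

# Illustrative data only (hypothetical, not taken from pandas internals).
xp = np.array([0.0, 2.0, 3.0])  # positions of the known values
fp = np.array([1.0, 3.0, 4.0])  # the known values
x = np.array([1.0, 4.0])        # positions to fill: one inside, one past the end

# Default behaviour: points beyond xp[-1] are clamped to fp[-1].
default = np.interp(x, xp, fp)

# With left/right set to NaN, points outside [xp[0], xp[-1]] stay missing.
strict = np.interp(x, xp, fp, left=np.nan, right=np.nan)

print(default)  # inside point interpolated, outside point clamped
print(strict)   # inside point interpolated, outside point NaN
```

The interior point is interpolated identically in both calls; only the point past the last known position differs.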

About this issue

  • State: closed
  • Created 10 years ago
  • Reactions: 2
  • Comments: 17 (14 by maintainers)

Most upvoted comments

This is definitely a bug. All new pandas users will find this behaviour confusing and error-prone (as I just did). If there is code that relies on this bug, that means there is a bug in that code as well. You should fix it. Interpolate means interpolate, not extrapolate in any way.

Given that the filling of the trailing values does not follow the specified method, but just forward fills, I think we could consider this a bug. However, it is still a bug that people could rely upon, so I am not sure whether we should just change the behaviour.
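For anyone who needs strict interpolation without waiting for a behaviour change, here is a hedged sketch of two possible workarounds. The first assumes a pandas version whose interpolate() accepts the limit_area parameter (available from 0.23.0 onward). The second avoids that parameter by masking everything after the last valid index. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A series with a leading and a trailing NaN.
s = pd.Series([np.nan, 1, np.nan, 3, 4, np.nan])

# Workaround 1 (assumes limit_area is available): only fill NaNs
# that are surrounded by valid values, leaving trailing NaNs alone.
inside = s.interpolate(limit_area="inside")

# Workaround 2: interpolate as usual, then put NaN back after the
# last valid position. Note: last_valid_index() returns None for an
# all-NaN series, which this sketch does not handle.
filled = s.interpolate()
last = s.last_valid_index()
masked = filled.where(s.index <= last)

print(inside)
print(masked)
```

Both variants leave the trailing NaN in place while still filling the gap at index 2.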

yeah it looks like a typo; this change is in 0.23

would love a PR to update!