pandas: Inconsistent behavior of DatetimeIndex Partial String Indexing on Series and DataFrames
This bug report is related to this SO question and the discussion there.
Summary
I believe that the current DatetimeIndex Partial String Indexing behavior is either inconsistent or underdocumented, as the result depends nontrivially on whether we are working with a Series or a DataFrame and on whether the DatetimeIndex is periodic or not.
Series vs. DataFrame
import pandas as pd

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:00']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'numpy.int64'>
df = pd.DataFrame(series)
df["2016-12-07 09:00:00"]
KeyError: '2016-12-07 09:00:00'
Here we see that the behaviour depends on what we are indexing: Series returns a scalar while DataFrame raises an exception. This exception is consistent with the documentation notice:
Warning The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)
Why do we not get the same exception for a Series object?
Periodic vs. Non-periodic
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:01']))
# now it is not periodic due to 1 second in the last timestamp
print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>
In contrast with the previous example, we get an instance of Series here, so the same timestamp is treated as a slice rather than as a single key. Why does it depend in such a way on the periodicity of the index?
df = pd.DataFrame(series)
print(type(df["2016-12-07 09:00:00"]))
# <class 'pandas.core.frame.DataFrame'>
No exception here, in contrast with the periodic case.
Is this intended behavior? If so, I believe it should be clearly documented and the rationale provided.
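For comparison, here is a minimal sketch of selections whose intent is explicit. It reuses the periodic index from the first example and assumes that `.loc` with a `pd.Timestamp` label requests a single element, while a day-resolution string (less precise than the index) selects a slice. The variable names are illustrative only.

```python
import pandas as pd

index = pd.DatetimeIndex(['2016-12-07 09:00:00',
                          '2016-12-08 09:00:00',
                          '2016-12-09 09:00:00'])
series = pd.Series([1, 2, 3], index)
df = pd.DataFrame(series)

# An exact Timestamp label on this unique index asks for one element.
exact = series.loc[pd.Timestamp('2016-12-07 09:00:00')]

# A day-resolution string is less precise than the index,
# so it is treated as a slice on both Series and DataFrame.
day_series = series.loc['2016-12-07']
day_frame = df.loc['2016-12-07']

print(type(exact), type(day_series), type(day_frame))
```

This does not explain the inconsistency reported above; it only shows forms where the scalar-versus-slice intent does not depend on how the string is interpreted.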
Output of pd.show_versions()
pandas: 0.19.0+157.g2466ecb
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
About this issue
- State: closed
- Created 8 years ago
- Comments: 18 (17 by maintainers)
Commits related to this issue
- Fix inconsistency in Partial String Index with 'second' resolution See #14826. Now the following logic applies: - If timestamp resolution is strictly less precise than index resolution, timetamp is... — committed to ischurov/pandas by ischurov 8 years ago
- API/BUG: Fix inconsistency in Partial String Index with 'second' resolution Closes #14826 Fix inconsistency in Partial String Index with 'second' resolution. See #14826. Now if the timestamp and the... — committed to ShaharBental/pandas by ischurov 8 years ago
Finally, it seems that I got it. The code I’m interested in is the following:
Raising KeyError here means that the timestamp cannot be coerced to a slice. The condition basically says that if the resolution of the timestamp (which we remember from the string) is less precise than the resolution of the index, it is a slice; otherwise it is not. That sounds very reasonable. I'm not sure yet why second resolution uses a different condition (a non-strict inequality instead of a strict one), but basically I understand what's going on here. I'll try to improve the docs soon and prepare a PR that will refer to this issue.
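The rule described above can be sketched in plain Python. This is only an illustration of the comparison, not the pandas source: the `RESOLUTIONS` list and the `classify_key` function are hypothetical names, and the sketch models only the strict inequality, not the special case for second resolution.

```python
# Hypothetical ordering of resolutions, from least to most precise.
RESOLUTIONS = ["year", "month", "day", "hour", "minute", "second"]


def classify_key(string_resolution: str, index_resolution: str) -> str:
    """Classify a string key as a 'slice' or an 'exact' lookup.

    Strict rule: the key is a slice only if the string is strictly
    less precise than the index; otherwise it is an exact match.
    """
    string_rank = RESOLUTIONS.index(string_resolution)
    index_rank = RESOLUTIONS.index(index_resolution)
    return "slice" if string_rank < index_rank else "exact"


# A day-resolution string against an hour-resolution index: slice.
print(classify_key("day", "hour"))
# A second-resolution string against an hour-resolution index: exact match.
print(classify_key("second", "hour"))
# Equal resolutions: exact match under the strict rule.
print(classify_key("second", "second"))
```

Under the strict rule, equal resolutions give an exact match. The non-strict condition mentioned above for second resolution would instead classify the last case as a slice, which is the kind of difference this report is about.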