pandas: Inconsistent behavior of DatetimeIndex Partial String Indexing on Series and DataFrames

This bug report is related to this SO question and the discussion there.

Summary

I believe the current DatetimeIndex Partial String Indexing behavior is either inconsistent or underdocumented: the result depends nontrivially on whether we are indexing a Series or a DataFrame, and on whether the DatetimeIndex is periodic or not.

`Series` vs. `DataFrame`

import pandas as pd

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:00']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'numpy.int64'>

df = pd.DataFrame(series)
df["2016-12-07 09:00:00"]

KeyError: '2016-12-07 09:00:00'

Here the behavior depends on what we are indexing: the Series returns a scalar, while the DataFrame raises an exception. The exception is consistent with this warning in the documentation:

Warning The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)

Why do we not get the same exception for a Series object?

Periodic vs. Non-periodic

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:01']))
# now it is not periodic due to 1 second in the last timestamp

print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>

In contrast with the previous example, we get a Series instance here, so the same timestamp string is treated as a slice rather than as an exact index label. Why does the result depend on the periodicity of the index in this way?

df = pd.DataFrame(series)
print(type(df["2016-12-07 09:00:00"]))
# <class 'pandas.core.frame.DataFrame'>

No exception is raised here, in contrast with the periodic case.

Is this the intended behavior? If so, I believe it should be clearly documented, with the rationale provided.
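For readers who need predictable results regardless of how a bare string key is interpreted, here is a minimal sketch that states the intent explicitly. It is not part of the original report: it assumes that `.loc` with a `pd.Timestamp` performs an exact label lookup and that `.loc` with an explicit start:end slice always returns a container, and the variable names `exact`, `day` and `day_df` are only illustrative.

```python
import pandas as pd

index = pd.DatetimeIndex(['2016-12-07 09:00:00',
                          '2016-12-08 09:00:00',
                          '2016-12-09 09:00:01'])
series = pd.Series([1, 2, 3], index)
df = pd.DataFrame(series)

# Exact lookup: a Timestamp key is matched as a label, never as a slice.
exact = series.loc[pd.Timestamp('2016-12-07 09:00:00')]

# Explicit slice: both endpoints are given, so the result is always a
# Series (or DataFrame), even when only one row matches.
day = series.loc['2016-12-07':'2016-12-07']
day_df = df.loc['2016-12-07':'2016-12-07']

print(exact)
print(day)
print(day_df)
```

With this spelling the Series and the DataFrame are indexed the same way, so the question of whether a string resolves to a label or to a slice does not arise.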

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+157.g2466ecb
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

About this issue

State: closed
Created 8 years ago
Comments: 18 (17 by maintainers)

Commits related to this issue

Fix inconsistency in Partial String Index with 'second' resolution See #14826. Now the following logic applies: - If timestamp resolution is strictly less precise than index resolution, timetamp is... — committed to ischurov/pandas by ischurov 8 years ago
API/BUG: Fix inconsistency in Partial String Index with 'second' resolution Closes #14826 Fix inconsistency in Partial String Index with 'second' resolution. See #14826. Now if the timestamp and the... — committed to ShaharBental/pandas by ischurov 8 years ago

Most upvoted comments

Finally, it seems that I got it. The code I’m interested in is the following:

    def _partial_date_slice(self, reso, parsed, use_lhs=True, use_rhs=True):
        is_monotonic = self.is_monotonic
        if ((reso in ['day', 'hour', 'minute'] and
             not (self._resolution < Resolution.get_reso(reso) or
                  not is_monotonic)) or
            (reso == 'second' and
             not (self._resolution <= Resolution.RESO_SEC or
                  not is_monotonic))):
            # These resolution/monotonicity validations came from GH3931,
            # GH3452 and GH2369.
            raise KeyError

Raising KeyError here means that the timestamp cannot be coerced to a slice. The condition basically says that if the resolution of the timestamp string (which is remembered from parsing) is less precise than the resolution of the index, the string is treated as a slice; otherwise it is not. That sounds very reasonable.
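The index resolution that this check consults can be probed from user code. The sketch below assumes that the public `DatetimeIndex.resolution` property is available in your pandas version and reflects the same information as the internal `self._resolution`; the names `regular` and `shifted` are just labels for the two indexes used earlier in this issue.

```python
import pandas as pd

# Index from the first example: every timestamp falls exactly on 09:00:00.
regular = pd.DatetimeIndex(['2016-12-07 09:00:00',
                            '2016-12-08 09:00:00',
                            '2016-12-09 09:00:00'])

# Index from the second example: the last timestamp has a nonzero second.
shifted = pd.DatetimeIndex(['2016-12-07 09:00:00',
                            '2016-12-08 09:00:00',
                            '2016-12-09 09:00:01'])

# The two indexes report different resolutions, which is what changes
# the outcome of the same string lookup.
print(regular.resolution)
print(shifted.resolution)
```

If the two printed values differ, that would suggest the deciding factor is the resolution of the index rather than its periodicity as such.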

I’m not sure yet why the 'second' resolution uses a different condition (a non-strict inequality instead of a strict one), but basically I understand what’s going on here.
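To make the rule concrete, here is a simplified pure-Python model of the quoted condition. It is a sketch, not the pandas implementation. The `RANK` table, the helpers `string_resolution` and `index_resolution`, and the function `treated_as_slice` are all hypothetical names introduced for illustration. Sub-second resolutions are ignored, and the string resolution is guessed from the number of fields in the string.

```python
import re
from datetime import datetime

# Hypothetical ranking, coarsest first; a larger rank means "more precise".
RANK = {'year': 0, 'month': 1, 'day': 2, 'hour': 3, 'minute': 4, 'second': 5}
NAMES = list(RANK)


def string_resolution(text):
    """Guess the resolution of a timestamp string from its number of fields."""
    fields = [f for f in re.split(r'[-: T]', text.strip()) if f]
    return NAMES[min(len(fields), len(NAMES)) - 1]


def index_resolution(stamps):
    """Coarsest unit (no coarser than 'day') that represents every stamp."""
    if any(s.second for s in stamps):
        return 'second'
    if any(s.minute for s in stamps):
        return 'minute'
    if any(s.hour for s in stamps):
        return 'hour'
    return 'day'


def treated_as_slice(text, stamps):
    """Model of the quoted check: True means slice, False means exact lookup."""
    reso = string_resolution(text)
    index_reso = index_resolution(stamps)
    monotonic = all(a <= b for a, b in zip(stamps, stamps[1:]))
    if not monotonic or reso in ('year', 'month'):
        # The quoted condition never raises KeyError in these cases.
        return True
    if reso == 'second':
        # Non-strict comparison, mirroring the special case for 'second'.
        return RANK[index_reso] >= RANK['second']
    # Strict comparison: the index must be more precise than the string.
    return RANK[index_reso] > RANK[reso]


fmt = '%Y-%m-%d %H:%M:%S'
regular = [datetime.strptime(s, fmt) for s in ('2016-12-07 09:00:00',
                                               '2016-12-08 09:00:00',
                                               '2016-12-09 09:00:00')]
shifted = regular[:2] + [datetime(2016, 12, 9, 9, 0, 1)]

print(treated_as_slice('2016-12-07 09:00:00', regular))  # False
print(treated_as_slice('2016-12-07 09:00:00', shifted))  # True
print(treated_as_slice('2016-12-07', regular))           # True
```

Under this model the first index has 'hour' resolution, so a second-resolution string is an exact lookup (a scalar for a Series, a KeyError for a DataFrame), while the second index has 'second' resolution and the same string becomes a slice because of the non-strict comparison. That matches the two examples at the top of this issue.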

I’ll try to improve the docs soon and prepare a PR that refers to this issue.

ischurov on Dec 9, 2016