pandas: BUG: NonExistentTimeError when resampling series localized to 'America/Santiago'

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

s = pd.Series(0.0, pd.date_range('2001-10-14 02:00', periods=2, freq='1H', tz='America/Santiago'))
s.resample('30T').max()  # raises error for '2001-10-14 00:00', which is not in the index

Problem description

This shouldn't raise an error, as there is no ambiguity about how the resampling should be done. It also isn't clear why the error is raised for the '2001-10-14 00:00' timestamp, which is not in the index.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.8.8.final.0
python-bits : 64
OS : Darwin
OS-release : 20.2.0
Version : Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.2.3
numpy : 1.19.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.22
pytest : 6.2.2
hypothesis : None
sphinx : 3.5.2
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.21.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.0

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

@mroeschke To be clear, the issue here is that the period being resampled contains no non-existent times, so I'm not sure users would expect to need to specify any parameters. A non-existent time only comes into play because resample uses normalize internally to provide an anchor.

Perhaps the solution is that normalize should allow a nonexistent time parameter, and then in resample we specify a normalize(nonexistent='shift_forward') when finding the anchor?
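A rough sketch of what that suggestion could look like from user code. `normalize_shift_forward` is a hypothetical helper, not a pandas API; it assumes that dropping the timezone, normalizing the wall time, and re-localizing with `nonexistent='shift_forward'` is an acceptable stand-in for the proposed parameter:

```python
import pandas as pd


def normalize_shift_forward(ts: pd.Timestamp) -> pd.Timestamp:
    """Hypothetical helper: local midnight, or the next existing instant."""
    # Strip the timezone to get the local wall time, then floor it to midnight.
    naive_midnight = ts.tz_localize(None).normalize()
    # Re-attach the timezone; if midnight was skipped by a DST jump,
    # move forward to the first wall time that exists.
    return naive_midnight.tz_localize(ts.tz, nonexistent="shift_forward")


ts = pd.Timestamp("2001-10-14 02:00", tz="America/Santiago")
anchor = normalize_shift_forward(ts)
print(anchor)
```

For the timestamp in this report the helper should land on 01:00 local time, the first instant that exists on that day. It does not handle a midnight that is ambiguous rather than missing; that case would still raise.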

@AlexKirko My guess as to why America/Santiago is causing the problem is that their DST transition occurs at midnight local time, which breaks normalize. This appears to be a simpler repro:

pd.to_datetime('2001-10-14 02:00').tz_localize('America/Santiago').normalize()

In the resampling case, the resample logic calls .normalize() to find an anchor for the bins (we do this so that the result of resample('30T') doesn't depend on whether the first data point falls on a 30T boundary). Since there is no midnight on that day, it throws NonExistentTimeError. My guess is that there are many other places in the code that assume Timestamp.normalize is a safe call, but that assumption is incorrect given its current definition: "Normalize timestamp to midnight, preserving tz information". Perhaps the best solution would be to define the behavior normalize should exhibit when midnight does not exist in the local timezone.
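Until normalize has defined behavior here, one possible workaround (an assumption, not something proposed in this thread) is to resample in UTC, where every wall time exists, and convert back afterwards. This only matches a local-time resample when the bin width divides the UTC offset evenly, as with 30-minute bins; daily or coarser bins would be anchored to UTC midnight instead:

```python
import pandas as pd

# Same data as the report; "60min"/"30min" are used instead of "1H"/"30T"
# because those spellings are accepted across pandas versions.
idx = pd.date_range("2001-10-14 02:00", periods=2, freq="60min", tz="America/Santiago")
s = pd.Series(0.0, index=idx)

# Resample in UTC so no local-midnight anchor is needed, then convert back.
out = s.tz_convert("UTC").resample("30min").max().tz_convert("America/Santiago")
print(out)
```

The empty 02:30 bin comes back as NaN, which is what max() returns for a bin with no observations.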

#40817 may also be related; it also references Santiago specifically.