pandas: BUG: pd.to_datetime does not assign UTC TZ to Unix Epoch Timestamps

  • [ X ] I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
from datetime import datetime

now = datetime.now()

df = pd.DataFrame({'ts_pd': [now.timestamp()], 'ts_py': [now.timestamp()]})
print(f"It is now: {now.timestamp()} seconds from Unix Epoch (00:00:00 UTC on 1 January 1970)")
df['ts_pd'] = pd.to_datetime(df['ts_pd'], unit='s')
df['ts_py'] = df['ts_py'].apply(datetime.fromtimestamp)
print(df)

# It is now: 1620338266.89658 seconds from Unix Epoch (00:00:00 UTC on 1 January 1970)
# ts_pd                           ts_py
# 2021-05-06 21:57:46  2021-05-06 17:57:46

Problem description

I would expect to_datetime to behave the same as fromtimestamp. That is, if returning a TZ-naive time, it should probably be a localtime. If parsing a unix epoch timestamp, we know the value is in UTC, so leaving it TZ-unaware is a footgun as I think a naive user would expect either (1) localtime or (2) to have UTC tzinfo attached to prevent mingling with true localtimes.

I encountered this problem because I was parsing timestamps from a couple of files. In one file were text-form timestamps, which were parsed as local-time. In the other were epoch-form timestamps, which were parsed as TZ-unaware UTC. If I unconditionally pass utc=True, now my text-form timestamps are mistakenly given UTC tzinfo. In the end I wrote a helper which only passes utc=True if the column is numerical.

This table shows the outputs and workaround:

str ts comment
0 pd.to_datetime(now.isoformat()) 2021-05-06 18:03:31.766199
1 datetime.fromtimestamp(now.timestamp()) 2021-05-06 18:03:31.766199
2 datetime.utcfromtimestamp(now.timestamp()) 2021-05-06 22:03:31.766199 This is OK, I specifically asked for UTC not localtime.
3 pd.to_datetime(now.timestamp(), unit=“s”) 2021-05-06 22:03:31.766199112 Not OK. I’ve now parsed a different time than if I had parsed isoformat() of the same time.
4 pd.to_datetime(now.timestamp(), unit=“s”, utc=True) 2021-05-06 22:03:31.766199112+00:00 OK, but should be default.

Expected Output

ts_pd ts_py
2021-05-06 21:53:42+00:00 2021-05-06 17:53:42

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.8.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_United States.1252

pandas : 1.0.5 numpy : 1.18.5 pytz : 2020.1 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.2.0.post20200714 Cython : 0.29.21 pytest : 5.4.3 hypothesis : None sphinx : 3.1.2 blosc : None feather : None xlsxwriter : 1.2.9 lxml.etree : 4.5.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.16.1 pandas_datareader: None bs4 : 4.9.1 bottleneck : 1.3.2 fastparquet : None gcsfs : None lxml.etree : 4.5.2 matplotlib : 3.2.2 numexpr : 2.7.1 odfpy : None openpyxl : 3.0.4 pandas_gbq : None pyarrow : None pytables : None pytest : 5.4.3 pyxlsb : None s3fs : None scipy : 1.5.0 sqlalchemy : 1.3.18 tables : 3.6.1 tabulate : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.9 numba : 0.50.1

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

I think I understand the confusion. I’m used to ignoring the origin parameter when it is “unix”. Could change it to lib.no_default, and if a user explicitly specifies it as unix then set utc=True.

Any new developments?

The internal of pandas.Timestamp is confusing and unpredicable.

If now is a standard epoch time (UTC), assert pd.Timestamp.fromtimestamp(now).timestamp() == now is False.

The Python module datetime assumes that the input timestamp is in standard UTC format and that the user is aware of this. However, pandas.Timestamp appears to use two different strategies for handling user input and storing data internally. It reads user input as a UTC timestamp and converts it to local time for output. But instead of storing that UTC timestamp in its internal data, it re-parses the local time as UTC and stores this value instead.

So, this accounts for why assert pd.Timestamp.fromtimestamp(now).timestamp() == now is False and why the result of subtracting them is the number of seconds in timezone offset.

If this is not going to be changed. What’s the best practice to use pandas.Timestamp? For now, it’s less error-prone if always explicitly specifying timezone.

<bump>