pandas: BUG: pd.to_datetime does not assign UTC TZ to Unix Epoch Timestamps
-
[ X ] I have checked that this issue has not already been reported.
-
[ X ] I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
from datetime import datetime
now = datetime.now()
df = pd.DataFrame({'ts_pd': [now.timestamp()], 'ts_py': [now.timestamp()]})
print(f"It is now: {now.timestamp()} seconds from Unix Epoch (00:00:00 UTC on 1 January 1970)")
df['ts_pd'] = pd.to_datetime(df['ts_pd'], unit='s')
df['ts_py'] = df['ts_py'].apply(datetime.fromtimestamp)
print(df)
# It is now: 1620338266.89658 seconds from Unix Epoch (00:00:00 UTC on 1 January 1970)
# ts_pd ts_py
# 2021-05-06 21:57:46 2021-05-06 17:57:46
Problem description
I would expect to_datetime
to behave the same as fromtimestamp
. That is, if returning a TZ-naive time, it should probably be a localtime. If parsing a unix epoch timestamp, we know the value is in UTC, so leaving it TZ-unaware is a footgun as I think a naive user would expect either (1) localtime or (2) to have UTC tzinfo attached to prevent mingling with true localtimes.
I encountered this problem because I was parsing timestamps from a couple of files. In one file were text-form timestamps, which were parsed as local-time. In the other were epoch-form timestamps, which were parsed as TZ-unaware UTC. If I unconditionally pass utc=True, now my text-form timestamps are mistakenly given UTC tzinfo. In the end I wrote a helper which only passes utc=True if the column is numerical.
This table shows the outputs and workaround:
str | ts | comment | |
---|---|---|---|
0 | pd.to_datetime(now.isoformat()) | 2021-05-06 18:03:31.766199 | |
1 | datetime.fromtimestamp(now.timestamp()) | 2021-05-06 18:03:31.766199 | |
2 | datetime.utcfromtimestamp(now.timestamp()) | 2021-05-06 22:03:31.766199 | This is OK, I specifically asked for UTC not localtime. |
3 | pd.to_datetime(now.timestamp(), unit=“s”) | 2021-05-06 22:03:31.766199112 | Not OK. I’ve now parsed a different time than if I had parsed isoformat() of the same time. |
4 | pd.to_datetime(now.timestamp(), unit=“s”, utc=True) | 2021-05-06 22:03:31.766199112+00:00 | OK, but should be default. |
Expected Output
ts_pd | ts_py |
---|---|
2021-05-06 21:53:42+00:00 | 2021-05-06 17:53:42 |
Output of pd.show_versions()
pandas : 1.0.5 numpy : 1.18.5 pytz : 2020.1 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.2.0.post20200714 Cython : 0.29.21 pytest : 5.4.3 hypothesis : None sphinx : 3.1.2 blosc : None feather : None xlsxwriter : 1.2.9 lxml.etree : 4.5.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.16.1 pandas_datareader: None bs4 : 4.9.1 bottleneck : 1.3.2 fastparquet : None gcsfs : None lxml.etree : 4.5.2 matplotlib : 3.2.2 numexpr : 2.7.1 odfpy : None openpyxl : 3.0.4 pandas_gbq : None pyarrow : None pytables : None pytest : 5.4.3 pyxlsb : None s3fs : None scipy : 1.5.0 sqlalchemy : 1.3.18 tables : 3.6.1 tabulate : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.9 numba : 0.50.1
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
I think I understand the confusion. I’m used to ignoring the origin parameter when it is “unix”. Could change it to lib.no_default, and if a user explicitly specifies it as unix then set utc=True.
Any new developments?
The internal of
pandas.Timestamp
is confusing and unpredicable.If
now
is a standard epoch time (UTC),assert pd.Timestamp.fromtimestamp(now).timestamp() == now
isFalse
.The Python module
datetime
assumes that the input timestamp is in standard UTC format and that the user is aware of this. However,pandas.Timestamp
appears to use two different strategies for handling user input and storing data internally. It reads user input as a UTC timestamp and converts it to local time for output. But instead of storing that UTC timestamp in its internal data, it re-parses the local time as UTC and stores this value instead.So, this accounts for why
assert pd.Timestamp.fromtimestamp(now).timestamp() == now
isFalse
and why the result of subtracting them is the number of seconds in timezone offset.If this is not going to be changed. What’s the best practice to use
pandas.Timestamp
? For now, it’s less error-prone if always explicitly specifying timezone.