pandas: Datetimes are inconsistently truncated in to_csv

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({
    'date': pd.to_datetime('1970-01-01'),
    'datetime': pd.date_range('1970-01-01', periods=5, freq='H')
})
df['date'].to_csv('/dev/stdout', index=False)
df[['date', 'datetime']].to_csv('/dev/stdout', index=False)
date
1970-01-01
1970-01-01
1970-01-01
1970-01-01
1970-01-01

date,datetime
1970-01-01 00:00:00,1970-01-01 00:00:00
1970-01-01 00:00:00,1970-01-01 01:00:00
1970-01-01 00:00:00,1970-01-01 02:00:00
1970-01-01 00:00:00,1970-01-01 03:00:00
1970-01-01 00:00:00,1970-01-01 04:00:00

Problem description

datetime64 series with all 00:00:00 time components are written out in inconsistent formats depending on the inclusion of time in other datetime64 series within the dataframe. I expected series with 00:00:00 times to be trimmed on a per-series basis, regardless of the inclusion of other Series with populated times. Per-series formatting would be better because the results become more independent and intuitive.

This makes testing more brittle because changes in one series (or adding a timed series to a df) have side effects for other series’ output.

I have output CSVs that are then loaded into another system that expects date types in some columns and datetimes in others, which then fails to load because the date only columns have 00:00:00 included. Currently, I track internal column metadata and .dt.strftime('%Y-%m-%d').map(lambda x: x != 'NaT' and x or None) the date columns to get around the formatting.

Expected Output

# df[['date', 'datetime']].to_csv('/dev/stdout', index=False)
date,datetime
1970-01-01,1970-01-01 00:00:00
1970-01-01,1970-01-01 01:00:00
1970-01-01,1970-01-01 02:00:00
1970-01-01,1970-01-01 03:00:00
1970-01-01,1970-01-01 04:00:00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None python: 3.6.5.final.0 python-bits: 64 OS: Darwin OS-release: 17.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.23.1 pytest: None pip: 10.0.1 setuptools: 39.1.0 Cython: None numpy: 1.14.3 scipy: None pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.7.3 pytz: 2018.4 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 2.2.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 1.0.1 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.8.1 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

Just to clarify OP’s ask, Pandas doesn’t not support a date data type. pd.to_datetime('1970-01-01') will return a datetime.

In [8]: df.dtypes
Out[8]:
date        datetime64[ns]
datetime    datetime64[ns]
dtype: object

So trying to determine whether a column has date vs datetime data based on a repr is fragile.

I don’t think this is a CSV issue, but that being said, maybe the datetime repr may be a bit confusing and or inconsistent between dates vs datetime and Series vs DataFrames respectively.