pandas: Inconsistent date parsing of to_datetime

As this comes up regularly:

Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings). For example:

In [22]: pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[22]: DatetimeIndex(['2014-12-31', '2011-10-03'])

where the first is parsed as ‘dayfirst’ and the second not. Above particular case can be solved by passing dayfirst=True, but in any case, the default behaviour of to_datetime results in inconsistent parsing here. Another typical case if where you do provide dayfirst=True but have a malformed date, then it will fall back to ‘not dayfirst’ (https://github.com/pydata/pandas/issues/3341).

The reason

The reason is the flexibility of dateutil.parser.parse. This is in the first place also a feature, as it lets you parse messy datetime colums. But when it comes to flexibility regarding dayfirst/yearfirst, this gives unexpected results. So dateutil will ‘try’ to use dayfirst=True, but when this does not give a valid date, it will ignore dayfirst=True and parse the string anyway:

In [3]: dateutil.parser.parse('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)

In [4]: dateutil.parser.parse('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)

The same is true for dayfirst=False (the default) as this will also try with ‘day first’ if the other way around failed. The issue for the lack of strictness of dayfirst is actually https://github.com/pydata/pandas/issues/3341.

Possible solutions

  • In some cases passing dayfirst=True is sufficient (if you don’t have messy or malformed dates).
  • The typical answer is saying: "if you want to be sure to have strict and consistent parsing, provide a format argument. We should probably stress this more in the docstring of to_datetime

But, regardless of the two points above, fact is that the default behaviour of to_datetime (without passing any arguments) can do inconsistent parsing and so give unexpected results. Possible ways to have better default behaviour:

  • Have the possibility to specify that dateutil should be strict about dayfirst. I raised an issue for that here: https://github.com/dateutil/dateutil/issues/214
  • Changing the default of infer_datetime_format to True would solve it in some cases (where the first datetime has the correct format you want), but certainly not all.
  • The idea of raising a warning when inconsistent parsing happened, is AFAIK not possible as pandas does not know how dateutil parsed the dates. And always warning when no format is passed or could be inferred (so in practice, when dateutil is used), is maybe to drastically, as the flexible parsing is also a feature.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 4
  • Comments: 34 (24 by maintainers)

Most upvoted comments

I am new to Pandas and github - please be gentle.

If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don’t you think that explicitly coding how a date should be read is an unreasonable solution? It’s like importing integers and coding “randomly_scramble_my_number=False”. This error is enough to ban use of Pandas in some large organisations 😦

@erpda

There are 3000+ open issues in pandas. How should the handful of mostly volunteer folks prioritize? if large orgs want to just take and not help fix then I am not sure what to say.

open source is about finding, reporting, AND helping to fix. there are a many orgs and many many individual contributors to pandas and open source generally that have built an amazing eco system.

Would love for large orgs to step up and seriously fund open source. just taking and not giving back at all IMHO is terrible.

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

pandas is of course a completely volunteer project with almost 3000 issues, so statements like this

this is a very serious problem

are not very helpful to anyhow. pull-requests are the way to solve issues.

Agreeing with @garfieldthecat - this is a very serious problem - in that correctly formatted dd/mm datetimes will often / usually be garbled by pandas using the default setting, because pandas can and will reverse the dd/mm within a column, as it is guessing separately for each date. It’s very easy for the user to miss that this has happened.

I don’t think there’s any common use for the current behavior, where the guess is per-entry, so:

  • By default, the guessed format should be the same for each date in the series. I would personally prefer an incorrect guess of mm/dd for the first value, and the same incorrect guess for all subsequent values, to the current behavior.
  • The ideal would be to use a guess that gives a valid date for all dates in the series. This would involve back-tracking to switch days / months when the current date proves valid for dd/mm but not mm/dd. This would only be triggered for the dangerous cases, so even with the performance hit, I think that’s an improvement.
  • If anyone does have a use for switching formats mid-series, I guess this could be a keyword option. To me, that seems so pathological, that it’s reasonable to ask the user to write a custom function for that case.

Dear all,

It seems to_datetime is causing significant problem. Here are the codes I’ve ran. When compare the EARTHQUAKEDATE fields of the original and copied and to_datetime applied dataframes, I see the values are unfortunately different.

image

image image

Original is in DD/MM/YYYY hh:mm:ss format EARTHQUAKEDATE 05/10/1977 05:34:43

But modified is in YYYY/DD/MM hh:mm:ss format EARTHQUAKEDATE 1977-05-10 05:34:43

I’d would expect the result as YYYY-MM-DD hh:mm:ss.

I’d like to do anything I can for the solution.

you would do this in array_to_datetime

once a date is parsed with dayfirst or yearfirst then it would be an error to not parse others in the same way

The dayfirst parameter to the reading functions such as read_csv is even more confusing, as multiple datetime columns with different formats might co-exist in the same dataset.

I’ve put together a PDEP for this: https://github.com/pandas-dev/pandas/pull/48621

In addition to the core and triage teams, I’d encourage others who have commented here (@lordgrenville, @garfieldthecat, @paalge, @Craskermasker, @ahmetanildindar , @matthew-brett) to have a look - if you have concerns about this proposal, I’d appreciate your input

I’ve stumbled across this issue in working with our scientific dataset. Specifying the dayfirst argument to read_csv sorted us out. I would agree with the suggestion of printing a warning when inconsistent datetime parsing may be occurring. Can this be done after parsing by checking for inconsistent mapping from the string input to the datetime output across the whole series? Would this introduce a performance penalty?

I’m keen to contribute a fix but I’m a newbie to the pandas project.

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

I am not sure there is way to get from dateutil how (assuming which format) it parsed the date, so then it will also be difficult to warn for different formats.