ciso8601: Invalid datetime can produce partial parse. Should return None instead

I came across this silent failure when parsing a third party datetime with an unusual separator:

>>> import ciso8601

>>> ciso8601.parse_datetime('2017-05-29 17:36:00Z')
datetime.datetime(2017, 5, 29, 17, 36, tzinfo=<UTC>)

#Now for a completely invalid string
>>> print(ciso8601.parse_datetime('Completely invalid!'))
None

#Now for an out of spec string (invalid separator)
>>> ciso8601.parse_datetime('2017-05-29_17:36:00Z')
datetime.datetime(2017, 5, 29, 0, 0)

It seems that the parser sees a character that it doesn’t recognize and stops parsing?

I would have expected that if the entire string cannot be parsed, it would return None as parse_datetime does for other parse failures, not silently return a datetime without time or tzinfo.

This lead to a sneaky bug in the code.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Most upvoted comments

I’m not sure what sort of PR you’d be expecting. If it were my decision (ie. my library), I think I would change it to return None for all invalid input.

If that’s what you’d like to see, I could take a shot at that. But it would be a change of behaviour, so anyone who has written code that relies on this behaviour would need to change their code.

As an example of usage that would change, this test would change to expect None

Thanks for this. Having both parse_datetime and parse_datetime_as_naive as you described sounds good to me!

I’ve made an attempt at making this change. I’ll be making a PR as soon as I’ve resolved anything that comes out of this discussion. You can see my initial comments and a [largely outdated] draft PR here.

v1.x problem

The problem with the current version of cio8601 is its error handling. In its current (v1.0.7) state:

  • All valid datetime strings within the supported subset of ISO 8601 will result in the correct Python datetime (this is good).
  • Some invalid timestamps will return None and others might get truncated and return an incorrect Python datetime.

As a developer with a given timestamp string, I cannot predict a priori what ciso8601 is going to return without looking at the code. Fundamentally, this is the problem that I’m hoping to address.

My v2 Invariant

In my mind, the best way to resolve this ambiguity (in the case of invalid timestamps) is to premise version 2 of ciso8601 on the following invariant:

parse_datetime(dt: String): datetime is a function that takes a string and either:

  • Returns a properly parsed Python datetime, if and only if that entire string conforms to the supported subset of ISO 8601
  • Raises an Exception (ex. ValueError) with a description of the reason why the string doesn’t conform to the supported subset of ISO 8601

My current work seems to indicate that this imposes a <10% performance penalty over v1.0.7 (but I’ll continue profiling to make sure there’s no more optimization to do).

Conflict with parse_datetime_unaware

parse_datetime_unaware existed for the case where your input timestamp had tzinfo, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters. parse_datetime already handles both tz-aware and tz-naive timestamps.

However, shortcircuiting the parse conflicts with the v2 invariant, as it is no longer checking the entirity of the string.

For example, if you had the invalid timestamp 2014-01-05T12:34:56ABCDEFG and ran it through parse_datetime_unaware, it would not raise a ValueError, since it would stop parsing after the :56. However, this timestamp is invalid and should raise a ValueError under the v2 invariant.

My response to this conflict was to drop parse_datetime_unaware. I’ve never found this use case to be compelling. Ignoring tzinfo on a tz-aware timestamp is almost never what you want to do (after all, there is a reason the tzinfo exists in the first place).

If this were truly what you desired as a developer, you could still acheive this functionality by explicitly dropping the tzinfo after the parse:

    dt = parse_datetime("2018-01-01T00:00:00+05:00").replace(tzinfo=None)

So the capability is still there. It just removes the performance boost for this case in exchange for a consistent interface.

@thomasst had the following comment:

I’d like to keep parse_datetime_unaware. It could either ignore the timezone information, or only accept timezone-less datetime strings. But if we’re strict here, should parse_datetime then reject datetime strings that don’t have a timezone?

I’ll now go through each line and rebut:

It could either ignore the timezone information

This breaks the invariant of matching the entire string, as discussed above.

or only accept timezone-less datetime strings

parse_datetime_unaware offers no value then. parse_datetime_unaware was only there for the case of better performance for timezone-aware timestamps. parse_datetime can already handle both tz-aware and tz-naive timestamps without issue.

But if we’re strict here, should parse_datetime then reject datetime strings that don’t have a timezone?

This would be a very bad idea. Both tz-aware and tz-naive timestamps produce valid datetimes. There is no distiction between them in the ISO 8601 spec.

Imagine that you made parse_datetime only parse tz-aware timestamps, and parse_datetime_unaware only parse tz-naive timestamps. This would then force the developer to know a priori which they are processing. If they were parsing a mixed stream of valid tz-aware and tz-naive timestamps, they would have to somehow implement a way of detecting the difference (by parsing the timestamps), which would defeat much of the purpose of using ciso8601.

Alternative Invariant

Instead of dropping parse_datetime_unaware, it is conceivable that we could come up with a different invariant that addresses the v1 problem. I have given it some thought and have yet come up with a more useful one than the one presented above.

As a compromise, I would consider renaming parse_datetime_unaware to something like parse_datetime_as_naive_unsafe (naive is the term that Python uses, not unaware) which would have a different invariant:

parse_datetime_as_naive_unsafe(datetime:str): datetime is a function that takes a string and either:

  • Returns a properly parsed Python datetime, if and only if all characters up to, but not including, any potential tzinfo characters conform to the supported subset of ISO 8601
  • Raise an Exception (ex. ValueError) with a description of the reason why the string doesn’t conform to the supported subset of ISO 8601

This alternative invariant would then be documented in the README. Something like

Note: that unlike parse_datetime this will not catch all invalid timestamps. It will not raise ValueError in the case of an invalid timestamp having trailing characters (ex. 2014-01-05T12:34:56ABCDEFG). Only use parse_datetime_as_naive_unsafe if you are certain your data doesn’t suffer from this data quality problem (or you don’t care if it does).

Note that the name change (parse_datetime_as_naive_unsafe) is done intentionally so that:

  1. Developers don’t assume that it is the default way to parse naive datetimes (that is parse_datetime)
  2. There is a recognition that this function is breaking the (nicer) invariant offering perfect safety from data quality errors.

Conclusion

I am completely open to suggestions for how to better address the issue.

Finally, obviously you are the maintainers and I’ll defer to your ultimate authority.

I would argue that returning None is wrong entirely. Raising an exception feels a lot more pythonic:

>>> import datetime
>>> datetime.datetime.strptime('junk', '%Y-%m-%dT%H:%M:%S.%fZ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_strptime.py", line 362, in _strptime
    (data_string, format))
ValueError: time data 'junk' does not match format '%Y-%m-%dT%H:%M:%S.%fZ'

I think being strict is the best approach (eg return None if you see anything ‘invalid’ whilst parsing the string), otherwise you do end up with bugs like the ones mentioned where you get partial dates or partial times if you are too liberal or return partial times.

Also as we found in https://github.com/closeio/ciso8601/pull/30, the cPython PyDatetime interface does accept invalid input in some cases which then causes errors when you manipulate badly parsed datetimes so we definitely need to ensure we have good checking in our C parsing code on our side.