ciso8601: Invalid datetime can produce partial parse. Should return None instead
I came across this silent failure when parsing a third party datetime with an unusual separator:
>>> import ciso8601
>>> ciso8601.parse_datetime('2017-05-29 17:36:00Z')
datetime.datetime(2017, 5, 29, 17, 36, tzinfo=<UTC>)
#Now for a completely invalid string
>>> print(ciso8601.parse_datetime('Completely invalid!'))
None
#Now for an out of spec string (invalid separator)
>>> ciso8601.parse_datetime('2017-05-29_17:36:00Z')
datetime.datetime(2017, 5, 29, 0, 0)
It seems that the parser sees a character that it doesn’t recognize and stops parsing?
I would have expected that if the entire string cannot be parsed, it would return None as parse_datetime does for other parse failures, not silently return a datetime without time or tzinfo.
This lead to a sneaky bug in the code.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 15 (8 by maintainers)
I’m not sure what sort of PR you’d be expecting. If it were my decision (ie. my library), I think I would change it to return
Nonefor all invalid input.If that’s what you’d like to see, I could take a shot at that. But it would be a change of behaviour, so anyone who has written code that relies on this behaviour would need to change their code.
As an example of usage that would change, this test would change to expect
NoneThanks for this. Having both
parse_datetimeandparse_datetime_as_naiveas you described sounds good to me!I’ve made an attempt at making this change. I’ll be making a PR as soon as I’ve resolved anything that comes out of this discussion. You can see my initial comments and a [largely outdated] draft PR here.
v1.x problem
The problem with the current version of
cio8601is its error handling. In its current (v1.0.7) state:Noneand others might get truncated and return an incorrect Python datetime.As a developer with a given timestamp string, I cannot predict a priori what
ciso8601is going to return without looking at the code. Fundamentally, this is the problem that I’m hoping to address.My v2 Invariant
In my mind, the best way to resolve this ambiguity (in the case of invalid timestamps) is to premise version 2 of
ciso8601on the following invariant:parse_datetime(dt: String): datetimeis a function that takes a string and either:ValueError) with a description of the reason why the string doesn’t conform to the supported subset of ISO 8601My current work seems to indicate that this imposes a <10% performance penalty over v1.0.7 (but I’ll continue profiling to make sure there’s no more optimization to do).
Conflict with
parse_datetime_unawareparse_datetime_unawareexisted for the case where your input timestamp hadtzinfo, but you wanted to ignore the tzinfo and therefore could save some cycles by not parsing the tzinfo characters.parse_datetimealready handles both tz-aware and tz-naive timestamps.However, shortcircuiting the parse conflicts with the v2 invariant, as it is no longer checking the entirity of the string.
For example, if you had the invalid timestamp
2014-01-05T12:34:56ABCDEFGand ran it throughparse_datetime_unaware, it would not raise aValueError, since it would stop parsing after the:56. However, this timestamp is invalid and should raise aValueErrorunder the v2 invariant.My response to this conflict was to drop
parse_datetime_unaware. I’ve never found this use case to be compelling. Ignoring tzinfo on a tz-aware timestamp is almost never what you want to do (after all, there is a reason the tzinfo exists in the first place).If this were truly what you desired as a developer, you could still acheive this functionality by explicitly dropping the tzinfo after the parse:
So the capability is still there. It just removes the performance boost for this case in exchange for a consistent interface.
@thomasst had the following comment:
I’ll now go through each line and rebut:
This breaks the invariant of matching the entire string, as discussed above.
parse_datetime_unawareoffers no value then.parse_datetime_unawarewas only there for the case of better performance for timezone-aware timestamps.parse_datetimecan already handle both tz-aware and tz-naive timestamps without issue.This would be a very bad idea. Both tz-aware and tz-naive timestamps produce valid datetimes. There is no distiction between them in the ISO 8601 spec.
Imagine that you made
parse_datetimeonly parse tz-aware timestamps, andparse_datetime_unawareonly parse tz-naive timestamps. This would then force the developer to know a priori which they are processing. If they were parsing a mixed stream of valid tz-aware and tz-naive timestamps, they would have to somehow implement a way of detecting the difference (by parsing the timestamps), which would defeat much of the purpose of usingciso8601.Alternative Invariant
Instead of dropping
parse_datetime_unaware, it is conceivable that we could come up with a different invariant that addresses the v1 problem. I have given it some thought and have yet come up with a more useful one than the one presented above.As a compromise, I would consider renaming
parse_datetime_unawareto something likeparse_datetime_as_naive_unsafe(naive is the term that Python uses, not unaware) which would have a different invariant:parse_datetime_as_naive_unsafe(datetime:str): datetimeis a function that takes a string and either:ValueError) with a description of the reason why the string doesn’t conform to the supported subset of ISO 8601This alternative invariant would then be documented in the README. Something like
Note that the name change (
parse_datetime_as_naive_unsafe) is done intentionally so that:parse_datetime)Conclusion
I am completely open to suggestions for how to better address the issue.
Finally, obviously you are the maintainers and I’ll defer to your ultimate authority.
I would argue that returning
Noneis wrong entirely. Raising an exception feels a lot more pythonic:I think being strict is the best approach (eg return None if you see anything ‘invalid’ whilst parsing the string), otherwise you do end up with bugs like the ones mentioned where you get partial dates or partial times if you are too liberal or return partial times.
Also as we found in https://github.com/closeio/ciso8601/pull/30, the cPython PyDatetime interface does accept invalid input in some cases which then causes errors when you manipulate badly parsed datetimes so we definitely need to ensure we have good checking in our C parsing code on our side.