dateparser: Bad escape characters trigger an exception
Note: As a workaround for this issue, we have pinned regex, which makes Python 3.11 support either impossible or uncomfortable. The goal now is to remove that version pin on regex without making this issue resurface.
Hello everyone,
I tried parsing under Python 3.7.5 and 3.9:
dateparser.parse('12/12/12')
It also gives the same output for any “valid” input shown in the doc:
dateparser.parse('Fri, 12 Dec 2014 10:55:50')
dateparser.parse('22 Décembre 2010', date_formats=['%d %B %Y'])
...
Here’s the error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 dateparser.parse("12/12/12")
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\conf.py:92, in apply_settings.<locals>.wrapper(*args, **kwargs)
89 if not isinstance(kwargs['settings'], Settings):
90 raise TypeError("settings can only be either dict or instance of Settings class")
---> 92 return f(*args, **kwargs)
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\__init__.py:61, in parse(date_string, date_formats, languages, locales, region, settings, detect_languages_function)
57 if languages or locales or region or detect_languages_function or not settings._default:
58 parser = DateDataParser(languages=languages, locales=locales,
59 region=region, settings=settings, detect_languages_function=detect_languages_function)
---> 61 data = parser.get_date_data(date_string, date_formats)
63 if data:
64 return data['date_obj']
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:428, in DateDataParser.get_date_data(self, date_string, date_formats)
425 date_string = sanitize_date(date_string)
427 for locale in self._get_applicable_locales(date_string):
--> 428 parsed_date = _DateLocaleParser.parse(
429 locale, date_string, date_formats, settings=self._settings)
430 if parsed_date:
431 parsed_date['locale'] = locale.shortname
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:178, in _DateLocaleParser.parse(cls, locale, date_string, date_formats, settings)
175 @classmethod
176 def parse(cls, locale, date_string, date_formats=None, settings=None):
177 instance = cls(locale, date_string, date_formats, settings)
--> 178 return instance._parse()
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:182, in _DateLocaleParser._parse(self)
180 def _parse(self):
181 for parser_name in self._settings.PARSERS:
--> 182 date_data = self._parsers[parser_name]()
183 if self._is_valid_date_data(date_data):
184 return date_data
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:196, in _DateLocaleParser._try_freshness_parser(self)
194 def _try_freshness_parser(self):
195 try:
--> 196 return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
197 except (OverflowError, ValueError):
198 return None
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:234, in _DateLocaleParser._get_translated_date(self)
232 def _get_translated_date(self):
233 if self._translated_date is None:
--> 234 self._translated_date = self.locale.translate(
235 self.date_string, keep_formatting=False, settings=self._settings)
236 return self._translated_date
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:131, in Locale.translate(self, date_string, keep_formatting, settings)
128 dictionary = self._get_dictionary(settings)
129 date_string_tokens = dictionary.split(date_string, keep_formatting)
--> 131 relative_translations = self._get_relative_translations(settings=settings)
133 for i, word in enumerate(date_string_tokens):
134 word = word.lower()
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:158, in Locale._get_relative_translations(self, settings)
155 if settings.NORMALIZE:
156 if self._normalized_relative_translations is None:
157 self._normalized_relative_translations = (
--> 158 self._generate_relative_translations(normalize=True))
159 return self._normalized_relative_translations
160 else:
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:172, in Locale._generate_relative_translations(self, normalize)
170 value = list(map(normalize_unicode, value))
171 pattern = '|'.join(sorted(value, key=len, reverse=True))
--> 172 pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
173 pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
174 relative_dictionary[pattern] = key
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\regex.py:700, in _compile_replacement_helper(pattern, template)
695 break
696 if ch == "\\":
697 # '_compile_replacement' will return either an int group reference
698 # or a string literal. It returns items (plural) in order to handle
699 # a 2-character literal (an invalid escape sequence).
--> 700 is_group, items = _compile_replacement(source, pattern, is_unicode)
701 if is_group:
702 # It's a group, so first flush the literal.
703 if literal:
File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\_regex_core.py:1736, in _compile_replacement(source, pattern, is_unicode)
1733 if value is not None:
1734 return False, [value]
-> 1736 raise error("bad escape \\%s" % ch, source.string, source.pos)
1738 if isinstance(source.sep, bytes):
1739 octal_mask = 0xFF
error: bad escape \d at position 7
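The last frame comes down to a replacement template that contains \d. Below is a minimal sketch of that failure outside dateparser. It uses the standard-library re module as a stand-in for regex, on the assumption that both now reject the template for the same reason (re has treated unknown letter escapes in replacement templates as errors since Python 3.7). DIGIT_GROUP and try_template are hypothetical names, and the error position that re reports may differ from the one shown by regex.

```python
import re

# Hypothetical stand-in for dateparser's DIGIT_GROUP_PATTERN:
# it matches the literal three characters "\d+" inside a pattern string.
DIGIT_GROUP = re.compile(r'\\d\+')


def try_template(template, text=r'in (\d+) days'):
    """Return the substituted text, or the error message if the template is rejected."""
    try:
        return DIGIT_GROUP.sub(template, text)
    except re.error as exc:
        return 'error: {}'.format(exc)


# "\d" is not a valid escape inside a replacement template, so this is rejected.
failing = try_template(r'?P<n>\d+')

# A doubled backslash in the template produces one literal backslash in the output.
escaped = try_template(r'?P<n>\\d+')

# A callable replacement is inserted verbatim; no escape processing is applied.
via_function = try_template(lambda match: r'?P<n>\d+')

print(failing)
print(escaped)
print(via_function)
```

The two accepted spellings are general ways to get a literal backslash into the output; they are not necessarily the change that was made in dateparser.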
How to reproduce (environment: Windows 10):
- Fresh install of Python 3.7.5 or 3.9
- Make a simple Python file including these 2 lines:
import dateparser
dateparser.parse("12/12/12")
About this issue
- State: closed
- Created 2 years ago
- Reactions: 19
- Comments: 17 (4 by maintainers)
Commits related to this issue
- critical: fix dateparser/regex issue (scrapinghub/dateparser#1045) — committed to adbar/htmldate by adbar 2 years ago
- build(critical): :ambulance: fix dateparser/regex issue Introduced by version bump in regex Related to scrapinghub/dateparser#1045 — committed to aphp/edsnlp by bdura 2 years ago
- 0.3.42 1. fix gfinance exceptions 2. temporary fix for scrapinghub/dateparser#1045 3. readme update — committed to xiaopc/qdii-value by xiaopc 2 years ago
- pin regex version https://github.com/scrapinghub/dateparser/issues/1045 — committed to scrapinghub/dateparser by asadurski 2 years ago
- pin regex version (#1046) https://github.com/scrapinghub/dateparser/issues/1045 — committed to scrapinghub/dateparser by asadurski 2 years ago
- setup.cfg: avoid regex 2022.3.15 see https://github.com/scrapinghub/dateparser/issues/1045 — committed to duncanmmacleod/gwpy by duncanmmacleod 2 years ago
- dependencies: update the dependencies lock file * Fixes problem with regex in datetime: https://github.com/scrapinghub/dateparser/issues/1045 * Fixes flask_celerxext mappings for scheduler. Co-Autho... — committed to rerowep/rero-ils by rerowep 2 years ago
- Upgrade dateparse to 1.1.1 or later to fix a problem with its regex dependency See https://github.com/scrapinghub/dateparser/pull/1046, https://github.com/scrapinghub/dateparser/issues/1045. — committed to elemental-lf/benji by elemental-lf 2 years ago
- Upgrade requirements.txt (2022-03-19) In particular, this fixes an issue with an update of the regex package (https://github.com/scrapinghub/dateparser/issues/1045). — committed to Museum-Barberini/Barberini-Analytics by LinqLover 2 years ago
- dependencies: update the dependencies lock file * Fixes problem with regex in datetime: https://github.com/scrapinghub/dateparser/issues/1045 * Fixes flask_celerxext mappings for scheduler. Co-Autho... — committed to rero/rero-ils by rerowep 2 years ago
- setup.cfg: avoid regex 2022.3.15 see https://github.com/scrapinghub/dateparser/issues/1045 (cherry picked from commit 1b4b22dc2bad405d4880380e8e1b03b91e42f94a) — committed to duncanmmacleod/gwpy by duncanmmacleod 2 years ago
- Apply the fix from https://github.com/scrapinghub/dateparser/issues/1045#issuecomment-1129846022 — committed to asaggi-supportlogic/dateparser by asaggi-supportlogic 2 years ago
- Try fixing regex incompatibility Fixes https://github.com/scrapinghub/dateparser/issues/1052 Based on https://github.com/scrapinghub/dateparser/issues/1045#issuecomment-1129846022 — committed to goodspark/dateparser by goodspark 2 years ago
- python3Packages.dateparser: fix regex compatibility Patch by thmo (THomas Moschny) taken from https://github.com/scrapinghub/dateparser/issues/1045#issuecomment-1129846022. Also relaxes the regex ve... — committed to NixOS/nixpkgs by mweinelt 2 years ago
The dependency regex==2022.3.15 caused this; rolling back to regex==2022.1.18 will probably help.
Update: see this commit: https://github.com/mrabarnett/mrab-regex/commit/138970bafb3d6fbe0987632ee149c04e8b5acf95
I can confirm that deploying regex==2022.1.18 instead (through conda in my case) makes the bug disappear.
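To see whether an environment has one of the regex releases discussed here, a small check along these lines might help. It is only a sketch: the helper names are hypothetical, and treating every release from 2022.3.15 onward as affected is an assumption, since the thread names only 2022.3.15 as failing and 2022.1.18 as working.

```python
from importlib import metadata

# First release reported as failing in this thread.
# Assumption: later releases behave the same way.
FIRST_REPORTED_BAD = (2022, 3, 15)


def parse_calver(text):
    """Turn a 'YYYY.M.D' version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split('.')[:3])


def is_reported_bad(version_text):
    """True if the version is at or after the first release reported as failing."""
    return parse_calver(version_text) >= FIRST_REPORTED_BAD


def installed_regex_version():
    """Return the installed regex version string, or None if regex is not installed."""
    try:
        return metadata.version('regex')
    except metadata.PackageNotFoundError:
        return None


current = installed_regex_version()
if current is None:
    print('regex is not installed')
else:
    print(current, 'affected' if is_reported_bad(current) else 'not affected')
```

For example, is_reported_bad('2022.1.18') is False and is_reported_bad('2022.3.15') is True.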
Currently the issue is quick-fixed by pinning regex to an older version, which is not applicable in certain environments, e.g., with modules installed via RPMs.
Wouldn’t something like this fix the issue:
Based on this comment. Note that I’m not sure this is correct or complete, but judging from a run of the test suite together with regex-2022.3.15, it seems to work (besides some, in my opinion, unrelated things that are also broken with regex-2022.3.2).
Many thanks for the thorough investigation! For now I’ll make a quick fix by pinning the regex version, but in the long run we should follow @tducret’s suggestion (https://github.com/scrapinghub/dateparser/issues/1045#issuecomment-1069484011) and reform the regexes.
If anyone’s up for a PR with the fix, please go ahead!
Fine. Now expecting a new release on PyPI!
Independently arrived at the same solution as the PR; explanation for the bug here.
Thanks for the fix and for writing the library in the first place. This seems to me to be one of the best date parsing libraries; we use it for a lot of data imports. Hoping for a pip release soon as well. Keep up the good work 👍
Hi. I also faced the same problem (and thought it was a Mac M1 problem with the regex lib). It turns out to be related to the drop of Python 3.6 support in regex:
More info in mrabarnett/mrab-regex/issues/459
Here is a problematic pattern but there may be more?