dateparser: Bad escape characters trigger an exception

Note: As a workaround for this issue, we have pinned regex, which makes Python 3.11 support either impossible or uncomfortable. The goal now is to remove that version pin on regex without making this issue resurface.

Hello everyone,

I tried parsing under Python 3.7.5 and 3.9:

dateparser.parse('12/12/12')

The same error is raised for any “valid” input shown in the docs:

dateparser.parse('Fri, 12 Dec 2014 10:55:50')
dateparser.parse('22 Décembre 2010', date_formats=['%d %B %Y'])
...

Here’s the error:


---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [46], in <cell line: 1>()
----> 1 dateparser.parse("12/12/12")

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\conf.py:92, in apply_settings.<locals>.wrapper(*args, **kwargs)
     89 if not isinstance(kwargs['settings'], Settings):
     90     raise TypeError("settings can only be either dict or instance of Settings class")
---> 92 return f(*args, **kwargs)

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\__init__.py:61, in parse(date_string, date_formats, languages, locales, region, settings, detect_languages_function)
     57 if languages or locales or region or detect_languages_function or not settings._default:
     58     parser = DateDataParser(languages=languages, locales=locales,
     59                             region=region, settings=settings, detect_languages_function=detect_languages_function)
---> 61 data = parser.get_date_data(date_string, date_formats)
     63 if data:
     64     return data['date_obj']

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:428, in DateDataParser.get_date_data(self, date_string, date_formats)
    425 date_string = sanitize_date(date_string)
    427 for locale in self._get_applicable_locales(date_string):
--> 428     parsed_date = _DateLocaleParser.parse(
    429         locale, date_string, date_formats, settings=self._settings)
    430     if parsed_date:
    431         parsed_date['locale'] = locale.shortname

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:178, in _DateLocaleParser.parse(cls, locale, date_string, date_formats, settings)
    175 @classmethod
    176 def parse(cls, locale, date_string, date_formats=None, settings=None):
    177     instance = cls(locale, date_string, date_formats, settings)
--> 178     return instance._parse()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:182, in _DateLocaleParser._parse(self)
    180 def _parse(self):
    181     for parser_name in self._settings.PARSERS:
--> 182         date_data = self._parsers[parser_name]()
    183         if self._is_valid_date_data(date_data):
    184             return date_data

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:196, in _DateLocaleParser._try_freshness_parser(self)
    194 def _try_freshness_parser(self):
    195     try:
--> 196         return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
    197     except (OverflowError, ValueError):
    198         return None

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\date.py:234, in _DateLocaleParser._get_translated_date(self)
    232 def _get_translated_date(self):
    233     if self._translated_date is None:
--> 234         self._translated_date = self.locale.translate(
    235             self.date_string, keep_formatting=False, settings=self._settings)
    236     return self._translated_date

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:131, in Locale.translate(self, date_string, keep_formatting, settings)
    128 dictionary = self._get_dictionary(settings)
    129 date_string_tokens = dictionary.split(date_string, keep_formatting)
--> 131 relative_translations = self._get_relative_translations(settings=settings)
    133 for i, word in enumerate(date_string_tokens):
    134     word = word.lower()

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:158, in Locale._get_relative_translations(self, settings)
    155 if settings.NORMALIZE:
    156     if self._normalized_relative_translations is None:
    157         self._normalized_relative_translations = (
--> 158             self._generate_relative_translations(normalize=True))
    159     return self._normalized_relative_translations
    160 else:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\dateparser\languages\locale.py:172, in Locale._generate_relative_translations(self, normalize)
    170     value = list(map(normalize_unicode, value))
    171 pattern = '|'.join(sorted(value, key=len, reverse=True))
--> 172 pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
    173 pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
    174 relative_dictionary[pattern] = key

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\regex.py:700, in _compile_replacement_helper(pattern, template)
    695     break
    696 if ch == "\\":
    697     # '_compile_replacement' will return either an int group reference
    698     # or a string literal. It returns items (plural) in order to handle
    699     # a 2-character literal (an invalid escape sequence).
--> 700     is_group, items = _compile_replacement(source, pattern, is_unicode)
    701     if is_group:
    702         # It's a group, so first flush the literal.
    703         if literal:

File c:\users\strey\appdata\local\programs\python\python39\lib\site-packages\regex\_regex_core.py:1736, in _compile_replacement(source, pattern, is_unicode)
   1733         if value is not None:
   1734             return False, [value]
-> 1736     raise error("bad escape \\%s" % ch, source.string, source.pos)
   1738 if isinstance(source.sep, bytes):
   1739     octal_mask = 0xFF

error: bad escape \d at position 7

How to reproduce (environment: Windows 10):

  • Fresh install of Python 3.7.5 or 3.9
  • Make a simple Python file containing these 2 lines:
import dateparser
dateparser.parse("12/12/12")

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 19
  • Comments: 17 (4 by maintainers)

Most upvoted comments

The dependency regex==2022.3.15 probably caused this; rolling back to regex==2022.1.18 may help.

Update: see this commit: https://github.com/mrabarnett/mrab-regex/commit/138970bafb3d6fbe0987632ee149c04e8b5acf95

I can confirm that deploying regex==2022.1.18 instead (through conda in my case) makes the bug disappear.
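To see which side of the reported boundary an environment is on, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, the cut-off is taken from the two versions mentioned in the comments above (2022.3.15 reported as failing, 2022.1.18 reported as working), and it assumes the version string starts with three numeric components.

```python
from importlib import metadata  # available on Python 3.8+

# Assumption: first release reported as failing in this thread.
FIRST_REPORTED_BAD = (2022, 3, 15)


def parse_version(text):
    """Turn 'YYYY.M.D' into a tuple of ints so comparison is numeric."""
    return tuple(int(part) for part in text.split('.')[:3])


def regex_may_be_affected(version_text):
    """True if the version is at or after the first reported bad release."""
    return parse_version(version_text) >= FIRST_REPORTED_BAD


def installed_regex_version():
    """Return the installed regex version string, or None if not installed."""
    try:
        return metadata.version('regex')
    except metadata.PackageNotFoundError:
        return None


version = installed_regex_version()
if version is None:
    print('regex is not installed')
else:
    print(version, 'may be affected' if regex_may_be_affected(version) else 'predates the report')
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would sort 2022.10.x before 2022.3.x.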

Currently the issue is quick-fixed by pinning regex to an older version, which is not applicable in certain environments, e.g., with modules installed via RPMs.

Wouldn’t something like this fix the issue?

--- a/dateparser/languages/locale.py
+++ b/dateparser/languages/locale.py
@@ -169,7 +169,7 @@ class Locale:
             if normalize:
                 value = list(map(normalize_unicode, value))
             pattern = '|'.join(sorted(value, key=len, reverse=True))
-            pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\d+', pattern)
+            pattern = pattern.replace(r'\d+', r'?P<n>\d+')
             pattern = re.compile(r'^(?:{})$'.format(pattern), re.UNICODE | re.IGNORECASE)
             relative_dictionary[pattern] = key
         return relative_dictionary

Based on this comment. Note that I’m not sure this is correct or complete, but judging by a run of the test suite together with regex-2022.3.15, it seems to work (besides some IMHO unrelated things, which are also broken with regex-2022.3.2).
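The idea behind the patch above is that a plain string replacement never interprets backslash escapes, so the replacement text is inserted as written. A rough sketch of that idea, using the standard re module and a stand-in for DIGIT_GROUP_PATTERN (its definition here is an assumption, as are the helper names), also shows that a callable replacement behaves the same way:

```python
import re

# Assumption: stand-in for dateparser's DIGIT_GROUP_PATTERN,
# a regex that matches the literal four characters "\d+".
DIGIT_GROUP = re.compile(r'\\d\+')

# The text we want to appear in the final pattern, backslash included.
REPLACEMENT = r'?P<n>\d+'


def with_str_replace(pattern_text):
    # str.replace does no escape processing on either argument.
    return pattern_text.replace(r'\d+', REPLACEMENT)


def with_callable(pattern_text):
    # The return value of a callable is inserted verbatim,
    # so the replacement template parser is never involved.
    return DIGIT_GROUP.sub(lambda match: REPLACEMENT, pattern_text)


sample = r'in (\d+) minutes'
print(with_str_replace(sample))
print(with_callable(sample))
```

Both helpers avoid the template parser entirely, which is why neither needs the backslash doubled.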

Many thanks for the thorough investigation! For now I’ll make a quick fix by pinning the regex version, but in the long run we should follow @tducret’s suggestion (https://github.com/scrapinghub/dateparser/issues/1045#issuecomment-1069484011) and reform the regexes.

If anyone’s up for a PR with the fix, please go ahead!

Fine. Now expecting a new release on PyPI!

Independently arrived at the same solution as the PR; explanation for the bug here

Thanks for the fix and for writing the library in the first place. This seems to me to be one of the best date parsing libraries; we use it for a lot of data imports. Hoping for a pip release soon as well. Keep up the good work 👍

Hi. I also ran into the same problem (and thought it was a Mac M1 problem with the regex lib). It turns out to be related to the drop of Python 3.6 support in regex:

Since Python 3.6, the re module has been rejecting unknown escape sequences such as \q in patterns and escape sequences including \d in replacement templates.

As the regex module no longer supports versions of Python <3.6, I’ve brought the regex module into line with re.

Your code should now read:

pattern = DIGIT_GROUP_PATTERN.sub(r'?P<n>\\d+', pattern)

More info in mrabarnett/mrab-regex/issues/459
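The behaviour described in this comment can be reproduced with the standard re module alone (Python 3.7 or later), without dateparser or regex installed. In this sketch the compiled pattern is an assumed stand-in for DIGIT_GROUP_PATTERN and the sample string is made up for illustration:

```python
import re

# Assumption: stand-in for dateparser's DIGIT_GROUP_PATTERN,
# a regex that matches the literal four characters "\d+".
DIGIT_GROUP = re.compile(r'\\d\+')

source = r'(\d+) days ago'  # hypothetical translation pattern text

# Single backslash in the replacement template: "\d" is not a valid
# template escape, so the substitution is rejected.
message = None
try:
    DIGIT_GROUP.sub(r'?P<n>\d+', source)
    raised = False
except re.error as exc:
    raised = True
    message = str(exc)

# Doubled backslash: the template emits one literal backslash,
# so the result contains "\d+" as intended.
fixed = DIGIT_GROUP.sub(r'?P<n>\\d+', source)

print(raised, message)
print(fixed)
```

The doubled backslash is consumed by the replacement template, not by the raw string literal, which is why it is needed even inside r'...'.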

Here is a problematic pattern, but there may be more.