htmldate: Test htmldate on further web pages and report bugs

I have mostly tested htmldate on a set of English, German and French web pages that I came across while browsing or during web crawls. There are certainly other web pages, and cases in other languages, for which date extraction does not work yet.

Please install the dateparser library beforehand, as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).
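To make the XPath option more concrete, here is a minimal sketch of how a candidate expression could be tried against a page fragment before proposing it. It is not taken from core.py: the HTML snippet, the list name CANDIDATE_EXPRESSIONS and the helper first_match are all made up for illustration, and the standard library's ElementTree only supports a subset of XPath, so the real expressions may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real pages usually need an HTML parser).
SNIPPET = """
<html><body>
  <div class="article">
    <span class="post-date">13 February 2023</span>
    <p>Body text.</p>
  </div>
</body></html>
"""

# Hypothetical candidate expressions, tried in order.
CANDIDATE_EXPRESSIONS = [
    ".//time[@datetime]",
    ".//span[@class='post-date']",
]


def first_match(tree, expressions):
    """Return (expression, text) for the first expression yielding non-empty text."""
    for expr in expressions:
        for element in tree.iterfind(expr):
            text = (element.text or "").strip()
            if text:
                return expr, text
    return None


root = ET.fromstring(SNIPPET)
result = first_match(root, CANDIDATE_EXPRESSIONS)
print(result)
```

A report that names the page, the expression that matches and the text it returns is usually enough to reproduce the case.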

Thanks!

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

I guess it’s because the download fails: some websites restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html

@stevesong It already works:

$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13

@dideler Thanks, the year is detected correctly but not the month, which is contained in a <font> tag. I’ll see what I can do.

htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that’s not being picked up. Example: http://www.paulgraham.com/greatwork.html

In a fork I tried updating to the latest version and it has the same issue.
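For reference, the MONTH YEAR case described above could be handled along these lines. This is only a rough sketch using the standard library, not how htmldate works internally: the function name parse_month_year is made up, only English month names are covered, and defaulting the day to the first of the month is an assumption.

```python
import re
from datetime import date

# English month names mapped to their number (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        [
            "january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november", "december",
        ],
        start=1,
    )
}

# Matches e.g. "March 2021": a full month name followed by a 4-digit year.
MONTH_YEAR = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)\d{2})\b",
    re.IGNORECASE,
)


def parse_month_year(text):
    """Return an ISO date string for the first MONTH YEAR found, or None.

    Assumption: the day is unknown, so it defaults to the 1st of the month.
    """
    match = MONTH_YEAR.search(text)
    if match is None:
        return None
    month = MONTHS[match.group(1).lower()]
    year = int(match.group(2))
    return date(year, month, 1).isoformat()


# Hypothetical sample text: a title followed by a MONTH YEAR line.
print(parse_month_year("Some Essay Title\nMarch 2021\nBody text..."))
```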

@kinoute Thanks for pointing that out, it’s a bug.