htmldate: Test htmldate on further web pages and report bugs

I have mostly tested htmldate on a set of English, German and French web pages that I came across while browsing or during web crawls. There are certainly further web pages, and cases in other languages, for which date extraction does not work yet.

Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!

About this issue

State: open
Created 4 years ago
Comments: 15 (9 by maintainers)

Most upvoted comments

I guess it’s because the download fails: some websites restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html

adbar on Jan 17, 2024

@stevesong It already works:

$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13

adbar on Jan 16, 2024

@dideler Thanks, the year is detected correctly but not the month which is contained in a <font> tag. I’ll see what I can do.

adbar on Aug 30, 2023

htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that’s not being picked up. Example: http://www.paulgraham.com/greatwork.html

In a fork I tried updating to the latest version and it has the same issue.

dideler on Aug 22, 2023

@kinoute Thanks for pointing that out, it’s a bug.

adbar on Aug 4, 2022

wrong date found: https://web.archive.org/web/20220721013749/https://www.ksal.com/high-wheat-quality-expected-despite-yield-drop/ (it is missing the string “June 15, 2022” in an <h5> → <span> and instead picking “July 20, 2022” from a footer)
Russian language date missed: https://web.archive.org/web/20220721014119/https://www.inopressa.ru/article/09Mar2017/welt/deutschland.html (it is missing the Russian language date “9 марта 2017”)

rahulbot on Jul 21, 2022
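
To make the Russian-language case reported above concrete, here is a minimal sketch of matching a date such as “9 марта 2017” with a hand-written month table. It uses only the Python standard library. The names RU_MONTHS, RU_DATE and parse_russian_date are hypothetical and chosen for illustration; they are not part of htmldate, and this is not how htmldate or dateparser handle such dates internally.

```python
import re
from datetime import date

# Genitive month names as they appear in Russian dates ("9 марта 2017").
# Hypothetical lookup table for illustration, not taken from htmldate.
RU_MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# Day (1-2 digits), a Cyrillic word, then a four-digit year.
RU_DATE = re.compile(r"\b(\d{1,2})\s+([а-яё]+)\s+(\d{4})\b", re.IGNORECASE)


def parse_russian_date(text):
    """Return the first valid 'DD <month> YYYY' date found in text, or None."""
    for day, month_name, year in RU_DATE.findall(text):
        month = RU_MONTHS.get(month_name.lower())
        if month is None:
            continue  # the Cyrillic word was not a month name
        try:
            return date(int(year), month, int(day))
        except ValueError:
            continue  # impossible calendar date, e.g. day 31 in February
    return None


print(parse_russian_date("Опубликовано 9 марта 2017 г."))
```

A table like this only covers one date layout in one language, which is why a test page with the expected date remains the most useful thing to include in a bug report.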