htmldate: Test htmldate on further web pages and report bugs
I have mostly tested htmldate on a set of English, German and French web pages I had run into while surfing or during web crawls. There are definitely further web pages and cases in other languages for which date extraction doesn’t work so far.
Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].
Corresponding bug reports can be filed either as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).
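To make reports easy to reproduce, it may help to state the expected date next to what was actually extracted. The following is only a minimal sketch: `check_page` and the stand-in extractor are hypothetical helpers written for this illustration, not part of htmldate.

```python
from typing import Callable, Optional


def check_page(html: str, expected: str,
               extract: Callable[[str], Optional[str]]) -> str:
    """Return one report line comparing the expected and extracted dates.

    `extract` is any callable that takes an HTML string and returns a
    date string or None (hypothetical interface for this sketch).
    """
    found = extract(html)
    status = "OK" if found == expected else "MISMATCH"
    return f"{status}: expected {expected}, got {found}"


# Stand-in extractor that never finds a date, used only so the sketch
# runs without third-party packages.
def no_date(html: str) -> Optional[str]:
    return None


sample = '<html><body><span class="entry-date">July 12th, 2016</span></body></html>'
print(check_page(sample, "2016-07-12", no_date))
```

With htmldate installed, `find_date` (imported via `from htmldate import find_date`) can be passed as the `extract` argument in place of the stand-in.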
Thanks!
About this issue
- State: open
- Created 4 years ago
- Comments: 15 (9 by maintainers)
I guess it’s because the download fails: some websites restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
@stevesong It already works:
@dideler Thanks, the year is detected correctly but not the month, which is contained in a <font> tag. I’ll see what I can do.
htmldate==1.2.3, used in https://github.com/ofou/graham-essays, is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that’s not being picked up. Example: http://www.paulgraham.com/greatwork.html
In a fork I tried updating to the latest version and it has the same issue.
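As a possible workaround for pages that only carry a "MONTH YEAR" line in plain text, a small fallback could look roughly like this. It is a sketch using only the standard library, not how htmldate works internally; `find_month_year` is a hypothetical helper, the sample markup is made up, and defaulting the day to 01 is an assumption.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# English month name followed by a four-digit year, e.g. "March 2001".
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+((?:19|20)\d{2})\b")


class TextCollector(HTMLParser):
    """Collect all text nodes, ignoring which tag they sit in."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def find_month_year(html: str) -> Optional[str]:
    """Return the first 'MONTH YEAR' found as YYYY-MM-01, or None."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    match = MONTH_YEAR.search(" ".join(parser.chunks))
    if match is None:
        return None
    month = MONTHS.index(match.group(1)) + 1
    # The day is unknown, so it is set to 01 (assumption of this sketch).
    return f"{match.group(2)}-{month:02d}-01"


# Made-up markup with the date inside a <font> tag.
sample = '<td><font size="2">March 2001<br><br>Essay text starts here.</font></td>'
print(find_month_year(sample))
```

Because the first match in document order wins, a date in a header or sidebar could still be picked up before the intended one.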
@kinoute Thanks for pointing that out, it’s a bug.
<h5> → <span>
and instead picking “July 20, 2022” from a footer)