trafilatura: List of smaller extraction bugs (text & metadata)

I have mostly tested trafilatura on English, German and French web pages that I came across while browsing or during web crawls. There are certainly other web pages, and cases in other languages, for which the extraction doesn’t work yet.

Corresponding bug reports can be filed either as a list in an issue like this one, or directly in the code as XPath expressions in xpaths.py (see the BODY_XPATH and COMMENTS_XPATH lists).
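
For anyone unsure what such a contribution could look like, here is a minimal sketch of the general idea: a list of candidate XPath expressions, tried in order until one matches. It assumes that this is roughly how lists like BODY_XPATH are consumed. The expressions, the CANDIDATE_BODY_XPATHS name and the first_match helper are hypothetical and not taken from xpaths.py. It uses the standard library's ElementTree (which supports only a subset of XPath and needs well-formed markup) as a stand-in for lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate expressions, ordered from most to least specific.
# Illustrative only: these are NOT the contents of trafilatura's BODY_XPATH.
CANDIDATE_BODY_XPATHS = [
    ".//article",
    ".//div[@class='post-content']",
    ".//div[@id='main']",
]


def first_match(tree, expressions):
    """Return (expression, element) for the first expression that matches."""
    for expr in expressions:
        found = tree.find(expr)
        if found is not None:
            return expr, found
    return None, None


# Small, well-formed sample page (ElementTree cannot parse broken HTML).
SAMPLE = """<html><body>
<div class="nav">Menu</div>
<div class="post-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)
expr, node = first_match(root, CANDIDATE_BODY_XPATHS)
paragraphs = [p.text for p in node.iter("p")] if node is not None else []
print(expr, paragraphs)
```

A bug report for a page where extraction fails could then simply name the element that wraps the main text, in the form of an expression like the ones above.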

Thanks!

About this issue

State: open
Created 4 years ago
Comments: 28 (15 by maintainers)

Commits related to this issue

extraction fix: div with only lb (#4) — committed to adbar/trafilatura by adbar 2 years ago

Most upvoted comments

@sepsi77 Please note that Trafilatura can now be installed seamlessly on macOS with brew: https://formulae.brew.sh/formula/trafilatura

adbar on Oct 12, 2023

I can’t seem to get it to work. I’m new to this level of system tweaking. The installation fails because of missing precompiled Cython files, and running it with the --without-cython flag doesn’t work either.

RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).

I think I’ll just move the script into a Docker container and see if that helps.

sepsi77 on Oct 10, 2023

Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: <h2 class="indication">به زبانهای دیگر</h3>. Since the <h2> is never properly closed, it implies that everything that follows is a title.
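
To check a page for this kind of problem before filing a report, a rough sketch like the following may help. It is not part of trafilatura: the TagMismatchFinder class is a hypothetical helper built on the standard library's html.parser, and it only keeps a simple stack of open tags, so it will also flag markup that browsers tolerate (for example an omitted closing </p>).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagMismatchFinder(HTMLParser):
    """Hypothetical helper: records closing tags that do not match
    the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.mismatches = []  # (expected tag, found tag, line number)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.mismatches.append((expected, tag, self.getpos()[0]))


finder = TagMismatchFinder()
finder.feed('<div><h2 class="indication">به زبانهای دیگر</h3><p>Text</p></div>')
finder.close()
print(finder.mismatches[0])
```

Only the first reported mismatch is usually meaningful. Because the <h2> stays on the stack, later closing tags are reported as knock-on mismatches too.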

Please note that the extraction doesn’t work as well on homepages in general.

adbar on Oct 9, 2023

Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That’s definitely a case to watch for.

EDIT: for the archived version of the page I now get the same problem as you.

adbar on Feb 17, 2022