trafilatura: List of smaller extraction bugs (text & metadata)

I have mostly tested trafilatura on a set of English, German and French web pages that I came across while browsing or during web crawls. There are certainly other web pages, and cases in other languages, for which the extraction does not work yet.

Corresponding bug reports can be filed either as a list in an issue like this one, or directly in the code as XPath expressions in xpaths.py (see the BODY_XPATH and COMMENTS_XPATH lists).
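To make the XPath route more concrete, here is a minimal sketch of how a list of candidate expressions can be tried against a page until one matches. The list name CANDIDATE_BODY_XPATHS, the helper first_match and the expressions themselves are hypothetical and are not taken from xpaths.py. The sketch uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup; trafilatura itself works with lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical candidate expressions, for illustration only
# (not copied from trafilatura's BODY_XPATH list).
CANDIDATE_BODY_XPATHS = [
    ".//article",
    ".//div[@class='post-content']",
    ".//div[@id='main']",
]


def first_match(tree, expressions):
    """Return the first expression that selects a node, together with that node."""
    for expr in expressions:
        node = tree.find(expr)
        if node is not None:
            return expr, node
    return None, None


# A small, well-formed test page (ElementTree cannot parse broken HTML).
page = ET.fromstring(
    "<html><body>"
    "<div class='nav'>Menu</div>"
    "<div class='post-content'><p>Article text.</p></div>"
    "</body></html>"
)

expr, node = first_match(page, CANDIDATE_BODY_XPATHS)
print(expr, "->", "".join(node.itertext()))
```

A bug report in this form would then amount to one more expression that selects the main text of the page where extraction currently fails.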

Thanks!

About this issue

  • State: open
  • Created 4 years ago
  • Comments: 28 (15 by maintainers)

Most upvoted comments

@sepsi77 Please note that Trafilatura can now be installed seamlessly on macOS with brew: https://formulae.brew.sh/formula/trafilatura

I can't seem to get it to work. I'm new to this level of system tweaking. The installation fails because of missing precompiled Cython files, and running it with the --without-cython flag doesn't work either.

RuntimeError: ERROR: Trying to build without Cython, but pre-generated 'src/lxml/etree.c' is not available (to ignore this error, pass --without-cython or set environment variable WITHOUT_CYTHON=true).

I think I’ll just move the script into a Docker container and see if that helps.

Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: <h2 class="indication">به زبانهای دیگر</h3> Since the heading is never properly closed, everything that follows is treated as part of a title.
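A quick way to check a page for this kind of problem is to track open tags on a stack and report any closing tag that does not match the innermost open one. The following is only a rough sketch using Python's standard html.parser; the class TagMismatchFinder and the function find_tag_mismatches are hypothetical helpers, not part of trafilatura.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagMismatchFinder(HTMLParser):
    """Collect (innermost_open_tag, closing_tag) pairs that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close at once: nothing to track.
        pass

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.mismatches.append((expected, tag))


def find_tag_mismatches(html):
    finder = TagMismatchFinder()
    finder.feed(html)
    finder.close()
    return finder.mismatches


snippet = '<div><h2 class="indication">به زبانهای دیگر</h3><p>Body text</p></div>'
mismatches = find_tag_mismatches(snippet)
print(mismatches[0])
```

The first reported pair points at the original error (an h2 closed by an h3). Later entries are usually knock-on effects, because the unclosed heading stays on the stack.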

Please note that the extraction doesn’t work as well on homepages in general.

Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for.

EDIT: for the archived version of the page I now get the same problem as you.