trafilatura: Cannot extract Heading tags

Hi,

It’s weird that Trafilatura sometimes fails to extract simple heading tags such as h2 or h3.

Example URLs: https://www.nextpcb.com/blog/bill-of-material, https://www.protoexpress.com/blog/controlled-impedance-really-matters/

I even tried unwrapping the ID, a, span, and b tags using BS4 before sending the source to Trafilatura, but it still fails to extract the h2 and h3 headings.

The format I am using is: trafilatura.extract(str(soup), output_format='xml', include_comments=False)
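
To narrow down which headings go missing, one option is to compare the h2/h3 headings in the source HTML with the headings in the extracted XML. The following is only a sketch using the standard library. It assumes that the XML output marks headings as <head> elements (for example <head rend="h2">) and that the output parses as a single XML document. The helper names (HeadingCollector, source_headings, extracted_headings, missing_headings) are made up for this illustration and are not part of Trafilatura.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadingCollector(HTMLParser):
    """Collect the text of every <h2> and <h3> element in an HTML string."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading texts, in document order
        self._buffer = None   # text parts of the heading currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (span, a, b) is collected as well.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("h2", "h3") and self._buffer is not None:
            # Collapse runs of whitespace so the comparison is not layout-sensitive.
            self.headings.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def source_headings(html):
    """Return the h2/h3 heading texts found in the raw HTML."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def extracted_headings(xml_output):
    """Return heading texts from the XML output (assumption: <head> elements)."""
    root = ET.fromstring(xml_output)
    return [" ".join("".join(el.itertext()).split()) for el in root.iter("head")]


def missing_headings(html, xml_output):
    """Return the source headings that do not appear in the extracted XML."""
    found = set(extracted_headings(xml_output))
    return [h for h in source_headings(html) if h not in found]
```

Here html would be the page source and xml_output the string returned by the extract call above. An empty list from missing_headings(html, xml_output) means every h2/h3 in the source was found in the output.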

About this issue

State: closed
Created a year ago
Comments: 26 (10 by maintainers)

Commits related to this issue

fix heading extraction in compare_extraction() for recall (#354) — committed to adbar/trafilatura by adbar a year ago
minor extraction fixes (#356), tackles #318 & #354 * fix heading extraction in compare_extraction() for recall (#354) * add content XPath and clean code * add template element fix — committed to adbar/trafilatura by adbar a year ago

Most upvoted comments

@elrob @fortyfourforty I just opened a new issue to follow up on this.

adbar on Jun 14, 2023

@elrob I just tested the USA Today URL you provided and I see the <h2> title in the output.

It makes much more sense to use the whole HTML. Trafilatura relies on a series of markup cues to find and extract the main content, and this won’t work well if you strip parts of the document beforehand.

adbar on Jun 13, 2023

@elrob The heuristics you mentioned may be conflicting. The first lines are meant to save time, but maybe there is a better way. The test cases used in the evaluation sometimes include titles. Even so, websites vary a lot in their layout, so there can definitely be cases which are not accounted for.

Do you have usable test cases at hand? Even a minimal HTML document would be helpful. As the processing gets more convoluted it gets more difficult to test for all potential cases.

adbar on Jun 13, 2023
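
A minimal HTML test case of the kind requested above could look like the sketch below. The document content, the MINIMAL_HTML constant and the heading_survives helper are all invented for illustration and are not taken from the thread. The extractor is passed in as a function so that the check does not depend on a particular Trafilatura version. As in the earlier sketch, it assumes that headings show up as <head> elements in the XML output.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document: one h1, one h2, and two paragraphs.
MINIMAL_HTML = """<!DOCTYPE html>
<html>
<head><title>Heading test</title></head>
<body>
<article>
<h1>Main title</h1>
<p>An introductory paragraph with enough text to look like main content.</p>
<h2>Section heading</h2>
<p>A paragraph that follows the section heading.</p>
</article>
</body>
</html>"""


def heading_survives(html, extract, heading):
    """Return True if `heading` appears as a <head> element in extract(html).

    `extract` is any callable that takes an HTML string and returns an XML
    string, or None when nothing could be extracted.
    """
    xml_output = extract(html)
    if xml_output is None:
        return False
    root = ET.fromstring(xml_output)
    return any(
        "".join(el.itertext()).strip() == heading for el in root.iter("head")
    )
```

With Trafilatura installed, the check could be run as heading_survives(MINIMAL_HTML, lambda h: trafilatura.extract(h, output_format='xml', include_comments=False), "Section heading"). Whether it returns True depends on the Trafilatura version and on the markup, which is exactly what such a test case is meant to pin down.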