trafilatura: Cannot extract Heading tags

Hi,

It’s weird that Trafilatura sometimes fails to extract simply structured heading tags like h2 or h3.

Example URLs: https://www.nextpcb.com/blog/bill-of-material and https://www.protoexpress.com/blog/controlled-impedance-really-matters/

I even tried unwrapping the ID, a, span, and b tags using BS4 before sending the source to Trafilatura, but it still fails to extract the h2 or h3 headings.

The format I am using is: trafilatura.extract(str(soup), output_format='xml', include_comments=False)

About this issue

  • State: closed
  • Created a year ago
  • Comments: 26 (10 by maintainers)

Most upvoted comments

@elrob @fortyfourforty I just opened a new issue to follow up on this.

@elrob I just tested the USA Today URL you provided and I see the <h2> title in the output.

It makes much more sense to pass the whole HTML. Trafilatura uses a series of markup cues to find and extract the main content, and this won’t work well if you strip parts of the document beforehand.
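To make that concrete, here is a minimal sketch of passing the unmodified page source to Trafilatura and then listing the headings that ended up in the XML output. The helper names extract_xml and headings_in_xml are made up for illustration, and the sketch assumes that headings are serialized in the XML output as <head rend="h2"> style elements; adjust the tag name if your version differs.

```python
import xml.etree.ElementTree as ET


def extract_xml(raw_html):
    """Run Trafilatura on the unmodified page source (no BS4 pre-processing)."""
    # Third-party import kept local so headings_in_xml() works without it.
    import trafilatura

    return trafilatura.extract(
        raw_html, output_format="xml", include_comments=False
    )


def headings_in_xml(xml_output):
    """Return (level, text) pairs for heading elements in the XML output.

    Assumption: headings appear as <head rend="h2">...</head> elements.
    """
    if not xml_output:  # extract() returns None when nothing was extracted
        return []
    root = ET.fromstring(xml_output)
    return [
        (el.get("rend", ""), "".join(el.itertext()).strip())
        for el in root.iter("head")
    ]
```

Calling headings_in_xml(extract_xml(page_source)) on the raw download, rather than on str(soup) after unwrapping tags, gives a quick view of which headings survived extraction.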

@elrob The heuristics you mentioned may be conflicting. The first lines are meant to save time, but maybe there is a better way. The test cases used in the evaluation sometimes include titles. Even so, websites vary a lot in their layout, so there can definitely be cases which are not accounted for.

Do you have usable test cases at hand? Even a minimal HTML document would be helpful. As the processing gets more convoluted, it gets more difficult to test for all potential cases.
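A minimal test case of the kind requested above could look like the following sketch. The HTML document is invented for illustration and is not taken from the reported pages, so it may or may not reproduce the problem. The helpers source_headings and missing_headings are hypothetical names; they only compare heading text found in the source with whatever string the extractor returned.

```python
from html.parser import HTMLParser

# Hypothetical minimal document: headings wrapped in span/b, as in the report.
MINIMAL_HTML = """<html><body><article>
<h1>Bill of materials</h1>
<p>A bill of materials lists the parts needed to build a product.</p>
<h2><span>What is a BOM?</span></h2>
<p>It names every component, its quantity and its reference.</p>
<h3>Types of <b>BOM</b></h3>
<p>Engineering and manufacturing variants are common.</p>
</article></body></html>"""


class HeadingCollector(HTMLParser):
    """Collect the text of h1..h6 elements, including nested inline tags."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # list of text chunks while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.headings.append(text)
            self._current = None


def source_headings(html):
    """Return heading texts in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    return collector.headings


def missing_headings(html, extracted):
    """Return source headings whose text is absent from the extractor output.

    Plain substring check; XML-escaped characters (e.g. &amp;) are not handled.
    """
    haystack = " ".join((extracted or "").split())
    return [h for h in source_headings(html) if h not in haystack]
```

With Trafilatura installed, missing_headings(MINIMAL_HTML, trafilatura.extract(MINIMAL_HTML, output_format='xml', include_comments=False)) returns the headings that were dropped, which makes a short, self-contained report to attach to an issue.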