trafilatura: Cannot extract Heading tags
Hi,
It’s weird that Trafilatura sometimes fails to extract simply structured heading tags like h2 or h3.
Example URLs:
- https://www.nextpcb.com/blog/bill-of-material
- https://www.protoexpress.com/blog/controlled-impedance-really-matters/
I even tried unwrapping the ID, a, span, b tags using BS4 before sending the source to Trafilatura, but it still fails to extract the h2 or h3 headings.
The format I am using is: trafilatura.extract(str(soup), output_format='xml', include_comments=False)
About this issue
- State: closed
- Created a year ago
- Comments: 26 (10 by maintainers)
Commits related to this issue
- fix heading extraction in compare_extraction() for recall (#354) — committed to adbar/trafilatura by adbar a year ago
- minor extraction fixes (#356), tackles #318 & #354 * fix heading extraction in compare_extraction() for recall (#354) * add content XPath and clean code * add template element fix — committed to adbar/trafilatura by adbar a year ago
@elrob @fortyfourforty I just opened a new issue to follow up on this.
@elrob I just tested the USA Today URL you provided and I see the <h2> title in the output.
It makes much more sense to pass the whole HTML: Trafilatura uses a series of markup cues to find and extract the main content, and this won’t work well if you strip parts of the document beforehand.
@elrob The heuristics you mentioned may be conflicting; the first lines are meant to save time, but maybe there is a better way. The test cases used in the evaluation sometimes include titles. Even so, websites vary a lot in their layout, so there can definitely be cases that are not accounted for.
Do you have usable test cases at hand? Even a minimal HTML document would be helpful. As the processing gets more convoluted, it gets more difficult to test for all potential cases.
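For anyone who wants to put together such a test case, here is a rough sketch of what one could look like. The sample document, the `HeadingCollector` class and the helper names are made up for illustration and are not part of Trafilatura. The only Trafilatura call used is `trafilatura.extract()` with the same arguments as in the original report. The check is a plain substring match on the heading text, so it is only an approximation (XML escaping of characters like `&` would cause false alarms).

```python
from html.parser import HTMLParser

# Hypothetical minimal document: one article with nested headings.
MINIMAL_HTML = """<html><body><article>
<h1>Main title</h1>
<p>Some introductory text that is long enough to look like real content.</p>
<h2>First section</h2>
<p>Body text of the first section, again reasonably long.</p>
<h3>Subsection</h3>
<p>Body text of the subsection.</p>
</article></body></html>"""

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for every heading in the source HTML."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [tag, text] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current = [tag, ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            self.headings.append((tag, self._current[1].strip()))
            self._current = None


def collect_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


def missing_headings(html, extracted):
    """Headings present in the source whose text is absent from the output."""
    extracted = extracted or ""  # extract() can return None
    return [(tag, text) for tag, text in collect_headings(html)
            if text and text not in extracted]


def report(html):
    """Run Trafilatura on the whole, unmodified HTML and list lost headings."""
    import trafilatura  # third-party; imported lazily so the helpers work without it
    output = trafilatura.extract(html, output_format="xml", include_comments=False)
    return missing_headings(html, output)
```

Calling `report(MINIMAL_HTML)` (or the same function on the saved source of a failing page) should return an empty list if all headings survive extraction. Anything it returns is a candidate for a bug report, ideally together with the trimmed-down HTML that reproduces it.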