article-extractor: Crashes on Pinterest and a lot of other websites

Pages to test on:

Code:

import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)

Error:

TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
    at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
    at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

I presume the bug is somewhere inside the linkedom package, in its DOMParser class.
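
Until the root cause is found, a caller can at least keep the process alive when a page triggers this error. The following is a minimal sketch, not part of the library: `safeExtract` is a hypothetical helper that receives the extractor function as a parameter (in practice this would be `extract` from `@extractus/article-extractor`) and returns `null` instead of throwing.

```javascript
// Hypothetical helper, not part of @extractus/article-extractor.
// `extractFn` is any async function that takes a URL, e.g. the library's `extract`.
async function safeExtract(extractFn, url) {
  try {
    return await extractFn(url)
  } catch (err) {
    // The Readability failure shown above surfaces here as a TypeError.
    console.error(`extract failed for ${url}: ${err.message}`)
    return null
  }
}
```

Usage would look like `const article = await safeExtract(extract, input)`, with a `null` result meaning the page should be skipped. This only hides the crash; it does not fix the underlying parsing problem.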

About this issue

State: open
Created 5 months ago
Comments: 16 (7 by maintainers)

Most upvoted comments

Or can I pass my DOM object instead of the HTML as a string?

That’s a point I mentioned before. I’m going to do that in the new extractus project.

SettingDust on Jan 29, 2024

After some tests: Happy DOM doesn’t support many pseudo-classes such as :not and :is, so it isn’t an option for now. I’ll look into linkedom and Readability for this issue.

SettingDust on Jan 26, 2024
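
The pseudo-class gap mentioned above can be checked directly against any DOM implementation. A minimal sketch, assuming a hypothetical helper `probeSelector` (not part of Happy DOM, linkedom, or Readability) that accepts any object exposing a `querySelector` method:

```javascript
// Hypothetical helper: reports how a DOM implementation handles a selector.
// `doc` is any object with a querySelector method, e.g. a parsed document.
function probeSelector(doc, selector) {
  try {
    // A supported selector that matches an element returns that element.
    return doc.querySelector(selector) ? 'match' : 'no-match'
  } catch (err) {
    // Some implementations throw on selectors they cannot parse.
    return 'throws'
  }
}
```

To use it, parse a small fragment that is known to contain a matching element with the DOM library under test, then probe selectors such as `p:not(.a)` or `:is(p)`. Any result other than `'match'` suggests the pseudo-class is not handled.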