article-extractor: Crashes on Pinterest and a lot of other websites
Pages to test on:
Code:
import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)
Error:
TypeError: Cannot read properties of null (reading 'tagName')
    at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
    at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
    at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
    at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
    at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
I presume the bug is somewhere inside the linkedom package, in its DOMParser class.
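Until the root cause is found, one possible stopgap is to guard the call so a single bad page does not take down the whole process. This is only a minimal sketch: `safeExtract` is a hypothetical helper and not part of the library, and `extractFn` stands in for the `extract` function imported above. It assumes the failure surfaces as an exception or rejected promise, as the stack trace suggests.

```javascript
// Hypothetical helper, not part of @extractus/article-extractor.
// `extractFn` is expected to be the library's `extract` function (or any
// function with the same shape: takes a URL, returns a promise).
async function safeExtract (extractFn, url) {
  try {
    return await extractFn(url)
  } catch (err) {
    // The crash reported above reaches the caller as a TypeError
    // thrown from inside Readability; log it and carry on.
    console.warn(`extract failed for ${url}: ${err.message}`)
    return null
  }
}
```

Usage would look like `const article = await safeExtract(extract, input)`, with a `null` result meaning the page could not be parsed. This hides the symptom only; it does not fix the parsing bug.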
About this issue
- Original URL
- State: open
- Created 5 months ago
- Comments: 16 (7 by maintainers)
 
That’s a point I mentioned before. I’m going to do that in the new extractus project.
After some tests: Happy DOM doesn’t support many pseudo-classes such as :not and :is, so it isn’t an option for now. I’ll look into linkedom and Readability for this issue.