article-extractor: Crashes on Pinterest and a lot of other websites
Pages to test on:
Code:
import { extract } from '@extractus/article-extractor'
const input = 'https://www.pinterest.ca/variamsingh87/'
await extract(input)
Error:
TypeError: Cannot read properties of null (reading 'tagName')
at Readability._grabArticle (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:1150:37)
at Readability.parse (/Users/vasyl/code/killme/node_modules/@mozilla/readability/Readability.js:2277:31)
at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/extractWithReadability.js:18:25)
at file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:88:14
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:38
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
at file:///Users/vasyl/code/killme/node_modules/bellajs/src/utils/pipe.js:4:40
at default (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/utils/parseFromHtml.js:98:19)
at extract (file:///Users/vasyl/code/killme/node_modules/@extractus/article-extractor/src/main.js:24:10)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
I presume that the bug is somewhere inside the DOMParser class of the linkedom package.
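Until the underlying bug is fixed, a caller can at least keep the TypeError from taking down the whole process. The following is only a minimal sketch of such a guard, not part of the library: `safeExtract` is a hypothetical helper name, and it takes the extractor function as a parameter so it can wrap `extract` or any other async function.

```javascript
// Hypothetical helper (not part of article-extractor): wraps any async
// extractor so that a thrown error becomes a null result instead of a crash.
async function safeExtract (extractFn, url) {
  try {
    return await extractFn(url)
  } catch (err) {
    // Log the failure and let the caller decide how to handle a null result
    console.error(`Extraction failed for ${url}: ${err.message}`)
    return null
  }
}

// Usage, assuming the package is installed and the file is an ES module:
//   import { extract } from '@extractus/article-extractor'
//   const article = await safeExtract(extract, 'https://www.pinterest.ca/variamsingh87/')
//   if (article === null) { /* skip this page */ }
```

This only hides the symptom: pages that trigger the error still yield no article.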
About this issue
- State: open
- Created 5 months ago
- Comments: 16 (7 by maintainers)
That’s a point I mentioned before. I’m going to do that in the new extractus project.
After some tests: Happy DOM doesn’t support many pseudo-classes such as :not and :is, which means it isn’t an option for now. I’ll look into linkedom and readability for this issue.
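One way to compare DOM implementations on pseudo-class support is a small probe like the one below. It is a rough sketch under assumptions, not the maintainer's actual test: `supportsSelector` is a hypothetical helper, and it assumes the document object exposes a standard `querySelectorAll`. Because an implementation may either throw on an unsupported selector or silently return the wrong matches, the probe checks the match count against an expected value.

```javascript
// Hypothetical probe: returns true only if the document accepts the selector
// and it matches the expected number of elements.
function supportsSelector (doc, selector, expectedCount) {
  try {
    return doc.querySelectorAll(selector).length === expectedCount
  } catch (err) {
    // Some implementations throw on selectors they cannot parse
    return false
  }
}

// Usage, assuming linkedom is installed:
//   import { parseHTML } from 'linkedom'
//   const { document } = parseHTML('<p class="a"></p><p></p>')
//   supportsSelector(document, 'p:not(.a)', 1)
```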