wallabag: Wallabag and f43.me extract fails for spectrum.ieee.org
Issue details
https://spectrum.ieee.org/energywise/energy/renewables/egypts-massive-18gw-benban-solar-park-nears-completion
My very cursory testing indicates that this issue affects Wallabag fetching from Spectrum IEEE in general.
Environment
- wallabag version (or git revision) that exhibits the issue: 2.3.8
- How did you install wallabag? Via git clone or by downloading the package? git clone
- Last wallabag version that did not exhibit the issue (if applicable): ---
- php version: 7.3.9
- OS: Debian 9.11
- type of hosting (shared or dedicated): dedicated
- which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL): MariaDB
Steps to reproduce/test case
Add the link above to Wallabag or to https://f43.me/feed/test.
Wallabag says “wallabag can’t retrieve contents for this article”, and f43 indicates that a site config for spectrum.ieee.org exists and is used, but then it fails to detect the article body and eventually fails with “Extract failed”.
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (3 by maintainers)
Commits related to this issue
- Update spectrum.ieee.org.txt more fixes needed as mentioned here: https://github.com/wallabag/wallabag/issues/4154#issuecomment-1722552925 — committed to HolgerAusB/ftr-site-config by HolgerAusB 9 months ago
- Update spectrum.ieee.org.txt (#1204) * Update spectrum.ieee.org.txt * Update spectrum.ieee.org.txt more fixes needed as mentioned here: https://github.com/wallabag/wallabag/issues/4154#issuec... — committed to fivefilters/ftr-site-config by HolgerAusB 9 months ago
strip_id_or_class: time-to-read
I think they changed things meanwhile. With my new site_config, you can catch the content even without wallabagger.
@fergbrain go to your wallabag installation and change the content of the file
vendor/j0k3r/graby-site-config/spectrum.ieee.org.txt
to:
then clear cache
sudo -u www-data /path-to-wallabag/bin/console --env=prod cache:clear
retry to fetch your article
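Before clearing the cache, it may help to check that the edited site config file still reads as intended. The following is a minimal sketch, assuming the file uses the plain "directive: value" line format with "#" comments that ftr-site-config files generally follow. The function name parse_site_config and the EXAMPLE content (including its XPath and test URL) are hypothetical illustrations, not the actual spectrum.ieee.org config.

```python
from collections import defaultdict


def parse_site_config(text):
    """Parse site-config style text into {directive: [values]}.

    Simplified sketch: assumes one "directive: value" pair per line
    and "#" for comment lines.
    """
    directives = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; XPath values and URLs contain colons too.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "directive: value" line
        directives[name.strip()].append(value.strip())
    return dict(directives)


# Hypothetical example content; NOT the real spectrum.ieee.org config.
EXAMPLE = """\
# illustrative only
body: //div[@class='article-body']
strip_id_or_class: time-to-read
test_url: https://example.org/some-article
"""

config = parse_site_config(EXAMPLE)
for name, values in config.items():
    print(name, values)
```

To check a real installation, the same function could be fed the text of vendor/j0k3r/graby-site-config/spectrum.ieee.org.txt read from disk. A directive that is missing from the output usually points to a typo in the edited file.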
not after the catch. But we can try to write a site-config so that only the relevant parts are caught. But as I wrote: when I find the time.
Both