wallabag: Wallabag and f43.me extract fails for spectrum.ieee.org

Issue details

https://spectrum.ieee.org/energywise/energy/renewables/egypts-massive-18gw-benban-solar-park-nears-completion

My cursory testing suggests that this issue affects Wallabag fetching from IEEE Spectrum in general.

Environment

wallabag version (or git revision) that exhibits the issue: 2.3.8
How did you install wallabag? Via git clone or by downloading the package? git clone
Last wallabag version that did not exhibit the issue (if applicable): ---
php version: 7.3.9
OS: Debian 9.11
type of hosting (shared or dedicated): dedicated
which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL): MariaDB

Steps to reproduce/test case

Add the link above to Wallabag or to https://f43.me/feed/test. Wallabag says “wallabag can’t retrieve contents for this article”, and f43 indicates that a site config for spectrum.ieee.org exists and is used, but then it fails to detect the article body and eventually fails with “Extract failed”.

About this issue

State: closed
Created 5 years ago
Comments: 15 (3 by maintainers)

Commits related to this issue

Update spectrum.ieee.org.txt more fixes needed as mentioned here: https://github.com/wallabag/wallabag/issues/4154#issuecomment-1722552925 — committed to HolgerAusB/ftr-site-config by HolgerAusB 9 months ago
Update spectrum.ieee.org.txt (#1204) * Update spectrum.ieee.org.txt * Update spectrum.ieee.org.txt more fixes needed as mentioned here: https://github.com/wallabag/wallabag/issues/4154#issuec... — committed to fivefilters/ftr-site-config by HolgerAusB 9 months ago

Most upvoted comments

The last link can’t work with my config because it’s outside the ‘spectrum’ subdomain. Are there many articles in that part that are worth looking at too?
The link with the reading time is in a sub-folder /amp/, which seems to have a different format. I need more time to fix that special case. Meanwhile, just strip the ‘/amp/’ from the URL.
If you want to keep the reading time, remove or comment out the line strip_id_or_class: time-to-read
Try the new config below to fix the other articles:

body: //div[contains(@class, 'main')]

strip_id_or_class: hide-tablet-and-desktop
strip_id_or_class: hidden
strip_id_or_class: main-menu-wrapper
strip_id_or_class: menu-overlay
strip_id_or_class: topbar
strip_id_or_class: huge-menu
strip_id_or_class: gated-popup
strip_id_or_class: lightbox-popup
strip_id_or_class: widget__shares
strip_id_or_class: all-related-sections
strip_id_or_class: widget__headline
strip_id_or_class: time-to-read
strip_id_or_class: social-author
strip_id_or_class: social-date
strip_id_or_class: article__comments
strip_id_or_class: read_also_posts
strip_id_or_class: tags__item
strip_id_or_class: cta-membership
strip_id_or_class: custom-field-lightbox-img-shortcode-ids

strip: //i
strip: //div[@class='popular_widget']

find_string: data-runner-src="
replace_string: data-src="

# embed YT video
find_string: data-runner-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
find_string: data-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
replace_string( width="100%">):  

prune: no

test_url: https://spectrum.ieee.org/raspberry-pi-cluster-computer
test_url: https://spectrum.ieee.org/egypts-massive-18gw-benban-solar-park-nears-completion
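
As a stopgap for the /amp/ links mentioned above, the URL can be cleaned before it is handed to wallabag. The following is only a minimal sketch: the helper name strip_amp_segment is hypothetical, and the exact shape of the AMP URLs on spectrum.ieee.org is an assumption (an "amp" path segment somewhere in the path).

```python
from urllib.parse import urlsplit, urlunsplit


def strip_amp_segment(url: str) -> str:
    """Return the URL with every path segment equal to 'amp' removed.

    Hypothetical helper: assumes the AMP variant differs from the
    regular article URL only by an extra '/amp/' path segment.
    """
    parts = urlsplit(url)
    # Drop only whole segments named 'amp'; 'amplifier' and the like are kept.
    segments = [seg for seg in parts.path.split("/") if seg != "amp"]
    path = "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Assumed example URL, built from the test_url in the config above.
    print(strip_amp_segment("https://spectrum.ieee.org/amp/raspberry-pi-cluster-computer"))
```

URLs without an "amp" segment pass through unchanged, so the helper can be applied to every link.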

HolgerAusB on Sep 18, 2023

I think they have changed things in the meantime. With my new site config, you can fetch the content even without wallabagger.

@fergbrain go to your wallabag installation and change the content of the file vendor/j0k3r/graby-site-config/spectrum.ieee.org.txt to:

body: //div[contains(@class, 'main')]

strip_id_or_class: hide-tablet-and-desktop
strip_id_or_class: hidden
strip_id_or_class: main-menu-wrapper
strip_id_or_class: menu-overlay
strip_id_or_class: topbar
strip_id_or_class: huge-menu
strip_id_or_class: gated-popup
strip_id_or_class: lightbox-popup
strip_id_or_class: widget__shares
strip_id_or_class: all-related-sections
strip_id_or_class: widget__headline
strip_id_or_class: time-to-read
strip_id_or_class: social-author
strip_id_or_class: social-date
strip_id_or_class: article__comments
strip_id_or_class: read_also_posts
strip_id_or_class: tags__item

strip: //i
strip: //div[@class='popular_widget']
strip: (//div[@class='non_member_follow'])[last()]

find_string: data-runner-src="
replace_string: data-src="

# embed YT video
find_string: data-runner-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
find_string: data-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
replace_string( width="100%">):  

prune: no

test_url: https://spectrum.ieee.org/raspberry-pi-cluster-computer
test_url: https://spectrum.ieee.org/egypts-massive-18gw-benban-solar-park-nears-completion

Then clear the cache:

sudo -u www-data /path-to-wallabag/bin/console --env=prod cache:clear

Finally, retry fetching your article.
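
To see what the find_string / replace_string pairs in the site config above are meant to do, here is a rough Python sketch. It assumes the pairs are applied as literal substring replacements on the raw HTML, in the order listed; the function name and the sample HTML are made up for illustration and are not the actual Graby code.

```python
from typing import Iterable, Tuple

# The three complete pairs from the site config, in file order.
PAIRS = [
    ('data-runner-src="', 'data-src="'),
    ('data-runner-src="https://www.youtube.com/embed/',
     'src="https://www.youtube.com/embed/'),
    ('data-src="https://www.youtube.com/embed/',
     'src="https://www.youtube.com/embed/'),
]


def apply_replacements(html: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Apply each (find, replace) pair as a plain substring replacement, in order."""
    for find, replace in pairs:
        html = html.replace(find, replace)
    return html


if __name__ == "__main__":
    # Made-up sample markup: one lazy-loaded YouTube iframe and one lazy-loaded image.
    sample = (
        '<iframe data-runner-src="https://www.youtube.com/embed/abc123"></iframe>'
        '<img data-runner-src="https://example.org/a.jpg">'
    )
    print(apply_replacements(sample, PAIRS))
```

Under this ordering assumption, the first pair already rewrites every data-runner-src attribute to data-src, so the YouTube iframe is picked up by the third pair and ends up with a plain src attribute, while other lazy-loaded elements keep data-src.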

HolgerAusB on Sep 17, 2023

Not after the fetch. But we can try to write a site config so that only the relevant parts are fetched. But as I wrote: when I find the time.

HolgerAusB on Sep 17, 2023

Both

j0k3r on Jul 15, 2023