wallabag: Wallabag and f43.me extract fails for spectrum.ieee.org

Issue details

https://spectrum.ieee.org/energywise/energy/renewables/egypts-massive-18gw-benban-solar-park-nears-completion My very cursory testing indicates that this issue affects Wallabag fetching from Spectrum IEEE in general.

Environment

  • wallabag version (or git revision) that exhibits the issue: 2.3.8
  • How did you install wallabag? Via git clone or by downloading the package? git clone
  • Last wallabag version that did not exhibit the issue (if applicable): ---
  • php version: 7.3.9
  • OS: Debian 9.11
  • type of hosting (shared or dedicated): dedicated
  • which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL): MariaDB

Steps to reproduce/test case

Add the link above to Wallabag or to https://f43.me/feed/test. Wallabag says “wallabag can’t retrieve contents for this article”, and f43 indicates that a site config for spectrum.ieee.org exists and is used, but then it fails to detect the article body and eventually fails with “Extract failed”.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

  • The last link can’t work with my config because it’s outside the ‘spectrum’ subdomain. Are there many articles worth looking at in that part too?
  • The link with the reading time is in a sub-folder, /amp/, which seems to have a different format. I need more time to fix that special case. Meanwhile, just strip ‘/amp/’ from the URL.
  • If you want to keep the reading time, remove or comment out the line strip_id_or_class: time-to-read.
  • Try the new config below to fix the other articles:
body: //div[contains(@class, 'main')]

strip_id_or_class: hide-tablet-and-desktop
strip_id_or_class: hidden
strip_id_or_class: main-menu-wrapper
strip_id_or_class: menu-overlay
strip_id_or_class: topbar
strip_id_or_class: huge-menu
strip_id_or_class: gated-popup
strip_id_or_class: lightbox-popup
strip_id_or_class: widget__shares
strip_id_or_class: all-related-sections
strip_id_or_class: widget__headline
strip_id_or_class: time-to-read
strip_id_or_class: social-author
strip_id_or_class: social-date
strip_id_or_class: article__comments
strip_id_or_class: read_also_posts
strip_id_or_class: tags__item
strip_id_or_class: cta-membership
strip_id_or_class: custom-field-lightbox-img-shortcode-ids

strip: //i
strip: //div[@class='popular_widget']

find_string: data-runner-src="
replace_string: data-src="

# embed YT video
find_string: data-runner-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
find_string: data-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
replace_string( width="100%">):  

prune: no

test_url: https://spectrum.ieee.org/raspberry-pi-cluster-computer
test_url: https://spectrum.ieee.org/egypts-massive-18gw-benban-solar-park-nears-completion
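
To make the two URL workarounds from the bullets above concrete (the config only covers the ‘spectrum’ subdomain, and ‘/amp/’ should be stripped for now), here is a minimal Python sketch. The function names and the example /amp/ URL are my own assumptions for illustration; they are not part of wallabag or Graby, and the real /amp/ links may look different.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helpers, not part of wallabag or Graby.

def covered_by_config(url: str) -> bool:
    """True if the URL's host is the one the site config is written for."""
    return urlsplit(url).hostname == "spectrum.ieee.org"

def strip_amp(url: str) -> str:
    """Return the URL with any path segment equal to 'amp' removed."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s != "amp"]
    return urlunsplit(parts._replace(path="/".join(segments)))

# Assumed shape of an /amp/ link; adjust to whatever the real link looks like.
amp_url = "https://spectrum.ieee.org/amp/raspberry-pi-cluster-computer"
print(covered_by_config(amp_url))
print(strip_amp(amp_url))
```

You would run strip_amp on the link before adding it to wallabag. URLs without an ‘amp’ segment pass through unchanged.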

I think they have changed things in the meantime. With my new site config, you can fetch the content even without Wallabagger.

@fergbrain go to your wallabag installation and change the content of the file vendor/j0k3r/graby-site-config/spectrum.ieee.org.txt to:

body: //div[contains(@class, 'main')]

strip_id_or_class: hide-tablet-and-desktop
strip_id_or_class: hidden
strip_id_or_class: main-menu-wrapper
strip_id_or_class: menu-overlay
strip_id_or_class: topbar
strip_id_or_class: huge-menu
strip_id_or_class: gated-popup
strip_id_or_class: lightbox-popup
strip_id_or_class: widget__shares
strip_id_or_class: all-related-sections
strip_id_or_class: widget__headline
strip_id_or_class: time-to-read
strip_id_or_class: social-author
strip_id_or_class: social-date
strip_id_or_class: article__comments
strip_id_or_class: read_also_posts
strip_id_or_class: tags__item

strip: //i
strip: //div[@class='popular_widget']
strip: (//div[@class='non_member_follow'])[last()]

find_string: data-runner-src="
replace_string: data-src="

# embed YT video
find_string: data-runner-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
find_string: data-src="https://www.youtube.com/embed/
replace_string: src="https://www.youtube.com/embed/
replace_string( width="100%">):  

prune: no

test_url: https://spectrum.ieee.org/raspberry-pi-cluster-computer
test_url: https://spectrum.ieee.org/egypts-massive-18gw-benban-solar-park-nears-completion

Then clear the cache:

sudo -u www-data /path-to-wallabag/bin/console --env=prod cache:clear

Then try fetching your article again.
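
In case it helps to see what the find_string / replace_string pairs in the config above are meant to do, here is a rough Python sketch. It assumes the pairs are applied in order as plain string substitutions on the raw HTML before the XPath rules run; that is my understanding of how these site configs are processed, but please verify it against Graby itself. The sample HTML is made up, and the replace_string( width="100%">) line is left out because its replacement value is not visible in the pasted config.

```python
# The (find, replace) pairs from the site config, in the order they appear.
PAIRS = [
    ('data-runner-src="', 'data-src="'),
    ('data-runner-src="https://www.youtube.com/embed/',
     'src="https://www.youtube.com/embed/'),
    ('data-src="https://www.youtube.com/embed/',
     'src="https://www.youtube.com/embed/'),
]

def apply_pairs(html: str, pairs=PAIRS) -> str:
    """Apply each literal replacement in order; later pairs see earlier results."""
    for find, replace in pairs:
        html = html.replace(find, replace)
    return html

# Made-up sample markup, only for illustration.
sample = (
    '<img data-runner-src="https://example.org/a.jpg">'
    '<iframe data-runner-src="https://www.youtube.com/embed/abc"></iframe>'
)
print(apply_pairs(sample))
```

Under that assumption, the first pair already rewrites every data-runner-src attribute, so the second pair never matches and it is the third pair that turns the YouTube embed into a plain src attribute.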

Not after the content has been fetched. But we can try to write a site config so that only the relevant parts are captured. As I wrote, though: when I find the time.

Both