wallabag: Wrong display in wallabag (bloomberg.com)

⚠️ If your issue is about an error during fetching a link, please read: http://doc.wallabag.org/en/user/errors_during_fetching.html#how-can-i-help-to-fix-that

Issue details

If I try to add an article from bloomberg.com to wallabag (for example https://www.bloomberg.com/news/features/2018-07-18/japan-s-lonely-death-industry ), the result in wallabag is an entry with the title “Terms of Service Violation” and the following content:

Your usage has been flagged as a violation of our terms of service.

For inquiries related to this message please contact support. For sales inquiries, please visit http://www.bloomberg.com/professional/request-demo

If you believe this to be in error, please confirm below that you are not a robot by clicking “I’m not a robot” below.

Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review the Terms of Service and Cookie Policy.

Block reference ID:

Environment

  • wallabag version (or git revision) that exhibits the issue: 2.3.2
  • How did you install wallabag? Via git clone or by downloading the package? via cpanel softaculous
  • Last wallabag version that did not exhibit the issue (if applicable):
  • php version: 7.1
  • OS:
  • type of hosting (shared or dedicated): shared
  • which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL): 10.0.34-MariaDB-cll-lve

Steps to reproduce/test case

Just open Wallabag, click on the + to add an article and enter https://www.bloomberg.com/news/features/2018-07-18/japan-s-lonely-death-industry

P.S.: http://f43.me/feed/test and http://siteconfig.fivefilters.org/ cannot fetch the content of the original article either.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 38 (19 by maintainers)

Most upvoted comments

I ran into this too. I’m not sure how they detect it, but it seems to go beyond IP address and user agent. For now, you can usually find the exact same articles on bloombergquint.com, which doesn’t appear to have that kind of detection.

EDIT: another trick seems to work when you get that block page. If you can open the page in a browser that has the same public IP as your server, you can accept the Captcha there, and then add this config to /wallabag/vendor/j0k3r/graby-site-config:

http_header(user-agent): Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0

I’m guessing any other valid user agent would work too. The user agent probably needs to be the same as what you use to accept the Captcha. Not sure of a way to change the config that works for everyone, though.
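To illustrate what the header line above amounts to, here is a minimal Python sketch that builds a request carrying the same user agent. It is not how graby applies its site config; `build_request` is a hypothetical helper, and the request is only constructed, never sent.

```python
from urllib.request import Request

# User agent copied from the config line above. Per this thread, it probably
# needs to match the browser that accepted the Captcha (an assumption).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"


def build_request(url: str) -> Request:
    """Build (but do not send) a GET request with the fixed User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request(
    "https://www.bloomberg.com/news/features/2018-07-18/japan-s-lonely-death-industry"
)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Sending the request is left out on purpose, since whether the site accepts it depends on the Captcha having been accepted from the same IP.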

I don’t know if this is a universal thing, but Bloomberg Quint now seems to only show the first few sentences of the article and then directs the reader to the main Bloomberg site. Accordingly, the trick of rewriting bloomberg.com to bloombergquint.com no longer works. Are others also finding this to be the case, and if so, is there another workaround?

On 2018-11-30 02:08:58, Kevin Decherf wrote:

I think we should give a try to headless browsers to handle dynamic rendering like this.

That.

More broadly, I think wallabag should be more careful about the raw content it fetches when it crawls articles. It should use WARC files as they allow keeping all resources in a single file, including HTTP headers, for a more faithful playback. Using a headless browser would also allow PDF and PNG snapshots to be taken that reflect exactly the page content.

This would make Wallabag double as a webpage archival system, which I would find very, very useful in my workflow. As things stand now, I have 8000 links in Wallabag, but statistically, over 90% of those are dead links now, and Wallabag can’t really help in restoring that content, as the saved data is the “filtered” version, which often fails on sites like this… Having the original copy available would allow Wallabag to re-render the site even if the original content is gone, so that we could retroactively fix those issues correctly. 😃

A workaround is available through wallabagger, see https://github.com/wallabag/wallabagger/releases/tag/v1.14.0

I’m closing this issue

I’m also getting “wallabag can't retrieve contents for this article. Please troubleshoot this issue.” on https://www.bloomberg.com/opinion/articles/2021-02-19/gamestop-hearing-featured-no-cats . The article is fully readable in a browser, I think.

Yeah, sorry, I closed too fast. The issue isn’t fixed, but @techexo & @biva found a way to work around it.

Oops, I forgot a part of it! Sorry 🙏 @biva, could you close the issue if it’s good for you?

@techexo Thank you! Just for the record, the full line to add in /var/www/html/wallabag/vendor/j0k3r/graby/src/Extractor/HttpClient.php is: 'www.bloomberg.com' => ['www.bloomberg.com' => 'www.bloombergquint.com'],

Thanks for the tip about bloombergquint.com. Would it be possible to use an intelligent HTTP redirect in /var/www/html/wallabag/vendor/j0k3r/graby/src/Extractor/HttpClient.php?

I tried: 'bloomberg.com/news/articles/2018-10-16' => ['bloombergquint.com/business'], but it doesn’t work. And above all, the redirection should work for any date (not only 2018-10-16).

Any idea?
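One way to read the question above: rewrite only the host and leave the path untouched, so the rule does not depend on the date in the URL. The Python sketch below is illustrative only. It is not graby code, `HOST_REWRITES` and `rewrite_host` are hypothetical names, and it does not check whether the rewritten URL actually exists on bloombergquint.com, whose paths may differ from bloomberg.com.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rewrite table using the host pair mentioned in this thread.
HOST_REWRITES = {"www.bloomberg.com": "www.bloombergquint.com"}


def rewrite_host(url: str) -> str:
    """Swap the host if it is in the table; keep scheme, path and query."""
    parts = urlsplit(url)
    new_host = HOST_REWRITES.get(parts.netloc.lower())
    if new_host is None:
        return url  # unknown host: leave the URL unchanged
    return urlunsplit(parts._replace(netloc=new_host))


# Works for any date in the path because only the host is touched.
print(rewrite_host("https://www.bloomberg.com/news/articles/2018-10-16/some-article"))
```

Matching on the host alone keeps the rule independent of the article date, but it cannot map one site's section paths onto the other's.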