wallabag: Wrong display in wallabag (bloomberg.com)

⚠️ If your issue is about an error during fetching a link, please read: http://doc.wallabag.org/en/user/errors_during_fetching.html#how-can-i-help-to-fix-that

Issue details

If I try to add an article from bloomberg.com to wallabag (for example https://www.bloomberg.com/news/features/2018-07-18/japan-s-lonely-death-industry ), the result in wallabag is an entry with the title “Terms of Service Violation” and the following content:

Your usage has been flagged as a violation of our terms of service.

For inquiries related to this message please contact support. For sales inquiries, please visit http://www.bloomberg.com/professional/request-demo

If you believe this to be in error, please confirm below that you are not a robot by clicking “I’m not a robot” below.

Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review the Terms of Service and Cookie Policy.

Block reference ID:

Environment

wallabag version (or git revision) that exhibits the issue: 2.3.2
How did you install wallabag? Via git clone or by downloading the package? Via cPanel Softaculous
Last wallabag version that did not exhibit the issue (if applicable):
php version: 7.1
OS:
type of hosting (shared or dedicated): shared
which storage system you choose at install (SQLite, MySQL/MariaDB or PostgreSQL): 10.0.34-MariaDB-cll-lve

Steps to reproduce/test case

Just open Wallabag, click on the + to add an article and enter https://www.bloomberg.com/news/features/2018-07-18/japan-s-lonely-death-industry

P.S.: http://f43.me/feed/test and http://siteconfig.fivefilters.org/ cannot fetch the content of the original article either.

About this issue

State: closed
Created 6 years ago
Reactions: 2
Comments: 38 (19 by maintainers)

Most upvoted comments

I ran into this too. I’m not sure how they detect it, but it seems to go beyond IP address and user agent. For now, you can usually find the exact same articles on bloombergquint.com, which doesn’t appear to have that kind of detection.

EDIT: another trick seems to be that when you get that block page, if you can open the page in the browser and you have the same public IP as your server, you can accept the Captcha, and then add this config to /wallabag/vendor/j0k3r/graby-site-config:

http_header(user-agent): Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0

~~I’m guessing any other valid user agent would work too~~. The user agent probably needs to be the same as what you use to accept the Captcha. Not sure of a way to change the config that works for everyone, though.
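As a rough illustration of the idea above (send a fixed user agent, then check whether the block page came back), here is a minimal Python sketch. It is not graby code: the helper names `build_request` and `looks_blocked` are hypothetical, and the only detection signal it uses is the “Terms of Service Violation” title quoted in the issue.

```python
import urllib.request

# Same user agent string as in the site config line above.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"

# Title of the block page as reported in the issue description.
BLOCK_TITLE = "Terms of Service Violation"


def build_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a request that carries the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_blocked(html):
    """Return True if the fetched HTML contains the block page title."""
    return BLOCK_TITLE.lower() in html.lower()
```

Passing the result of `build_request(url)` to `urllib.request.urlopen` would perform the actual fetch. Whether the server accepts it still depends on the Captcha having been accepted from the same public IP, as described above.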

4oo4 on Jul 29, 2018

I don’t know if this is a universal thing, but Bloomberg Quint now seems to only show the first few sentences of the article and then directs the reader to the main Bloomberg site. Accordingly, the trick of rewriting bloomberg.com to bloombergquint.com no longer works. Are others also finding this to be the case, and if so, is there another workaround?

syclops on Apr 29, 2019

On 2018-11-30 02:08:58, Kevin Decherf wrote:

I think we should give a try to headless browsers to handle dynamic rendering like this.

That.

More broadly, I think wallabag should be more careful about the raw content it fetches when it crawls articles. It should use WARC files as they allow keeping all resources in a single file, including HTTP headers, for a more faithful playback. Using a headless browser would also allow PDF and PNG snapshots to be taken that reflect exactly the page content.

This would make Wallabag double as a webpage archival system, which I would find very, very useful in my workflow. As things stand now, I have 8000 links in Wallabag, but statistically, over 90% of those are dead links now and Wallabag can’t really help in restoring that content, as the saved data is the “filtered” version, which often fails on sites like this… Having the original copy available would allow Wallabag to re-render the site even if the original content is gone, so that we could retroactively fix those issues correctly. 😃
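To make the WARC suggestion a bit more concrete, here is a minimal Python sketch of how one raw HTTP response (status line, headers and body together) could be wrapped in a WARC/1.0 “response” record. This is an assumption-laden illustration, not something wallabag does today; the function name `build_warc_response_record` is hypothetical, and a real archiver would more likely use a dedicated WARC library and also store the request and embedded resources.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri, http_response):
    """Wrap raw HTTP response bytes in a single WARC/1.0 'response' record."""
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts only the record block (the HTTP response).
        f"Content-Length: {len(http_response)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLF pairs.
    return header + http_response + b"\r\n\r\n"
```

Because the HTTP headers are stored next to the body, a later re-render can start from the original response instead of the filtered article content.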

anarcat on Nov 30, 2018

A workaround is available through wallabagger, see https://github.com/wallabag/wallabagger/releases/tag/v1.14.0

I’m closing this issue

Kdecherf on Mar 24, 2022

I’m also getting “wallabag can't retrieve contents for this article. Please troubleshoot this issue.” on https://www.bloomberg.com/opinion/articles/2021-02-19/gamestop-hearing-featured-no-cats . The article is fully readable in the browser, I think.

hrehfeld on Feb 20, 2021

Yeah sorry I closed too fast. The issue isn’t fixed but @techexo & @biva found a way to fix it.

j0k3r on Nov 29, 2018

Oops, I forgot a part of it! Sorry 🙏 @biva, could you close the issue if it’s good for you?

techexo on Nov 28, 2018

@techexo Thank you! Just for the record, the full line to add in /var/www/html/wallabag/vendor/j0k3r/graby/src/Extractor/HttpClient.php is: 'www.bloomberg.com' => ['www.bloomberg.com' => 'www.bloombergquint.com'],
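The effect of a host-level mapping like the one above can be sketched in Python (graby itself is PHP, so this is only an illustration; `HOST_REWRITES` and `rewrite_host` are hypothetical names). Since only the host name is swapped and the path is left untouched, the same rule applies to articles from any date.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping mirroring the host pair from the line above.
HOST_REWRITES = {"www.bloomberg.com": "www.bloombergquint.com"}


def rewrite_host(url, rewrites=HOST_REWRITES):
    """Replace the host of `url` if it is listed in `rewrites`; keep everything else."""
    parts = urlsplit(url)
    new_host = rewrites.get(parts.netloc)
    if new_host is None:
        return url  # Unknown host: leave the URL unchanged.
    return urlunsplit(parts._replace(netloc=new_host))
```

For example, `rewrite_host("https://www.bloomberg.com/news/articles/2018-10-16/some-article")` yields the same path on www.bloombergquint.com. Whether that URL actually resolves to the full article depends on the target site.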

biva on Nov 28, 2018

Thanks for the tip with bloombergquint.com. Would it be possible to use an intelligent http redirect in /var/www/html/wallabag/vendor/j0k3r/graby/src/Extractor/HttpClient.php?

I tried: 'bloomberg.com/news/articles/2018-10-16' => ['bloombergquint.com/business'], but it doesn’t work. And above all, the redirection should work for any date (not only 2018-10-16).

Any idea?

biva on Oct 29, 2018

Another instance: https://www.bloomberg.com/news/features/2018-10-04/the-big-hack-how-china-used-a-tiny-chip-to-infiltrate-america-s-top-companies

archive.org and archive.is also both fail to archive the contents.

anarcat on Oct 4, 2018