lychee: lychee shows network error: forbidden for valid links

Lychee has suddenly started reporting this error for links that are valid and accessible from a browser.

❯ lychee --max-concurrency 1 --no-progress --verbose "work/ok.txt"
✗ [403] https://catboost.ai/ | Network error: Forbidden
✗ [403] https://catboost.ai/en/docs/concepts/python-reference_datasets_msrank | Network error: Forbidden

Issues found in 1 input. Find details below.

[work/ok.txt]:
✗ [403] https://catboost.ai/ | Network error: Forbidden
✗ [403] https://catboost.ai/en/docs/concepts/python-reference_datasets_msrank | Network error: Forbidden

🔍 2 Total ✅ 0 OK 🚫 2 Errors (HTTP:2)

Contents of work/ok.txt

https://catboost.ai/
https://catboost.ai/en/docs/concepts/python-reference_datasets_msrank

Lychee version

❯ lychee --version
lychee 0.10.1

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 41 (27 by maintainers)

Most upvoted comments

Seems like there isn’t much upstream traction, and it’s not something we can fix on our side, so I’m gonna go ahead and close this. If the upstream issue gets fixed, we can reopen and integrate reqwest-impersonate. Apologies if this is not the outcome y’all were hoping for, but I think we need to find another way.

No updates, but if I find the time I will create a pull request to integrate reqwest-impersonate as a fallback backend. It will be an optional library feature, but it will be enabled by default in the binary. It’s a great match because I want to refactor the client code soon anyway. Thanks for the reminder.
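
The fallback idea described above could look roughly like the sketch below. It is not lychee's actual client code: the `Backend` trait, the `Fixed` stub, and `check_with_fallback` are hypothetical names, and the stub returns canned status codes instead of making real requests. The assumption is that the default backend is tried first and a 403 triggers a single retry with the second backend.

```rust
/// Minimal stand-in for an HTTP backend (hypothetical, for illustration only).
pub trait Backend {
    fn name(&self) -> &'static str;
    /// Returns the HTTP status code observed for `url`.
    fn status(&self, url: &str) -> u16;
}

/// Stub backend that always answers with a fixed status code.
/// A real backend would perform the request instead.
pub struct Fixed {
    pub name: &'static str,
    pub code: u16,
}

impl Backend for Fixed {
    fn name(&self) -> &'static str {
        self.name
    }

    fn status(&self, _url: &str) -> u16 {
        self.code
    }
}

/// Try `primary` first; if it reports 403 Forbidden, retry once with `fallback`.
/// Returns the name of the backend whose answer was used, plus the status code.
pub fn check_with_fallback(
    primary: &dyn Backend,
    fallback: &dyn Backend,
    url: &str,
) -> (&'static str, u16) {
    let code = primary.status(url);
    if code == 403 {
        return (fallback.name(), fallback.status(url));
    }
    (primary.name(), code)
}
```

With a primary stub that returns 403 and a fallback stub that returns 200, `check_with_fallback` reports the fallback's name and status; when the primary already succeeds, the fallback is never consulted.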

I won’t be able to test it for a while because I got covid (again).

On your thoughts,

  • I tried a different user-agent (the same one my browser sends) and the error persists.
  • We don’t have any access to catboost.ai since it isn’t ours, so I can’t tell whether anything has changed there.
  • I’ve just checked a GitHub Actions log from 2 months ago, and the error on these two links was already present there.

Bad news. I wanted to integrate this, but I don’t think it’s possible right now. reqwest-impersonate patches some dependencies (e.g. hyper) and therefore cannot be published on crates.io. If we integrate it into lychee, that means we couldn’t publish the library on crates.io either even if we put reqwest-impersonate behind a feature flag, which is disabled by default. See https://github.com/rust-lang/cargo/issues/6738. Is there a possibility that I don’t see right now?
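
For context, gating a backend behind a feature flag usually looks like the sketch below on the code side. The feature name `impersonate` and the function `selected_backend` are made up for illustration and are not part of lychee. Note that the gate only affects compilation; the manifest would still have to declare the patched dependency, which is the part that blocks publishing, as described above.

```rust
/// Name of the HTTP backend this build would use.
/// `impersonate` is a hypothetical Cargo feature name, not one lychee defines.
pub fn selected_backend() -> &'static str {
    // `cfg!` evaluates to a compile-time boolean: true only when the crate
    // is built with `--features impersonate`.
    if cfg!(feature = "impersonate") {
        "reqwest-impersonate"
    } else {
        "reqwest"
    }
}
```

Built without the feature, `selected_backend()` returns `"reqwest"`, so the default build never touches the optional backend.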

@mre

I don’t know the answer to the second question.

For the first one, I suggest testing it on other related issues where browsers can open a URL but curl and lychee cannot.

  • If those similar issues can be fixed as well, then it is definitely worth integrating despite the cost of an additional dependency.

Dang. It works. 😞

That means if we integrate that backend into lychee it would solve your issue. Two questions (@lebensterben)

  • do we want to do this?
  • why did it work on my machine before (with the normal backend)?

Okay, thanks. The second one should not have failed; that’s an error on my end. However, I do expect it to fail just like the first test with reqwest. At least the results have always been consistent between them on my end.

The last one is the most promising, but you need to install boringssl for it first.

I’ve added support for it to getcurl-test and it indeed works:

> cargo run --no-default-features --features curl -- 'https://catboost.ai'
get_url: https://catboost.ai
response: Response { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("catboost.ai")), port: None, path: "/", query: None, fragment: None }, status: 200, headers: {"content-length": "79366", "content-security-policy": "default-src 'none'; script-src 'unsafe-eval' 'unsafe-inline' 'nonce-06bmt7YDganaJlR0CEne6Q==' mc.yandex.ru social.yandex.ru yastatic.net; style-src 'unsafe-inline' mc.yandex.ru yastatic.net; img-src 'self' data: avatars.yandex.net avatars.mds.yandex.net avatars.mdst.yandex.net mc.yandex.ru ext.captcha.yandex.net yastatic.net; connect-src 'self' mc.yandex.ru; frame-src www.youtube.com video.yandex.ru player.video.yandex.net; media-src ext.captcha.yandex.net; font-src yastatic.net; report-uri https://csp.yandex.net/csp?from=promo-catboost-2017&yandex_login=undefined&yandexuid=undefined;", "content-type": "text/html; charset=utf-8", "date": "Sun, 23 Oct 2022 23:21:38 GMT", "x-content-type-options": "nosniff", "x-frame-options": "DENY", "x-xss-protection": "1; mode=block"} }
status: 200 OK

Tested locally and inside a GitHub Codespace. Can you both test it on your machines as well? Just clone the project and run the command above.

Even if it works, I really don’t know whether we should add reqwest-impersonate to the project. It might become a maintenance issue down the road, as it could diverge from reqwest and is maintained by a single (yet awesome) person.

@mre

𝛌> echo 'https://catboost.ai' | ./target/debug/lychee --user-agent 'curl/7.79.1' --headers 'Accept=*/*' -
Issues found in 1 input. Find details below.

[stdin]:
✗ [403] https://catboost.ai/ | Failed: Network error: Forbidden
❯ echo 'https://catboost.ai' | lychee --headers 'Accept=*/*' --user-agent 'curl/7.79.1' -
Issues found in 1 input. Find details below.

[stdin]:
✗ [403] https://catboost.ai/ | Network error: Forbidden

🔍 1 Total ✅ 0 OK 🚫 1 Error (HTTP:1)

With my curl version

❯ echo 'https://catboost.ai' | lychee --headers 'Accept=*/*' --user-agent 'curl/7.84.0' - 
Issues found in 1 input. Find details below.

[stdin]:
✗ [403] https://catboost.ai/ | Network error: Forbidden

🔍 1 Total ✅ 0 OK 🚫 1 Error (HTTP:1)
❯ curl --version
curl 7.84.0 (x86_64-pc-linux-gnu) libcurl/7.84.0 OpenSSL/1.1.1q zlib/1.2.12 brotli/1.0.9 zstd/1.5.2 libidn2/2.3.3 libpsl/0.21.1 (+libidn2/2.3.0) libssh2/1.10.0 nghttp2/1.48.0
Release-Date: 2022-06-27
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL threadsafe TLS-SRP UnixSockets zstd

I’ve created a new Git repository to run a mock test, and the result is negative.
Repository: https://github.com/Rizwan-Hasan/test-lychee-links
Logs: https://github.com/Rizwan-Hasan/test-lychee-links/runs/7826437363

Hi @mre ,

It originally happened on GitHub Actions. The same thing happened when I tried it on my local PC. I ran it several times on GitHub Actions too, with the same result.
Here’s the recent GitHub Actions log: https://github.com/Rizwan-Hasan/clearml-docs/runs/7825833483
And this one is from 3 days earlier: https://github.com/Rizwan-Hasan/clearml-docs/runs/7787255012

We need it to work on GitHub Actions.

Hmm, looks like it’s an issue on your end. 🤔 At least it works over here:

❯❯❯ lychee --max-concurrency 1 --no-progress --verbose "work/ok.txt"
✔ [200] https://catboost.ai/
✔ [200] https://catboost.ai/en/docs/concepts/python-reference_datasets_msrank

🔍 2 Total ✅ 2 OK 🚫 0 Errors

Can you try from a different network? Or maybe reconnect to your Wi-Fi? Maybe it was just a temporary hiccup?