lychee: Lychee don't panic, invalid URL's are fine.

From the pleasant discussion with @mre here https://github.com/lycheeverse/lychee-action/issues/82, I intend to work on trying to parse the input URL before passing it to reqwest in order to prevent lychee from panicking, skip the invalid URL and move on to the next one.

Bonus:

If we can track these invalidated URL’s and show them to users at the end so that they are able to fix or ignore them for future runs

Looking for feedback and @vipulgupta2048 will work on this soon (The email will remind me to work on this)

About this issue

Original URL
State: closed
Created 2 years ago
Reactions: 1
Comments: 25 (17 by maintainers)

Commits related to this issue

Gracefully handle invalid URIs With the upgrade to `reqwest` 0.12, we can finally handle a long-standing issue, when Urls could not be parsed to Uris. Previously, we would panic, but we can now handl... — committed to lycheeverse/lychee by mre 2 months ago
Gracefully handle invalid URIs (#1414) With the upgrade to `reqwest` 0.12, we can finally handle a long-standing issue, when Urls could not be parsed to Uris. Previously, we would panic, but we can... — committed to lycheeverse/lychee by mre 2 months ago

Most upvoted comments

@vipulgupta2048, no worries and thanks for looking into it.

Originally I thought this was possible after the discussion with @lebensterben above:

 let result = panic::catch_unwind(|| self.reqwest_client.execute(request));

let fut = match result {
    Ok(s) => s,
    Err(_) => return Status::Error(ErrorKind::Panic),
};

match fut.await {
    Ok(ref response) => Status::new(response, self.accepted.clone()),
    Err(e) => e.into(),
}

but I’m getting an error:

   --> lychee-lib/src/client.rs:482:22
    |
482 |         let result = panic::catch_unwind(|| self.reqwest_client.execute(request));
    |                      ^^^^^^^^^^^^^^^^^^^ `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
    |

So I searched around a bit and this was recommended:

match self.reqwest_client.execute(request).catch_unwind().await? {
    Ok(ref response) => Status::new(response, self.accepted.clone()),
    Err(e) => e.into(),
}

It polls the future in the background. But I’m getting another error:

error[E0277]: the type `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
   --> lychee-lib/src/client.rs:482:22
    |
482 |         let result = panic::catch_unwind(|| self.reqwest_client.execute(request));
    |                      ^^^^^^^^^^^^^^^^^^^ `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary

There is comment in the original thread:

This works when may_panic, as shown in the example is a future created in this module. However it does not work if the code in may_panic waits on another future which was returned by another method where futures:FutureExt was not brought into scope. In that case, catch_unwind complains that the future is not transferrable across a unwind boundary because the future returned by the async function itself waits on that other future which is not transferrable. Any ideas what can be done to resolve this?

(emphasis mine)

No clue how to resolve this. 😕

mre on Mar 18, 2022

@mre Exactly. We should stop ignoring any missing panic docs or error docs in lychee-lib before it gets to 1.0.

lebensterben on Mar 22, 2022

Well, the other alternative is to parse the URIs like reqwest would do and return an error before passing it to reqwest. That would be a waste of resources and be useless for most URLs, but at least it wouldn’t panic. That was our plan in the beginning.

mre on Mar 18, 2022

At this moment we need to wait for the upstream to merge the fix or use a patched dependency.

lebensterben on Mar 18, 2022

I will give it a shot soon, haven’t checked it out yet.

vipulgupta2048 on Mar 7, 2022

Closing. Feel free to reopen if the problem still exists.

mre on Nov 5, 2022

Here’s an example of how this is panicking with an Archive.org URL today:

thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(InvalidUriChar)', /cargo/registry/src/github.com-1ecc6299db9ec823/reqwest-0.11.8/src/into_url.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Cannot send request: SendError(Ok(Request { uri: Uri { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("web.archive.org")), port: None, path: "/web/20070312014327/https://developer.yahoo.com/weather/", query: None, fragment: None } }, source: FsPath("./dennisdel/content/blog/using-the-google-weather-api-pros-and-cons-so-far.md"), element: Some("a"), attribute: Some("href") }))', lychee-bin/src/commands/check.rs:85:14
+ exit_code=101

dend on Mar 22, 2022

Yeah they look fine. Maybe the issue is with a different link that was sent before. It panics and then Tokio closes the channel and after that it tries to send the legit links above but the channel is already closed.

mre on Mar 19, 2022

ACK - thank you so much, @mre and @lebensterben. There are hundreds of URLs that panic for me (mostly ones from Archive.org that are of the form https://web.archive.org/some_stub/https://my_archived_url), so I’ll wait for the upstream change to be merged for now. Thank you for all your hard work and help!

dend on Mar 18, 2022

Is there a temporary workaround one can apply?

@dend, the only workaround we have at the moment is to exclude the panicking URLs. 😞

mre on Mar 18, 2022