lychee: Underscores in URLs sometimes break the (markdown?) link parser
The following URLs in markdown files in my repo are being parsed incorrectly by v0.7.0:
| My URL | Is getting parsed as |
|---|---|
| https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats_-_Flickr_-_odako1.jpg | https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats |
| https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed_%28Gopaka%29%2C_holding_a_book%3B_Wellcome_V0018272_%28cropped%29.jpg | https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed |
| https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155_%2814971680537%29.jpg | https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155 |
| https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh_-_Chiang_Mai.jpg | https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh |
| http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction_%20Robinson.pdf | http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction |
| http://www.karunabv.org/uploads/1/2/6/3/12630738/__applying_compassion_bbodhi_110312.mp3 | http://www.karunabv.org/uploads/1/2/6/3/12630738/ |
| http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2_.pdf | http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2 |
| http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality_.htm | http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality |
About this issue
- State: closed
- Created 3 years ago
- Comments: 18 (10 by maintainers)
Just to follow up, I did this hack and indeed it worked as expected. So, for any future people having this issue, I added this step to my workflow:
then ran lychee on `**/*.txt` instead of `.md`.
Mixed feelings.
This would surely allow fine-grained control over lychee’s behavior. On the other hand, users have to be aware of the underlying issue first and then know about the parameter that resolves it.
The entire parsing topic is a rabbit-hole. 🐰
Supporting different (Markdown) parsers sounds like a good alternative to me. We should at least consider one that supports GitHub Flavored Markdown, as that is quite a popular variant.
Because nobody ever encountered that case, there is no rule to exclude `<code>` blocks yet.
We could check the element type here and exclude code blocks: https://github.com/lycheeverse/lychee/blob/1865f7a309cba21af1c2d7e98e2ccaad0810c120/lychee-lib/src/extract.rs#L104-L108
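To make the "check the element type" idea concrete, here is a minimal sketch. It assumes a simplified document tree: the `Node` type and the `collect_link_text` function are hypothetical names made up for illustration, not the types or code at the linked lines in `extract.rs`. The walker simply stops descending when it reaches a `code` element, so text inside it never reaches the link extractor.

```rust
/// Hypothetical, simplified document node (an assumption, not lychee's actual types).
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Collects the text that should be scanned for links,
/// skipping everything nested inside a `<code>` element.
fn collect_link_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Element { name, children } => {
            // Check the element type: do not descend into code elements.
            if name.as_str() == "code" {
                return;
            }
            for child in children {
                collect_link_text(child, out);
            }
        }
        Node::Text(text) => out.push(text.clone()),
    }
}

fn main() {
    let doc = Node::Element {
        name: "p".to_string(),
        children: vec![
            Node::Text("docs: https://example.org/guide".to_string()),
            Node::Element {
                name: "code".to_string(),
                children: vec![Node::Text("curl https://example.invalid/api".to_string())],
            },
        ],
    };

    let mut texts = Vec::new();
    collect_link_text(&doc, &mut texts);
    // Only the text outside the code element is left for link checking.
    println!("{:?}", texts);
}
```

Whether inline code and fenced code blocks should be treated the same way is a separate decision; the sketch skips both because it only looks at the element name.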
You can always generate HTML first and then check the links.
This is easier for both of us.
We’d provide an optional feature to support GFM via https://lib.rs/crates/comrak
This is not always desirable, so it should be optional.
Thanks for reporting @buddhist-uni.
The underscores (`_`) have a special meaning in Markdown and emphasize the enclosed text, so `_hello_` becomes *hello*. As a result, our Markdown parser emits separate events for the different sections of the URL: `example.org/_hello_` would emit `MDEvent::Text("example.org")` and `MDEvent::Text("hello")` as individual events, so by the time the text reaches the link parser, it is already split up.

There are a few ways to solve the issue:

- Write the URL as an explicit Markdown link: `[example](example.org/_hello_)`
- Use a `.txt` file extension: `underscore_urls.txt` will use our plaintext parser and work as expected.

We could also add a `--fuzzy` flag to lychee, which would always use the plaintext parser (no matter the file extension), but it sounds like a hack.
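
The difference between the two code paths can be sketched roughly as follows. This is a toy model, not lychee's implementation: `MdEvent`, `find_links`, `links_from_events`, and `links_from_plaintext` are hypothetical names, the events are written by hand to mimic the split described above, and the "link finder" just picks whitespace-separated tokens that start with `http://` or `https://`.

```rust
/// Hypothetical stand-in for the Markdown parser's text events
/// (named after the `MDEvent::Text` mentioned above; not lychee's real type).
#[derive(Debug, Clone, PartialEq)]
enum MdEvent {
    Text(String),
}

/// Toy link finder: returns whitespace-separated tokens that look like URLs.
fn find_links(text: &str) -> Vec<String> {
    text.split_whitespace()
        .filter(|token| token.starts_with("http://") || token.starts_with("https://"))
        .map(str::to_string)
        .collect()
}

/// Markdown path: every text event is searched on its own,
/// so a URL that was split across events comes out truncated.
fn links_from_events(events: &[MdEvent]) -> Vec<String> {
    events
        .iter()
        .flat_map(|event| match event {
            MdEvent::Text(text) => find_links(text),
        })
        .collect()
}

/// Plaintext path: the raw input is searched as a single string.
fn links_from_plaintext(input: &str) -> Vec<String> {
    find_links(input)
}

fn main() {
    let raw = "see https://example.org/_hello_ for details";

    // Hand-written events (an assumption) mimicking how emphasis splits the text.
    let events = vec![
        MdEvent::Text("see https://example.org/".to_string()),
        MdEvent::Text("hello".to_string()),
        MdEvent::Text(" for details".to_string()),
    ];

    println!("markdown:  {:?}", links_from_events(&events));
    println!("plaintext: {:?}", links_from_plaintext(raw));
}
```

In this model the Markdown path only ever sees the prefix of the URL, while the plaintext path sees it whole, which is why switching to a `.txt` extension works around the problem.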