lychee: Underscores in URLs sometimes break the (markdown?) link parser

The following URLs in markdown files in my repo are being parsed incorrectly by v0.7.0:

My URL | Is getting parsed as
--- | ---
https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats_-_Flickr_-_odako1.jpg | https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats
https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed_%28Gopaka%29%2C_holding_a_book%3B_Wellcome_V0018272_%28cropped%29.jpg | https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed
https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155_%2814971680537%29.jpg | https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155
https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh_-_Chiang_Mai.jpg | https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh
http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction_%20Robinson.pdf | http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction
http://www.karunabv.org/uploads/1/2/6/3/12630738/__applying_compassion_bbodhi_110312.mp3 | http://www.karunabv.org/uploads/1/2/6/3/12630738/
http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2_.pdf | http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2
http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality_.htm | http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

2. Use txt file extension: underscore_urls.txt will use our plaintext parser and work as expected

Just to follow up, I did this hack and indeed it worked as expected. So, for any future people having this issue, I added this step to my workflow:

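- name: Rename .md files to .txt
  run: |
    cd "$GITHUB_WORKSPACE"
    # Install the Perl-based rename utility
    sudo apt-get install -y rename
    # Rename every .md file to .txt so lychee uses its plaintext parser
    find . -name "*.md" -exec rename 's/\.md$/.txt/' {} +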

Then I ran lychee on **/*.txt instead of **/*.md.
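Where installing rename is inconvenient, the same workaround could be sketched in Python. This is only an illustration: the function name mirror_markdown_as_text and the choice to copy files into a separate directory (rather than rename them in place) are assumptions, not something from this thread.

```python
from pathlib import Path
import shutil


def mirror_markdown_as_text(src_root, dst_root):
    """Copy every .md file under src_root to dst_root with a .txt suffix.

    Hypothetical helper: it leaves the repository untouched and returns
    the list of created .txt paths, which can then be handed to lychee.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for md_file in sorted(src_root.rglob("*.md")):
        # Keep the relative layout, only swap the extension.
        target = (dst_root / md_file.relative_to(src_root)).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(md_file, target)
        copied.append(target)
    return copied
```

lychee would then be pointed at the .txt copies, as in the workflow step above.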

Mixed feelings.

This would surely allow fine-grained control over lychee’s behavior. On the other hand, users have to be aware of the underlying issue first and then know about the parameter, which can resolve it.

The entire parsing topic is a rabbit-hole. 🐰

Supporting different (Markdown) parsers sounds like a good alternative to me. We should at least consider one that supports GitHub Flavored Markdown, as that is quite a popular variant.

Why even bother trying to parse a <code> block?

Because nobody ever encountered that case and so there is no rule to exclude <code> blocks yet.

We could check the element type here and exclude codeblocks: https://github.com/lycheeverse/lychee/blob/1865f7a309cba21af1c2d7e98e2ccaad0810c120/lychee-lib/src/extract.rs#L104-L108
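As a rough illustration of what "check the element type and exclude code blocks" could look like, here is a toy Python sketch built on the standard library's html.parser. It is not lychee's Rust implementation; the class name LinkCollector, the helper extract_links, and the simple URL regex are all assumptions made for the example.

```python
from html.parser import HTMLParser
import re

# Deliberately simple pattern for bare URLs in text nodes (an assumption).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class LinkCollector(HTMLParser):
    """Collect links, ignoring anything nested inside <code> or <pre>."""

    SKIPPED = {"code", "pre"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only scan text that is not part of a code block.
        if self.skip_depth == 0:
            self.links.extend(URL_RE.findall(data))


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

With this sketch, a URL that appears only inside a code element is never reported, while URLs in ordinary text and in href attributes still are.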

you can always generate HTML first and then check the links.

This is easier for both of us.

we’d provide an optional feature to support gfm via https://lib.rs/crates/comrak

this is not always desirable, so it should be optional.

Thanks for reporting @buddhist-uni.

Underscores (_) have a special meaning in Markdown: they emphasize the enclosed text, so _hello_ is rendered as an italicized "hello". As a result, our Markdown parser emits separate events for the different sections of the URL:

example.org/_hello_ would emit MDEvent::Text("example.org") and MDEvent::Text("hello") as individual events; so by the time the text reaches the link parser, it is already split up.
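The truncation points in the report can be reproduced with a small Python model. It assumes that the text is split at the first underscore run that CommonMark's flanking rules allow to open or close emphasis, which is an inference from the reported URLs and not lychee's actual code. The names can_delimit and predicted_cut are hypothetical, and only ASCII punctuation is considered.

```python
import string

# ASCII punctuation is enough for the URLs in this issue (assumption);
# CommonMark itself also counts Unicode punctuation.
PUNCT = set(string.punctuation)


def _is_ws(ch):
    # An empty string stands for start/end of line, which counts as whitespace.
    return ch == "" or ch.isspace()


def _is_punct(ch):
    return ch in PUNCT


def can_delimit(before, after):
    """True if a '_' run between `before` and `after` may open or close emphasis."""
    left = not _is_ws(after) and (
        not _is_punct(after) or _is_ws(before) or _is_punct(before)
    )
    right = not _is_ws(before) and (
        not _is_punct(before) or _is_ws(after) or _is_punct(after)
    )
    can_open = left and (not right or _is_punct(before))
    can_close = right and (not left or _is_punct(after))
    return can_open or can_close


def predicted_cut(url):
    """Return the prefix of `url` before the first potential emphasis delimiter."""
    i = 0
    while i < len(url):
        if url[i] != "_":
            i += 1
            continue
        j = i
        while j < len(url) and url[j] == "_":
            j += 1  # consume the whole underscore run
        before = url[i - 1] if i > 0 else ""
        after = url[j] if j < len(url) else ""
        if can_delimit(before, after):
            return url[:i]
        i = j
    return url


print(predicted_cut("http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2_.pdf"))
# http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2
```

Under this model, underscores between two letters or digits (as in chanting-kbv__2) never split the URL, which would explain why only some underscores in the report cause truncation.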

There are a few ways to solve the issue:

  1. Use proper Markdown links. This should work: [example](example.org/_hello_).
  2. Use txt file extension: underscore_urls.txt will use our plaintext parser and work as expected.

We could also add a --fuzzy flag to lychee, which would always use the plaintext parser (no matter the file extension), but it sounds like a hack.