lychee: Underscores in URLs sometimes break the (markdown?) link parser

The following URLs in markdown files in my repo are being parsed incorrectly by v0.7.0:

| My URL | Is getting parsed as |
| --- | --- |
| `https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats_-_Flickr_-_odako1.jpg` | `https://upload.wikimedia.org/wikipedia/commons/6/6d/Arahats` |
| `https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed_%28Gopaka%29%2C_holding_a_book%3B_Wellcome_V0018272_%28cropped%29.jpg` | `https://upload.wikimedia.org/wikipedia/commons/e/ec/%28Above%29_sBed-byed` |
| `https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155_%2814971680537%29.jpg` | `https://upload.wikimedia.org/wikipedia/commons/5/56/P5290155` |
| `https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh_-_Chiang_Mai.jpg` | `https://upload.wikimedia.org/wikipedia/commons/5/54/Monks_in_Wat_Phra_Singh` |
| `http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction_%20Robinson.pdf` | `http://www.ahandfulofleaves.org/documents/The%20Buddhist%20Religion%20A%20Historical%20Introduction` |
| `http://www.karunabv.org/uploads/1/2/6/3/12630738/__applying_compassion_bbodhi_110312.mp3` | `http://www.karunabv.org/uploads/1/2/6/3/12630738/` |
| `http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2_.pdf` | `http://www.karunabv.org/uploads/1/2/6/3/12630738/chanting-kbv__2` |
| `http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality_.htm` | `http://home.ubalt.edu/ntygfit/ai_01_pursuing_fame/ai_01_tell/jd_morality` |

About this issue

State: closed
Created 3 years ago
Comments: 18 (10 by maintainers)

Most upvoted comments

2. Use txt file extension: underscore_urls.txt will use our plaintext parser and work as expected

Just to follow up, I did this hack and indeed it worked as expected. So, for any future people having this issue, I added this step to my workflow:

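- name: Rename .md files to .txt
  run: |
    cd "$GITHUB_WORKSPACE"
    # Install the Perl-based rename utility
    sudo apt-get install -y rename
    # Rename every .md file to .txt so that lychee uses its plaintext parser
    find . -name "*.md" -exec rename 's/\.md$/.txt/' {} +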

Then I ran lychee on `**/*.txt` instead of `**/*.md`.
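For anyone who cannot install `rename` on their runner, the same renaming step can be sketched in Python with only the standard library. This is an assumption-laden equivalent, not something lychee ships: the function name `rename_markdown_to_txt` is made up for illustration, and it refuses to overwrite an existing `.txt` file.

```python
from pathlib import Path


def rename_markdown_to_txt(root):
    """Rename every *.md file under `root` to *.txt and return the new paths.

    Hypothetical helper for illustration; not part of lychee.
    """
    renamed = []
    # sorted() materializes the listing first, so renaming does not disturb the walk
    for path in sorted(Path(root).rglob("*.md")):
        if not path.is_file():
            continue
        target = path.with_suffix(".txt")
        if target.exists():
            # Avoid silently clobbering a file that already uses the new name
            raise FileExistsError(f"refusing to overwrite {target}")
        path.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    for new_path in rename_markdown_to_txt("."):
        print(new_path)
```

Like the shell version, this renames files in place, so it is best run on a CI checkout rather than a working copy you care about.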

khemarato on May 27, 2021

Mixed feelings.

This would surely allow fine-grained control over lychee’s behavior. On the other hand, users have to be aware of the underlying issue first and then know about the parameter, which can resolve it.

The entire parsing topic is a rabbit-hole. 🐰

Supporting different (Markdown) parsers sounds like a good alternative to me. We should at least consider one that supports GitHub Flavored Markdown, as that is quite a popular variant.

mre on May 26, 2021

Why even bother trying to parse a <code> block?

Because nobody ever encountered that case and so there is no rule to exclude <code> blocks yet.

We could check the element type here and exclude codeblocks: https://github.com/lycheeverse/lychee/blob/1865f7a309cba21af1c2d7e98e2ccaad0810c120/lychee-lib/src/extract.rs#L104-L108
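To make the idea concrete, here is a rough Python sketch of "check the element type and skip code blocks". It is not the Rust code behind the link above; it uses Python's standard `html.parser` as a stand-in, and the class name `LinkExtractor`, the helper `extract_links`, and the simple URL regex are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Deliberately simple URL pattern; a real link finder is more careful.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


class LinkExtractor(HTMLParser):
    """Collect links, ignoring anything nested inside <code> or <pre>."""

    SKIPPED = {"code", "pre"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only scan plain text for URLs when we are outside code elements
        if self.skip_depth == 0:
            self.links.extend(URL_RE.findall(data))


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

With this sketch, a URL that appears only inside `<code>...</code>` is never reported, while plain-text URLs and `href` values outside code elements still are.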

mre on May 26, 2021

you can always generate HTML first and then check the links.

This is easier for both of us.

lebensterben on May 25, 2021

we’d provide an optional feature to support gfm via https://lib.rs/crates/comrak

this is not always desirable, so it should be optional.

lebensterben on May 25, 2021

Thanks for reporting @buddhist-uni.

The underscores (`_`) have a special meaning in Markdown and emphasize the enclosed text, so `_hello_` becomes *hello*. As a result, our Markdown parser emits separate events for the different sections of the URL:

`example.org/_hello_` would emit `MDEvent::Text("example.org")` and `MDEvent::Text("hello")` as individual events, so by the time the text reaches the link parser, it is already split up.
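The effect can be reproduced with a toy Python sketch. It does not run a Markdown parser: the `events` list is a hand-written assumption about how the text might be split, the `https://` scheme is added so a simple regex can recognize the URL, and the function names are invented for illustration.

```python
import re

# Deliberately simple URL pattern; a real link finder is more careful.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def links_per_event(text_events):
    """Scan each text fragment separately, as happens after the split."""
    found = []
    for fragment in text_events:
        found.extend(URL_RE.findall(fragment))
    return found


def links_plaintext(document):
    """Scan the raw text in one piece, as a plaintext parser would."""
    return URL_RE.findall(document)


# Assumed split: the emphasis markers are consumed and the text is fragmented.
events = ["See https://example.org/", "hello", " for details"]
document = "See https://example.org/_hello_ for details"

print(links_per_event(events))    # only the truncated URL is found
print(links_plaintext(document))  # the full URL survives
```

Scanning fragment by fragment can only ever see the part of the URL before the first split, which matches the truncated URLs in the report.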

There are a few ways to solve the issue:

1. Use proper Markdown links. This should work: `[example](example.org/_hello_)`.
2. Use txt file extension: `underscore_urls.txt` will use our plaintext parser and work as expected.

We could also add a `--fuzzy` flag to lychee, which would always use the plaintext parser (no matter the file extension), but it sounds like a hack.
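As a rough illustration of what such a flag would mean, here is a small Python sketch of parser selection by file extension with a plaintext override. Everything in it is hypothetical: `pick_parser`, the `fuzzy` parameter, and the extension lists are invented for this example and do not describe lychee's actual code or options.

```python
from pathlib import Path

# Hypothetical extension tables, for illustration only.
MARKDOWN_SUFFIXES = {".md", ".markdown"}
HTML_SUFFIXES = {".html", ".htm"}


def pick_parser(path, fuzzy=False):
    """Return the name of the parser to use for `path`.

    With fuzzy=True the file extension is ignored and the plaintext
    parser is always chosen, which is the behavior proposed above.
    """
    if fuzzy:
        return "plaintext"
    suffix = Path(path).suffix.lower()
    if suffix in MARKDOWN_SUFFIXES:
        return "markdown"
    if suffix in HTML_SUFFIXES:
        return "html"
    return "plaintext"
```

This also shows why the `.txt` rename workaround works: any extension that is not recognized as Markdown or HTML falls through to the plaintext branch.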

mre on May 24, 2021