rust-url: URL crate is failing to parse these existing URLs
Hi there, I have these five URLs that exist but can’t be parsed by the url crate, which reports an invalid domain name. This issue is similar to #483, which is about subdomains ending with a - character, but these ones use punycode.
use url::Url;

fn main() {
    let urls = vec![
        "http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php",
        "http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php",
    ];
    // Try to parse each URL and report the outcome.
    for url in urls {
        match url.parse::<Url>() {
            Ok(_) => println!("Parsing successful: {}", url),
            Err(e) => println!("Parsing failed: {}, {}", e, url),
        }
    }
}
Parsing failed: invalid international domain name, http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php
Parsing failed: invalid international domain name, http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php
All of them ping and respond.
http get http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php
HTTP/1.1 200 OK
cache-control: max-age=0, private, must-revalidate
connection: close
content-length: 384
content-type: text/html; charset=utf-8
date: Fri, 01 Mar 2019 15:12:05 GMT
server: nginx
set-cookie: sid=603521ac-3c34-11e9-b489-941a3615a412; path=/; domain=xn----9mcjf9b4dbm09f.com; HttpOnly
<html><head><title>Loading...</title></head><body><script type='text/javascript'>window.location.replace('http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php?js=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqcyI6MX0.fADWc9hUOlh58R9UzufQBROmie3I7c7vE835oE6YmU4&uuid=603521ac-3c34-11e9-b489-941a3615a412');</script></body></html>
About this issue
- State: open
- Created 5 years ago
- Comments: 15 (5 by maintainers)
These URLs are reported as invalid because they violate the rules for “Right-to-Left Scripts for Internationalized Domain Names” that are specified in RFC-5893 (https://tools.ietf.org/html/rfc5893#section-2). In particular, they violate rule #1 in section 2, which says that a label in a Bidi domain name must not start with an ASCII digit (Bidi property EN).
So, in the URL “http://mail.163.com.روغن-کنجد.com/iloystgnjfrgthteawvo/indexx.php”, the domain name is a Bidi domain name and it contains the label “163”, which violates rule #1, so the URL is reported as invalid.
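To see why this counts as a Bidi domain name, it helps to decode the xn-- label by hand. The following is a minimal sketch of a Punycode decoder following RFC 3492, using only the standard library. It is not the url crate's implementation: the names `punycode_decode` and `adapt` are made up for illustration, and it skips the extra validation a real IDNA implementation performs.

```rust
// Punycode parameters from RFC 3492, section 5.
const BASE: u32 = 36;
const T_MIN: u32 = 1;
const T_MAX: u32 = 26;
const SKEW: u32 = 38;
const DAMP: u32 = 700;
const INITIAL_BIAS: u32 = 72;
const INITIAL_N: u32 = 128;

/// Bias adaptation function (RFC 3492, section 6.1).
fn adapt(mut delta: u32, num_points: u32, first_time: bool) -> u32 {
    delta /= if first_time { DAMP } else { 2 };
    delta += delta / num_points;
    let mut k = 0;
    while delta > ((BASE - T_MIN) * T_MAX) / 2 {
        delta /= BASE - T_MIN;
        k += BASE;
    }
    k + (BASE * delta) / (delta + SKEW)
}

/// Decodes the part of an ACE label that follows the "xn--" prefix.
/// Hypothetical helper; returns None on malformed input or overflow.
fn punycode_decode(input: &str) -> Option<String> {
    if !input.is_ascii() {
        return None;
    }
    // Everything before the last '-' is copied literally.
    let (basic, encoded) = match input.rfind('-') {
        Some(pos) => (&input[..pos], &input[pos + 1..]),
        None => ("", input),
    };
    let mut output: Vec<char> = basic.chars().collect();
    let (mut n, mut i, mut bias) = (INITIAL_N, 0u32, INITIAL_BIAS);
    let mut digits = encoded.bytes().peekable();

    while digits.peek().is_some() {
        let old_i = i;
        let mut w = 1u32;
        let mut k = BASE;
        // Read one variable-length integer (the next delta).
        loop {
            let digit = match digits.next()? {
                b @ b'a'..=b'z' => (b - b'a') as u32,
                b @ b'A'..=b'Z' => (b - b'A') as u32,
                b @ b'0'..=b'9' => (b - b'0') as u32 + 26,
                _ => return None,
            };
            i = i.checked_add(digit.checked_mul(w)?)?;
            let t = if k <= bias {
                T_MIN
            } else if k >= bias + T_MAX {
                T_MAX
            } else {
                k - bias
            };
            if digit < t {
                break;
            }
            w = w.checked_mul(BASE - t)?;
            k += BASE;
        }
        // The delta encodes both the next code point and its insert position.
        let len = output.len() as u32 + 1;
        bias = adapt(i - old_i, len, old_i == 0);
        n = n.checked_add(i / len)?;
        i %= len;
        output.insert(i as usize, char::from_u32(n)?);
        i += 1;
    }
    Some(output.into_iter().collect())
}
```

With this sketch, `punycode_decode("--9mcjf9b4dbm09f")` (the label without its `xn--` prefix) returns `Some("روغن-کنجد")`, which is the RTL label shown in the URL above.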
But actually, I think there is some misinterpretation of the standard. Yes, the label “163” violates the Bidi rule, but the following paragraph of section 2 of RFC-5893 says the following:
So, to be a valid Bidi domain name, all labels in the domain name must satisfy the Bidi rule, with the only exception that an LDH label (a label that contains only ASCII letters, digits and hyphens) may start with a European digit if it comes before any RTL label. The following parts of RFC-5893 make that even clearer and provide the motivation for this decision. Section 5. “Troublesome Situations and Guidelines”
Section 7. “Compatibility Considerations” is the most straightforward
So, in general, it is not recommended (though not strictly forbidden) to use an RTL label as a subdomain of a digit-leading label, but the reverse, a digit-leading label as a subdomain of an RTL label, is fine.
But another issue is that, while RFC-5893 does allow labels starting with a digit, the “Unicode IDNA Compatibility Processing” standard, which is used as the basis for this crate’s implementation, provides publicly available test files containing some domain names that are marked as violating Bidi rule #1, although they should be considered valid according to RFC-5893. For example, the domain “0a.א” appears in the IDNA test data as a domain that violates Bidi rule #1, while according to RFC-5893 it should be considered valid because the label starting with a digit precedes the RTL label.
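The difference between the two readings can be sketched in a few lines of Rust. This is only an illustration under loud assumptions: `is_rtl_char` approximates the Bidi classes R, AL and AN with a few hard-coded code point ranges instead of real Unicode data, only the leading-digit condition is checked (not the rest of the Bidi rule), labels are taken in their written left-to-right order, and all function names are made up here rather than taken from the url crate.

```rust
/// Rough stand-in for "Bidi class R, AL or AN" (assumption: a few
/// Hebrew/Arabic-area ranges, not the real Unicode Bidi property data).
fn is_rtl_char(c: char) -> bool {
    matches!(c as u32, 0x0590..=0x08FF | 0xFB1D..=0xFDFF | 0xFE70..=0xFEFC)
}

/// An RTL label contains at least one RTL character.
fn is_rtl_label(label: &str) -> bool {
    label.chars().any(is_rtl_char)
}

/// An LDH label contains only ASCII letters, digits and hyphens.
fn is_ldh_label(label: &str) -> bool {
    !label.is_empty() && label.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-')
}

fn starts_with_ascii_digit(label: &str) -> bool {
    label.chars().next().map_or(false, |c| c.is_ascii_digit())
}

/// Strict reading: in a domain with any RTL label, no label may start
/// with an ASCII digit.
fn strict_ok(domain: &str) -> bool {
    let labels: Vec<&str> = domain.split('.').collect();
    let is_bidi = labels.iter().any(|l| is_rtl_label(l));
    !(is_bidi && labels.iter().any(|l| starts_with_ascii_digit(l)))
}

/// Relaxed reading: a digit-leading LDH label is accepted as long as it
/// does not come after an RTL label.
fn relaxed_ok(domain: &str) -> bool {
    let labels: Vec<&str> = domain.split('.').collect();
    if !labels.iter().any(|l| is_rtl_label(l)) {
        return true; // not a Bidi domain name, nothing to check here
    }
    let mut seen_rtl = false;
    for label in labels {
        if starts_with_ascii_digit(label) && (seen_rtl || !is_ldh_label(label)) {
            return false;
        }
        seen_rtl |= is_rtl_label(label);
    }
    true
}
```

Under these assumptions, `strict_ok("0a.\u{05D0}")` is `false` while `relaxed_ok("0a.\u{05D0}")` is `true` (U+05D0 is א). Swapping the labels, as in `"\u{05D0}.0a"`, is rejected by both functions.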
I tried to look at other implementations of UTS46, for example the Python implementation (https://github.com/kjd/idna/blob/master/tests/test_idna_uts46.py). They explicitly skip the tests that enforce Bidi rules for non-Bidi labels, saying that “These appear to be errors in the test vector”.
Anyway, I personally think that practicality should outweigh strict adherence to the specification, and if the current implementation considers existing domains invalid, it’s better to relax some parsing restrictions.
“Need” is an interesting word.
I’ll just leave this here: https://users.rust-lang.org/t/help-wanted-maintaining-rust-url/10707
Not exactly. RFC-5893 says that URLs containing labels that start with a digit are absolutely valid if such a label comes before any RTL label, while domain names that contain digit-leading labels after an RTL label should be forbidden for registration.
Also, I’m not sure about the terms “new” and “old” spec. As far as I know, RFC-5893 is the current spec for bidirectional internationalized domain names, and it hasn’t been replaced by any newer spec. As I understand it, this crate is based on a quite different document, “Unicode® Technical Standard #46: Unicode IDNA Compatibility Processing” (https://unicode.org/reports/tr46/). It is not a replacement for the aforementioned RFC; instead, it refers to the RFC. But this document says that a domain that contains any RTL label cannot contain a digit-leading label. Frankly, I don’t know the full relationship between these two documents and can’t say which one should take priority in any given case.