nushell: `url parse | url join` doesn't roundtrip when there is no query string
Describe the bug
When url join is fed the output of url parse, it doesn’t reproduce the original URL in the (common) case when there is no query string. Instead it adds a ?, denoting an empty query string rather than an absent query string.
The root of the issue seems to be that when url parse processes a URL with no query string, it produces a record with an empty string for the query field, and an empty record for the params field:
❯ "https://example.com/foo" | url parse
╭──────────┬───────────────────╮
│ scheme │ https │
│ username │ │
│ password │ │
│ host │ example.com │
│ port │ │
│ path │ /foo │
│ query │ │
│ fragment │ │
│ params │ {record 0 fields} │
╰──────────┴───────────────────╯
In the docs for url join, the only examples where it isn’t intended for a query string to appear in the output use input records that simply don’t have the query or params fields. And indeed removing the params field makes url join:
❯ "https://example.com/foo" | url parse | reject params | url join
https://example.com/foo
Fundamentally, the problem is that representing the query string as a record of parameters isn’t capable of representing the distinction between an empty query string and an absent query string. According to RFC3986 these are not the same:
Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI
http://example.com/?cannot be assumed to be equivalent to any of the examples above.
This doesn’t seem to be possible to solve perfectly without making the type of the output of url parse less predictable. Either it needs to omit the params and query fields when the query string isn’t present, or they need to be set to a value that can be distinguished from an empty string/record (such as null).
Similar considerations probably apply to some of the other fields (all of the fields corresponding to absent parts of the URL come out as empty strings in my testing, which isn’t information-preserving if it’s valid to set the part to an empty string explicitly), though I’m not sure whether all of them are valid to be explicitly-empty. The query-string case is particularly unfortunate though as we end up preferring the interpretation that is less likely to be what was actually meant.
How to reproduce
❯ "https://example.com/foo" | url parse | url join
https://example.com/foo?
Expected behavior
I expect url parse to output sufficient information for url join to recreate common URLs. I should see:
❯ "https://example.com/foo" | url parse | url join
https://example.com/foo
Screenshots
No response
Configuration
| key | value |
|---|---|
| version | 0.80.0 |
| branch | |
| commit_hash | |
| build_os | linux-x86_64 |
| build_target | x86_64-unknown-linux-gnu |
| rust_version | rustc 1.69.0 (84c898d65 2023-04-16) (built from a source tarball) |
| cargo_version | cargo 1.69.0 |
| build_time | 1980-01-01 00:00:00 +00:00 |
| build_rust_channel | release |
| features | default, zip |
| installed_plugins |
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15 (8 by maintainers)
Thanks @gdhuper for getting this fix so quickly!
@cumber @amtoine I have a quick fix for handling empty query / params: https://github.com/nushell/nushell/pull/9356
However, I want to clarify a couple of things:
Noneand then check forNoneinstead of an empty string?%20. But ishttps://example.com/foo bareven a valid URL?usernameandpasswordscenario, do we want to remove the functionality altogether since it has been deprecated?@amtoine yes please! Thanks
After a little bit more testing, there are some more infelicities.
url joinseems to treat all of the parts as missing if their field in the record consists only of whitespace; I’m pretty sure all-whitespace is valid for at least the query and fragment parts (though unusual for both).url parsepercent-encodes some things in its input but not others:url joinjust passes these straight through. (I suppose it’s good thaturl joinis inconsistent and only encodes some things, otherwise we’d risk parsing already percent-encoded things like%20to%2520, which is totally different). I’d expect that either they both leave percent encoding alone, or thaturl parsedecodes andurl joinencodes (and do both consistently either to all reserved characters, or even to all non-unreserved characters).Even worse,
url parseseems to then decode percent-encoding (including the spaces it just encoded) in theparamsrecord but not thequerystring;url jointhen fails because it checks that they match:url joinalso doesn’t put the user part into the URL unless both a username and password is present. So the roundtrip just silently drops a username without password:RFC3986 actually says that the
user:password@form is deprecated to stop people storing plaintext passwords in URLs, so we definitely shouldn’t be requiring it!url parsetrims trailing whitespace on its input. Probably not a huge problem really, but it does do things like treat query strings ending in whitespace differently depending on whether they’re followed by a fragment.It looks like
url joinalready treats most of these fields as being missing if their value isnull(the only exception seems to beportandparams). My vote for how to fix this would be:url parseemit nulls for missing partsurl jointreat empty or all-whitespace strings as values (failing only if those are invalid values), not as missing (and handlenullwhere it doesn’t already).url parsepercent-encoding spaces (and anything else?) in its inputurl parsecontinues to percent-decode in theparamsrecord for convenience,url joinshould re-encode it. This might need some flags though; “encode then decode” is always a safe roundtrip, but “decode then encode” isn’t unless you have specific knowledge of the input, so being able to opt-out of it is probably needed.It’d be really easy to fix just the issue in the title by simply flipping
url jointo not emit a?when theparamsrecord is empty, but it would be nicer to be more careful with how everything is handled so that weird-but-valid cases (and not-so-weird like username only or no query string!) don’t cause problems.