nushell: `url parse | url join` doesn't roundtrip when there is no query string

Describe the bug

When url join is fed the output of url parse, it doesn’t reproduce the original URL in the (common) case when there is no query string. Instead it adds a ?, denoting an empty query string rather than an absent query string.

The root of the issue seems to be that when url parse processes a URL with no query string, it produces a record with an empty string for the query field, and an empty record for the params field:

❯ "https://example.com/foo" | url parse
╭──────────┬───────────────────╮
│ scheme   │ https             │
│ username │                   │
│ password │                   │
│ host     │ example.com       │
│ port     │                   │
│ path     │ /foo              │
│ query    │                   │
│ fragment │                   │
│ params   │ {record 0 fields} │
╰──────────┴───────────────────╯

In the docs for url join, the only examples where it isn’t intended for a query string to appear in the output use input records that simply don’t have the query or params fields. And indeed removing the params field makes url join:

❯ "https://example.com/foo" | url parse | reject params | url join
https://example.com/foo

Fundamentally, the problem is that representing the query string as a record of parameters isn’t capable of representing the distinction between an empty query string and an absent query string. According to RFC3986 these are not the same:

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above.

This doesn’t seem to be possible to solve perfectly without making the type of the output of url parse less predictable. Either it needs to omit the params and query fields when the query string isn’t present, or they need to be set to a value that can be distinguished from an empty string/record (such as null).

Similar considerations probably apply to some of the other fields (all of the fields corresponding to absent parts of the URL come out as empty strings in my testing, which isn’t information-preserving if it’s valid to set the part to an empty string explicitly), though I’m not sure whether all of them are valid to be explicitly-empty. The query-string case is particularly unfortunate though as we end up preferring the interpretation that is less likely to be what was actually meant.

How to reproduce

❯ "https://example.com/foo" | url parse | url join
https://example.com/foo?

Expected behavior

I expect url parse to output sufficient information for url join to recreate common URLs. I should see:

❯ "https://example.com/foo" | url parse | url join
https://example.com/foo

Screenshots

No response

Configuration

key value
version 0.80.0
branch
commit_hash
build_os linux-x86_64
build_target x86_64-unknown-linux-gnu
rust_version rustc 1.69.0 (84c898d65 2023-04-16) (built from a source tarball)
cargo_version cargo 1.69.0
build_time 1980-01-01 00:00:00 +00:00
build_rust_channel release
features default, zip
installed_plugins

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Commits related to this issue

Most upvoted comments

Thanks @gdhuper for getting this fix so quickly!

@cumber @amtoine I have a quick fix for handling empty query / params: https://github.com/nushell/nushell/pull/9356

However, I want to clarify a couple of things:

  • In this fix, I am just doing a simple check in join to see if the query value is an empty string (because the parse function returns an empty string for missing params). Would we rather have the parse function return None and then check for None instead of an empty string?
  • In the example, the space is automatically replaced with %20. But is https://example.com/foo bar even a valid URL?
❯ "https://example.com/foo bar" | url parse | get path
/foo%20bar
  • For the username and password scenario, do we want to remove the functionality altogether since it has been deprecated?

@amtoine yes please! Thanks

After a little bit more testing, there are some more infelicities.

  1. url join seems to treat all of the parts as missing if their field in the record consists only of whitespace; I’m pretty sure all-whitespace is valid for at least the query and fragment parts (though unusual for both).

  2. url parse percent-encodes some things in its input but not others:

    ❯ "https://example.com/foo bar" | url parse | get path
    /foo%20bar
    
    ❯ "https://example.com/foo%bar" | url parse | get path
    /foo%bar
    

    url join just passes these straight through. (I suppose it’s good that url join is inconsistent and only encodes some things, otherwise we’d risk parsing already percent-encoded things like %20 to %2520, which is totally different). I’d expect that either they both leave percent encoding alone, or that url parse decodes and url join encodes (and do both consistently either to all reserved characters, or even to all non-unreserved characters).

  3. Even worse, url parse seems to then decode percent-encoding (including the spaces it just encoded) in the params record but not the query string; url join then fails because it checks that they match:

    ❯ "https://example.com?foo= %28" | url parse | url join
    Error: nu::shell::incompatible_parameters
    
      × Incompatible parameters.
       ╭─[entry #184:1:1]
     1 │ "https://example.com?foo= %28" | url parse | url join
       ·                                  ────┬────┬
       ·                                      │    ╰── instead query is: ?foo=%20%28
       ·                                      ╰── Mismatch, qs from params is: ?foo= (
       ╰────
    
  4. url join also doesn’t put the user part into the URL unless both a username and password is present. So the roundtrip just silently drops a username without password:

    ❯ "https://user@example.com" | url parse | url join
    https://example.com/?
    

    RFC3986 actually says that the user:password@ form is deprecated to stop people storing plaintext passwords in URLs, so we definitely shouldn’t be requiring it!

  5. url parse trims trailing whitespace on its input. Probably not a huge problem really, but it does do things like treat query strings ending in whitespace differently depending on whether they’re followed by a fragment.

It looks like url join already treats most of these fields as being missing if their value is null (the only exception seems to be port and params). My vote for how to fix this would be:

  1. Make url parse emit nulls for missing parts
  2. Make url join treat empty or all-whitespace strings as values (failing only if those are invalid values), not as missing (and handle null where it doesn’t already).
  3. Stop url parse percent-encoding spaces (and anything else?) in its input
  4. If url parse continues to percent-decode in the params record for convenience, url join should re-encode it. This might need some flags though; “encode then decode” is always a safe roundtrip, but “decode then encode” isn’t unless you have specific knowledge of the input, so being able to opt-out of it is probably needed.

It’d be really easy to fix just the issue in the title by simply flipping url join to not emit a ? when the params record is empty, but it would be nicer to be more careful with how everything is handled so that weird-but-valid cases (and not-so-weird like username only or no query string!) don’t cause problems.