wasm-bindgen: Incorrect handling of unpaired surrogates in JS strings

Describe the Bug

It was brought to my attention in https://github.com/Pauan/rust-dominator/issues/10 that JavaScript strings (and DOMString) allow for unpaired surrogates.

When TextEncoder converts a JS string to UTF-8, it replaces those unpaired surrogates with U+FFFD (the replacement character). According to the Unicode spec, this is correct behavior.

The issue is that this replacement is lossy, and that lossiness can cause serious problems.

You can read the above dominator bug report for the nitty-gritty details, but the summary is that with <input> fields (and probably other things), the browser will send two input events, one for each half of a surrogate pair.

When the first event arrives, the surrogate is unpaired, and because the string is immediately sent to Rust, the unpaired surrogate is converted into the replacement character.

Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.
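
To make the failure concrete, here is a small illustration using the Rust standard library (this isn’t wasm-bindgen’s actual marshalling path, which goes through TextEncoder, but the replacement behavior is the same):

```rust
fn main() {
    // U+1F600 (😀) is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00.
    // If each half arrives in a separate event and is converted immediately:
    assert_eq!(String::from_utf16_lossy(&[0xD83D]), "\u{FFFD}"); // first event
    assert_eq!(String::from_utf16_lossy(&[0xDE00]), "\u{FFFD}"); // second event

    // Only when both halves are converted together does the character survive:
    assert_eq!(String::from_utf16_lossy(&[0xD83D, 0xDE00]), "😀");
}
```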

This has very deep implications, including for international languages (e.g. Chinese), since characters outside the Basic Multilingual Plane are encoded as surrogate pairs.

I did quite a bit of reading, and unfortunately I think the only real solution here is to always use JsString and not convert into a Rust String, because that conversion is inherently lossy. Alternatively, if a conversion is done, it needs to check that there aren’t any unpaired surrogates.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 36 (23 by maintainers)

Most upvoted comments

If we’re leaving as_string as-is, can we at least rename it to as_string_lossy?

That would be more consistent with the standard library, where String::from_utf16_lossy does the same thing (and String::from_utf16 returns an error).
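
For reference, here’s a quick sketch of how that standard-library pair behaves:

```rust
fn main() {
    let lone_surrogate = [0xD800u16];

    // The strict conversion reports the problem...
    assert!(String::from_utf16(&lone_surrogate).is_err());

    // ...while the _lossy variant silently substitutes U+FFFD,
    // mirroring what as_string currently does.
    assert_eq!(String::from_utf16_lossy(&lone_surrogate), "\u{FFFD}");
}
```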

@alexcrichton I think right now we just need a way of detecting whether a JS string is valid UTF-16.

I see a couple options:

  1. The as_string method already says that it returns None if it’s invalid UTF-8, but that’s currently not true.

    So it could be changed to actually do a validity check and return None if the JS string isn’t valid UTF-16.

  2. Add a new is_valid_utf16_string method to JsValue which does the validity checking and returns bool.

Either solution works fine. With that basic building block, it’s now possible for the input event to detect whether the string is valid UTF-16 or not (and skip the event if it’s not).

This completely sidesteps the issue: we no longer need to define DomString, and we don’t need to use WTF-8 either. And that means the marshalling costs are the same as they currently are, since we’re just using String.
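
For concreteness, here’s a minimal sketch of such a validity check over the raw u16 code units (the helper name is hypothetical, not an existing wasm-bindgen API):

```rust
/// Returns true if the code units form valid UTF-16,
/// i.e. there are no unpaired surrogates.
fn is_valid_utf16(units: &[u16]) -> bool {
    // decode_utf16 yields Err for every unpaired surrogate it encounters.
    char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

fn main() {
    assert!(is_valid_utf16(&[0xD83D, 0xDE00])); // paired: 😀
    assert!(!is_valid_utf16(&[0xD83D]));        // lone high surrogate
    assert!(!is_valid_utf16(&[0xDE00]));        // lone low surrogate
}
```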

So, modulo some naming, given all this I’d propose a solution that looks like this:

  • Add DomString and DomStr types to the wasm-bindgen crate (again, modulo naming; that’s hopefully a separate discussion)
  • A DomString internally stores Vec<u16>
  • DomString derefs to DomStr
  • DomStr has APIs such as:
    • iter() - iterator of the u16 values
    • from_iter() - creation from u16 values
    • to_string{,_lossy}() - attempt to interpret the code units as UTF-16, with to_string returning None if that fails.
    • Maybe some extend-like APIs, or just exposing the raw underlying Vec<u16> somehow
  • Both DomString and DomStr can be used in boundary APIs the exact same way Vec<u16> is used.
    • Entering WebAssembly we use charCodeAt to create a list
    • Exiting WebAssembly we use fromCharCode and create a concatenated string

The intention here is that there’s optionally a type we can use in wasm-bindgen which is a lossless representation of a JS string. It isn’t necessarily the default, and we’ll still support str and String everywhere (although we’ll update the documentation to note this gotcha).

Additionally, I’m thinking that Vec<u16> may be a natural internal representation for now, because the main purpose of DomString would be to build something up to later interpret as UTF-16. The main alternative would be WTF-8 with Vec<u8> internally, but the only real point of doing that would be to support an allocation-less to_str() method (like OsStr on Windows), and if DomString is only an intermediate type before we interpret it as UTF-16, to_str doesn’t seem that well motivated for our use case.
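
To make the shape of the proposal concrete, here’s a rough sketch of what those types could look like. None of this exists in wasm-bindgen; every name comes from the bullet points above, and the bodies are just one possible implementation:

```rust
use std::iter::FromIterator;
use std::ops::Deref;

/// Owned, lossless JS string: a bag of UTF-16 code units.
pub struct DomString(Vec<u16>);

/// Borrowed view, analogous to String/str or OsString/OsStr.
#[repr(transparent)]
pub struct DomStr([u16]);

impl Deref for DomString {
    type Target = DomStr;
    fn deref(&self) -> &DomStr {
        // SAFETY: DomStr is #[repr(transparent)] over [u16].
        unsafe { &*(self.0.as_slice() as *const [u16] as *const DomStr) }
    }
}

impl FromIterator<u16> for DomString {
    fn from_iter<I: IntoIterator<Item = u16>>(iter: I) -> Self {
        DomString(iter.into_iter().collect())
    }
}

impl DomStr {
    /// Iterate over the raw u16 code units.
    pub fn iter(&self) -> impl Iterator<Item = u16> + '_ {
        self.0.iter().copied()
    }

    /// Strict conversion: None if there are unpaired surrogates.
    pub fn to_string(&self) -> Option<String> {
        String::from_utf16(&self.0).ok()
    }

    /// Lossy conversion: unpaired surrogates become U+FFFD.
    pub fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.0)
    }
}
```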

I think I’d also prefer to say we shouldn’t change js-sys or web-sys at this time. We’ll still use String and str (and Text{Decoder,Encoder}) everywhere for those. Over time we can possibly consider whitelisting APIs to use DomString (or similar) instead, but if @chris-morgan is right that unpaired surrogates are quite rare, we may be able to get by mostly with this. In the meantime, libraries like yours, @Pauan, would probably bind the API internally (i.e. not use web-sys for the APIs where you want DomString).


Ok, and for naming, one obvious alternative would be JsString and JsStr, which, while I think more appropriate, clash with js_sys::JsString, so we’d need to figure out what to do with that.

I know ~nothing about what this code does, but why is the string that was just read from the text input immediately written back to it?

I’m not Pauan, but I believe that code is implementing a React-style controlled component, where the DOM always reflects the current state as understood by the application.
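
As a rough sketch of that pattern (illustrative only, not dominator’s actual code; assumes the web-sys crate with the Event, EventTarget, and HtmlInputElement features enabled):

```rust
use wasm_bindgen::JsCast;
use web_sys::{Event, HtmlInputElement};

/// Handle an input event for a "controlled" text field: read the DOM
/// value into application state, then write the state back so the DOM
/// always mirrors it.
fn on_input(event: &Event, state: &mut String) {
    let input: HtmlInputElement = event
        .target()
        .expect("event has a target")
        .dyn_into()
        .expect("target is an <input>");

    // Reading the value crosses the JS->Rust boundary, where an
    // unpaired surrogate is lossily replaced with U+FFFD...
    *state = input.value();

    // ...and writing it back pushes the corrupted text into the DOM,
    // so the second half of the pair can never re-pair with the first.
    input.set_value(state);
}
```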