wasm-bindgen: Incorrect handling of unpaired surrogates in JS strings
Describe the Bug
It was brought to my attention in https://github.com/Pauan/rust-dominator/issues/10 that JavaScript strings (and DOMString) allow for unpaired surrogates.
When using TextEncoder, it will convert those unpaired surrogates into U+FFFD (the replacement character). According to the Unicode spec, this is correct behavior.
The issue is that because the unpaired surrogates are replaced, this is lossy, and that lossiness can cause serious issues.
You can read the above dominator bug report for the nitty gritty details, but the summary is that with <input> fields (and probably other things), it will send two input events, one for each surrogate.
When the first event arrives, the surrogate is unpaired, so because the string is immediately sent to Rust, the unpaired surrogate is converted into the replacement character.
Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.
This has a lot of very deep implications, including for international languages (e.g. Chinese).
I did quite a bit of reading, and unfortunately I think the only real solution here is to always use JsString, and not convert into Rust String, because that is inherently lossy. Or if a conversion is done, it needs to do some checks to make sure that there aren’t any unpaired surrogates.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 36 (23 by maintainers)
Commits related to this issue
- Add warnings about UTF-16 vs UTF-8 strings This commit aims to address #1348 via a number of strategies: * Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Nota... — committed to alexcrichton/wasm-bindgen by alexcrichton 5 years ago
- Add warnings about UTF-16 vs UTF-8 strings This commit aims to address #1348 via a number of strategies: * Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Nota... — committed to alexcrichton/wasm-bindgen by alexcrichton 5 years ago
- Add warnings about UTF-16 vs UTF-8 strings This commit aims to address #1348 via a number of strategies: * Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Nota... — committed to alexcrichton/wasm-bindgen by alexcrichton 5 years ago
- Add warnings about UTF-16 vs UTF-8 strings This commit aims to address #1348 via a number of strategies: * Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Nota... — committed to alexcrichton/wasm-bindgen by alexcrichton 5 years ago
- Add warnings about UTF-16 vs UTF-8 strings This commit aims to address #1348 via a number of strategies: * Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Nota... — committed to alexcrichton/wasm-bindgen by alexcrichton 5 years ago
Gecko bug, Chrome bug
If we’re leaving
as_stringas-is, can we at least rename it toas_string_lossy?That would be more consistent with the standard library, where
String::from_utf16_lossydoes the same thing (andString::from_utf16returns an error).@alexcrichton I think right now we just need a way of detecting if a JS string is valid UTF-16 or not.
I see a couple options:
The
as_stringmethod already says that it returnsNoneif it’s invalid UTF-8, but that’s currently not true.So it could be changed so it actually does a validity check and returns
Noneif the JS string isn’t valid UTF-16.Add a new
is_valid_utf16_stringmethod toJsValuewhich does the validity checking and returnsbool.Either solution works fine. With that basic building block, it’s now possible for the
inputevent to detect whether the string is valid UTF-16 or not (and skip the event if it’s not).This completely sidesteps the issue: we no longer need to define
DomString, and we don’t need to use WTF-8 either. And that means the marshalling costs are the same as they currently are, since we’re just usingString.So modulo some naming, I think given all this I’d propose a solution that looks like:
DomStringandDomStrtypes to thewasm-bindgencrate (again, modulo naming, that’s hopefully separate)DomStringinternally storesVec<u16>DomStringderefs toDomStrDomStrhas APIs such as:iter()- iterator of the u16 valuesfrom_iter()- creation from u16 valuesto_string{,_lossy}()- attempt to interpret as utf-16, returningNoneforto_stringif it fails.extend-like APIs, or just exposing the raw underlyingVec<u16>somehowDomStringandDomStrcan be used in boundary APIs the exact same wayVec<u16>is used.charCodeAtto create a listfromCharCodeand create a concatenated stringThe intention here is that there’s optionally a type that we can use in
wasm-bindgenwhich is a lossless representation of a string in JS. It isn’t necessarily the default, and we’ll still supportstrandStringeverywhere (althoguh we’ll update the documentation to note this gotcha). Additionally I’m thinking thatVec<u16>may be a natural internal representation for now because the main purpose ofDomStringwould be to build something up to later interpret as utf-16. The main alternative would be WTF-8 withVec<u8>internally, but the only real point of doing that would be to support an allocation-lessto_str()method (likeOsStron Windows) and ifDomStringis only an intermediate type before we interpret UTF-16, it seems liketo_strisn’t that well motivated for our use case.I think I’d probably prefer to also say we shouldn’t change
js-sysorweb-sysat this time. We’ll still useStringandstr(andText{Decoder,Encoder}) everywhere for those. Over time I think we can possibly consider whitelisting APIs to instead useDomString(or similar), but if what @chris-morgan says is right in that unpaired surrogates are quite rare we may be able to get by mostly with this. In the meantime libraries like your @Pauan would probably bind the API internally (e.g. not useweb-sysfor these APIs where you wantDomString)Ok and for naming, one obvious other name would be
JsStringandJsStr, which while I think are more appropriate clash withjs_sys::JsString, so we’d need to figure out what to do with that.I’m not Pauan, but I believe that code is implementing a React-style controlled component, where the DOM always reflects the current state as understood by the application.