Stay reasonably neutral in problem statement #25

Open
wants to merge 38 commits into
base: main
147a84f
Stay reasonably neutral in problem statement
dcodeIO Sep 15, 2022
c32cfa5
Update README.md
dcodeIO Sep 15, 2022
8acfac3
focus on neutrality with slightly more background
dcodeIO Sep 16, 2022
4832d5d
slightly better
dcodeIO Sep 16, 2022
9a44455
more balance
dcodeIO Sep 16, 2022
7cd3d8d
clarify
dcodeIO Sep 16, 2022
6bb703d
condense
dcodeIO Sep 17, 2022
2890191
less is more
dcodeIO Sep 17, 2022
c7deea1
ESM complications
dcodeIO Sep 17, 2022
7cdb1b6
context
dcodeIO Sep 17, 2022
daa77a4
leave doors open
dcodeIO Sep 17, 2022
2aeec56
neutral
dcodeIO Sep 17, 2022
d376e18
clarify
dcodeIO Sep 17, 2022
e13596b
JSON precedent
dcodeIO Sep 17, 2022
d5bee50
bridge
dcodeIO Sep 17, 2022
fa5f7eb
better not
dcodeIO Sep 17, 2022
dcf8b79
neutralize
dcodeIO Sep 17, 2022
573b7ec
concrete
dcodeIO Sep 17, 2022
81ca23a
mention stringref
dcodeIO Sep 17, 2022
b73a4f6
but mention at the side for WTF-8 context
dcodeIO Sep 17, 2022
01a14dc
fix copy pasta
dcodeIO Sep 17, 2022
4623be6
too much nesting
dcodeIO Sep 17, 2022
1256d4d
move details to FAQ
dcodeIO Sep 17, 2022
eaf162b
clarify
dcodeIO Sep 17, 2022
acb3a51
minus redundancy
dcodeIO Sep 17, 2022
44c4930
Update README.md
dcodeIO Sep 17, 2022
85bdd14
less verbose
dcodeIO Sep 17, 2022
ad12765
mention strategy
dcodeIO Sep 17, 2022
afa8147
actually redundant
dcodeIO Sep 17, 2022
be47eba
neutral
dcodeIO Sep 17, 2022
321481f
references
dcodeIO Sep 17, 2022
af1e54e
clarify
dcodeIO Sep 17, 2022
e2a67aa
redundant links should be fine
dcodeIO Sep 17, 2022
0fa2989
depends
dcodeIO Sep 17, 2022
06ecb99
reuse exact wording
dcodeIO Sep 17, 2022
a013bb1
superfluous
dcodeIO Sep 17, 2022
1f473a8
condense
dcodeIO Sep 18, 2022
c7ebcc2
wording
dcodeIO Sep 19, 2022
10 changes: 7 additions & 3 deletions README.md
@@ -6,11 +6,11 @@ Status: stage 2

## Problem Statement

-[ECMAScript string values](https://tc39.es/ecma262/multipage/overview.html#sec-terms-and-definitions-string-value) are a finite ordered sequence of zero or more 16-bit unsigned integer values. However, ECMAScript does not place any restrictions or requirements on the integer values except that they must be 16-bit unsigned integers. In *well-formed* strings, each integer value in the sequence represents a single 16-bit code unit of UTF-16-encoded Unicode text. However, not all sequences of UTF-16 code units represent UTF-16-encoded Unicode text. In well-formed strings, code units in the range `0xD800..0xDBFF` (leading surrogates) and `0xDC00..0xDFFF` (trailing surrogates) must appear paired and in order. Strings with unpaired or out-of-order surrogates are *ill-formed*.
+[ECMAScript string values](https://tc39.es/ecma262/multipage/overview.html#sec-terms-and-definitions-string-value) are a finite ordered sequence of zero or more 16-bit unsigned integer values (in Unicode terms: individual UTF-16 code units). Unlike Unicode text, which is a sequence of Unicode Scalar Values that excludes the surrogate code points, ECMAScript does not place any constraints on the code units except that they must be 16-bit unsigned integers. Hence, not all sequences of UTF-16 code units in ECMAScript strings represent Unicode text. More precisely, the invariant that code units in the range `0xD800..0xDBFF` (leading surrogates) and `0xDC00..0xDFFF` (trailing surrogates) must appear paired and in order is not eagerly enforced for compatibility reasons. Strings with unpaired or out-of-order surrogates are called *ill-formed*.
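
The pairing rule described in the problem statement can be sketched as a plain loop over code units. This is only an illustration: `isWellFormedSketch` is a hypothetical helper name, not text from the proposal, and an engine that ships the proposal's `isWellFormed` method would make it unnecessary.

```javascript
// Hypothetical helper (assumption, not spec text): returns true when every
// surrogate code unit in `str` belongs to a leading+trailing pair.
function isWellFormedSketch(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: the next code unit must be a trailing surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the trailing half of the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trailing surrogate that was not consumed as part of a pair.
      return false;
    }
  }
  return true;
}

const paired = "\uD83D\uDE00";     // U+1F600 encoded as a surrogate pair
const unpaired = "ab\uD800cd";     // lone leading surrogate
const outOfOrder = "\uDE00\uD83D"; // trailing surrogate before leading one
```

Under this sketch, `paired` is well-formed, while `unpaired` and `outOfOrder` are both ill-formed.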

-In WebIDL, possibly-ill-formed Strings are referred to using the [DOMString](https://webidl.spec.whatwg.org/#idl-DOMString) type. But for interfaces which operate on Unicode text, WebIDL also defines the [USVString](https://webidl.spec.whatwg.org/#idl-USVString) type as the set of all possible sequences of Unicode Scalar Values, which are all of the Unicode code points apart from the surrogate code points (`U+0000..U+D7FF` and `U+E000..U+10FFFF`), representing Unicode text.
+In WebIDL, potentially ill-formed Strings are referred to using the [DOMString](https://webidl.spec.whatwg.org/#idl-DOMString) type. Well-formed strings are referred to using the [USVString](https://webidl.spec.whatwg.org/#idl-USVString) type. Due to their asymmetric value spaces, conversions from `DOMString` to `USVString` are lossy, whereas `DOMString` can represent any `USVString`. Upon conversion from `DOMString` to `USVString`, common options are to replace unpaired surrogates with the replacement character (`U+FFFD`) or to throw an error. Where the side-effect of silent mutation or erroring is undesired, mixed UTF-8/16 systems preserve integrity instead by utilizing [WTF-8](https://simonsapin.github.io/wtf-8/). JSON preserves `DOMString` through the use of escape sequences.
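
The two conversion options and the JSON behavior mentioned here can be illustrated with a short sketch. The helper names `toUSVReplacing` and `toUSVStrict` and the `LONE_SURROGATE` pattern are assumptions made for this example; they are not WebIDL or proposal APIs.

```javascript
// Assumed pattern: a leading surrogate not followed by a trailing one,
// or a trailing surrogate not preceded by a leading one.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Option 1 (hypothetical helper): silently replace with U+FFFD. Lossy.
function toUSVReplacing(str) {
  return str.replace(LONE_SURROGATE, "\uFFFD");
}

// Option 2 (hypothetical helper): refuse ill-formed input by throwing.
function toUSVStrict(str) {
  const index = str.search(LONE_SURROGATE);
  if (index !== -1) {
    throw new TypeError(`lone surrogate at index ${index}`);
  }
  return str;
}

const illFormed = "ab\uD800cd";

// The original code unit cannot be recovered from the replaced string.
const replaced = toUSVReplacing(illFormed);

// JSON keeps the lone surrogate as an escape sequence, so it round-trips.
const viaJson = JSON.parse(JSON.stringify(illFormed));
```

The replacing variant never fails but changes the data, the strict variant never changes the data but can fail, and the JSON round trip does neither because the serialized form is itself well-formed text.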

-The WebAssembly [Component Model](https://github.com/WebAssembly/component-model) requires well-formed strings, as do some compile-to-JS programming languages, many data encodings, network interfaces, filesystem interfaces, etc. Interfacing JavaScript strings with such APIs is a common use case that therefore suffers from conversion burdens. In particular because conversion from `DOMString` to `USVString` is lossy (common options are to replace unpaired surrogates or to throw an error) there is a regular need for string validation both within the platform and for certain userland use case scenarios.
+The WebAssembly [Component Model](https://github.com/WebAssembly/component-model) mandates sequences of Unicode Scalar Values (`USVString`) on component boundaries, choosing to eagerly sanitize `DOMString`s including when integrating with JavaScript ([see FAQ](#how-does-this-proposal-relate-to-the-webassembly-component-model)). Independently, a significant number of data encodings, network interfaces, filesystem interfaces, compile-to-JS programming languages, etc. are designed for Unicode text (`USVString`) with no requirement to support `DOMString`. Therefore, interfacing `DOMString`s with such APIs is a common use case that suffers from the outlined conversion burdens. To detect or address silent data mutation or thrown errors early, there is thus a need for manual string validation and sanitization both within the platform and for certain userland use case scenarios.

## Proposal

@@ -70,3 +70,7 @@ Performance optimizations are up to implementations and are not guaranteed by th
### Are consumers going to do anything other than convert when they encounter ill-formed strings? If so, why not provide only a conversion method with a fast path for well-formed strings?

Consumers may want to throw/error when encountering ill-formed strings. Also, consumers may want to defer the conversion or the error until later when the String is actually interpreted as Unicode text. These use cases justify the test-only method.
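
As a rough illustration of the deferral use case, a consumer could record well-formedness when a string arrives and raise an error only when the string is actually interpreted as Unicode text. `DeferredText` and `UNPAIRED` are hypothetical names introduced for this sketch; with the proposal available, the regular expression test would presumably be replaced by the test-only method.

```javascript
// Assumed pattern for an unpaired or out-of-order surrogate code unit.
const UNPAIRED =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Hypothetical wrapper: tests once on receipt, converts nothing.
class DeferredText {
  constructor(raw) {
    this.raw = raw;
    this.wellFormed = !UNPAIRED.test(raw); // test-only step
  }

  // Opaque uses (storing, forwarding, comparing) never fail.
  get codeUnits() {
    return this.raw;
  }

  // Interpreting the value as Unicode text is where the error surfaces.
  toUnicodeText() {
    if (!this.wellFormed) {
      throw new TypeError("string is ill-formed");
    }
    return this.raw;
  }
}
```

A consumer that prefers to fail early can instead check `wellFormed` immediately after construction and throw there.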

+### How does this proposal relate to the WebAssembly Component Model?
Member

This PR seems fine to me now, if this entire superfluous section is removed.

Contributor Author

There are several direct connections to this proposal in this section of the FAQ (that I moved there from the problem statement to do y'all a favor btw). In addition to explaining why the CM is probably the most notable precedent for this proposal, why timings overlap, why this is special, why performance is important, etc., the paragraph also aids future discussion. Today, this proposal uses the CM as an argument, and it is likely that tomorrow the CM will use this proposal as an argument to bolster its choices because "but you can check". This is highly relevant context that, if withheld for editorial reasons, will not be obvious and will likely lead to worse results overall than if it was provided. Please let's be responsible.

Member

As a JS developer, I should not need to understand or read anything about wasm beyond "wasm requires well-formed strings". None of the info about the component model is relevant to me. A TC39 proposal is not an appropriate place to create leverage for an argument within wasm.

Contributor Author
@dcodeIO Sep 19, 2022

This is not intended as leverage, really. If you can point me to sentences that feel like it to you, please tell me, and I'll do my best to phrase them as objectively as I humanly can. Hardly anyone knows about the nitty gritty details in play here (this proposal helps to some extent), even fewer people know how all this is connected, but exactly this knowledge is important, now more than ever with the CM doing things differently than the platform has done it before. As a JS developer, you should care because you have dependencies, and these dependencies have dependencies, you upgrade them, you bundle them. This string stuff is subtle, but also fundamental, and errors are delayed, almost impossible to spot. With the CM, JS developers need a bunch of highly specialized knowledge now that they didn't necessarily need to care about before. I do think that we are acting responsibly when we try our best to let them know about the what and why, so they can address it, either in code or in the spec. Please give me a realistic chance to improve this FAQ section. I think it's important that people are aware. That there is discourse about this, knowledge.

Member

I think the place that awareness should be spread is a wasm arena, not a JS one.

Contributor Author

Please, this is about helping JS devs. To help TC39. ECMAScript. I am out of options in Wasm, nobody cares there :(

Member

That's exactly the point - if wasm's decisions here aren't going to change, why is this extra context useful to JS devs? All they need to know is what a well-formed string is, how to check for it, and how to sanitize one if needed - which this proposal already provides for.

Contributor Author

Given that the CM is in phase 1, it's too early to know for sure whether Wasm's decisions are going to last. And as I said in my prior comment, JS devs, in particular those intending to consume Wasm modules, even more so those with an interest in using JS and Wasm in tandem (which is a fantastic use case btw and one of Wasm's stated goals), should care:

> As a JS developer, you should care because you have dependencies, and these dependencies have dependencies, you upgrade them, you bundle them. This string stuff is subtle, but also fundamental, and errors are delayed, almost impossible to spot. With the CM, JS developers need a bunch of highly specialized knowledge now that they didn't necessarily need to care about before. I do think that we are acting responsibly when we try our best to let them know about the what and why, so they can address it, either in code or in the spec.

If they are kept in the dark, this will just remain the self-fulfilling prophecy it already is: Wasm does its thing, while all attempts to inform or include JS devs are stonewalled. Greetings to the same Wasm folks up- and downvoting here btw. :)

Member

If wasm makes decisions that make JS interop harder in practice, that hurts wasm devs, not JS devs, because that will just inhibit use of wasm. I don't see what JS devs can do about it - nor do I think that the entry to the wasm funnel will be on this particular TC39 proposal.

Contributor Author
@dcodeIO Sep 19, 2022

That doesn't sound like a good outcome, rather like an argument to better have an FAQ entry so nobody is hurt unknowingly. I think there's a lot that JS devs with an interest in Wasm can do btw. (I'm one of them, and look at me), just not if relevant context is withheld from them. And for sure this proposal is the entry point: Want to know what the problem is? Direct here. Ran into an issue? Direct here. Component Model? Soon directs here. In a sense, JS and Wasm meet exactly here, where the problem is addressed with manual checks.


+The WebAssembly [Component Model](https://github.com/WebAssembly/component-model) motivates this proposal in part, both in terms of functionality and timing. Unlike ECMAScript and similar languages like Java, which use `DOMString` or identical semantics to represent language-level `String`s, WebIDL, which [recommends using `DOMString` when in doubt](https://webidl.spec.whatwg.org/#idl-USVString), and JSON, which preserves `DOMString` [through escape sequences](https://github.com/tc39/proposal-well-formed-stringify), the authors of the Component Model [obtained majority agreement](https://github.com/WebAssembly/meetings/blob/main/main/2021/CG-08-03.md#discussions) to allow only `USVString` on component boundaries, including when both producer and receiver prefer `DOMString`. The decision extends to the Component Model's planned [Web platform / ESM integration](https://github.com/WebAssembly/component-model/blob/main/design/high-level/Goals.md), increasing the usefulness of this proposal given that both new as well as existing code originally unconcerned with `USVString` is affected. In particular, once JavaScript modules are upgraded to or otherwise replaced with WebAssembly components via `import`s, manual checks will become necessary to detect and work around newly materializing silent mutation where undesired. As a result, such checks will be needed more frequently than before the Component Model, in turn emphasizing the importance of this proposal's suggested performance optimizations. The Component Model is currently in [phase 1](https://github.com/WebAssembly/proposals) of the WebAssembly process and there is ongoing debate about whether its choices are desirable, including a [detailed objection](https://www.assemblyscript.org/standards-objections.html#component-model-2022-09) with more background.
+A related proposal, [Reference-Typed Strings for WebAssembly](https://github.com/WebAssembly/stringref), exists that, contrary to the Component Model, preserves integrity by utilizing [WTF-8](https://simonsapin.github.io/wtf-8/). It cannot be ruled out that the Component Model's `string` and Reference-Typed Strings' `stringref` align in the future for consistency reasons, then either subtracting from or adding to this proposal's problem statement.