subtract a not-proper-subset #32
Comments
@ljharb, does this rationale address your concerns? Should we add this to the FAQ or discuss it further? |
I'd certainly add it to the FAQ. I'm not 100% convinced - programming simply isn't math, and the rules of math don't need to apply in programming - but I don't feel strongly enough to insist on any changes at this time. |
The example from the slides was:
Perhaps a more illustrative example for this specific concern is the following:
The set of strings to which |
Is that kind of change likely? Wouldn’t removal of a character from one of these sets be a likely breaking change to JS anyways, especially with this proposal increasing reliance on set membership? |
That kind of change isn't, but many are, because characters are continually
being added, and properties do get refined over time, especially for
longer-tail scripts.
You really don't want the following to suddenly throw an exception or have
other bizarre behavior:
[\p{prop1}--\p{prop2}]
Moreover there are lots of circumstances where you don't want to have to
figure out whether something is a proper subset or not. It would just make
expressions needlessly complicated.
[\p{script=greek}--\p{lowercase_letter}]
That also goes for intersection as well as set subtraction.
[\p{prop1}&&\p{prop2}]
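The two concrete cases above can be tried directly. This is a minimal sketch, assuming an engine that implements this proposal's syntax under the `v` flag (the flag name it was eventually standardized with); `buildClass` is a hypothetical helper, and the property names use the exact casing that JavaScript requires rather than the informal lowercase spelling above.

```javascript
// Sketch only: relies on the `v` (unicodeSets) flag; returns null where unsupported.
function buildClass(source) {
  try {
    return new RegExp(`^${source}$`, "v");
  } catch {
    return null; // engine without `v` flag support
  }
}

// Lowercase_Letter is not a subset of Script=Greek (it also contains Latin "a"),
// yet the subtraction is well defined and building the pattern does not throw.
const greekNotLower = buildClass("[\\p{Script=Greek}--\\p{Lowercase_Letter}]");
// Intersection of the same two overlapping sets.
const greekAndLower = buildClass("[\\p{Script=Greek}&&\\p{Lowercase_Letter}]");

if (greekNotLower && greekAndLower) {
  console.log(greekNotLower.test("Α")); // Greek capital alpha: true
  console.log(greekNotLower.test("α")); // Greek small alpha: false
  console.log(greekAndLower.test("α")); // true
  console.log(greekAndLower.test("a")); // Latin small a: false
}
```

Neither pattern requires the author to know how the two properties overlap.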
…On Mon, May 31, 2021, 23:25 Jordan Harband ***@***.***> wrote:
Is that kind of change likely? Wouldn’t removal of a character from one of
these sets be a likely breaking change to JS anyways, especially with this
proposal increasing reliance on set membership?
|
It would also be an implementation/performance concern, as it’d require knowledge of all strings in each set at parse time — something we’ve explicitly been trying to avoid in our proposal. |
Characters being added is one of my concerns. If a character suddenly gets added, then any subtraction can go from a no-op to removing a character, without the pattern or flags changing. |
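To make this concern concrete, here is a small sketch that models character classes as plain `Set`s. The data is entirely hypothetical: `propV1` and `propV2` stand in for the same property under two Unicode versions, and `difference` is an illustrative helper, not part of any proposal.

```javascript
// Hypothetical data: tiny stand-ins for one property in two Unicode versions.
const propV1 = new Set(["a"]);
const propV2 = new Set(["a", "x"]); // a later version adds "x" to the property

// Mathematical set difference: members of `a` that are not in `b`.
function difference(a, b) {
  return new Set([...a].filter((ch) => !b.has(ch)));
}

const base = new Set(["x", "y"]);

// With the older data the subtraction is a no-op; with the newer data it
// removes "x". The "pattern" (base minus prop) is unchanged in both cases.
const withV1 = difference(base, propV1);
const withV2 = difference(base, propV2);

console.log([...withV1]); // [ 'x', 'y' ]
console.log([...withV2]); // [ 'y' ]
```

In this model the result changes with the data, but the operation stays defined in both versions.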
Note that the question of subtracting a not-proper-subset applies to properties of characters just as much as to properties of strings.
In general, this is true, but I am not aware of a commonly used set implementation in programming that does not adhere to what people think of as sets in the mathematical sense; in particular, I am not aware of any that treats removal of a key that is not in the set as an error.
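JavaScript's own `Set` is one such implementation. A minimal sketch, with made-up element values:

```javascript
const letters = new Set(["a", "b", "c"]);

// Deleting a present key returns true; deleting an absent key returns false.
// Neither call throws.
const removedPresent = letters.delete("a");
const removedAbsent = letters.delete("z");

// Remove every element of a set that is not a subset of `letters`.
for (const ch of new Set(["b", "x", "y"])) {
  letters.delete(ch);
}

console.log(removedPresent, removedAbsent); // true false
console.log([...letters]); // [ 'c' ]
```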
In regular expression engines that support set subtraction, I also don't see any mention of treating this as an error. They are defined as you would expect. See links to several here: https://github.com/tc39/proposal-regexp-set-notation#whats-the-precedent-in-other-regexp-flavors

Example spec text from XML Schema: “For any ·positive character group· or ·negative character group· G, and any ·character class expression· C, G-C is a valid ·character class subtraction·, identifying the set of all characters in C(G) that are not also in C(C).”

In the .NET regex documentation, limited as the syntax is, there are actually examples that subtract a not-proper-subset:
|
When you use properties, you want the regular expression to follow along with Unicode versions. If you want perfect stability, then you need to hardcode all ranges. It's a trade-off: stability vs. auto-updating. When dealing with natural languages, there is a benefit to using properties.

Trivial example: Pick a range of code points that is unassigned now but where Unicode 14 adds a new script: |
Examples from https://github.com/tc39/proposal-regexp-set-notation#illustrative-examples which we got from real code:
It would be very awkward having to write these as subtracting proper subsets. |
Cc stage 3 reviewers @waldemarhorwat @gibson042 @msaboff I feel strongly that we should treat character classes like mathematical sets, as the spec says already, and as motivated by the examples above and by comparison with other regex implementations. Thus I suggest that we add this to the FAQ and then close this issue. |
PR: #36 |
In the TC39 meeting this week (2021-05-26), someone asked what happens in the subtraction A--B when B is not a proper subset of A.
In my mind, character classes define sets in the mathematical sense, and set operations should behave as usual. From the current spec: “A CharSet is a mathematical set of characters.” The semantics evaluate a CharacterClass and a CharacterClassEscape to a CharSet, using union operations as appropriate (“Return the union of ...”).