-
Notifications
You must be signed in to change notification settings - Fork 1.5k
<regex>
: Correctly reset capture groups for ECMAScript grammar
#5456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
muellerj2
wants to merge
13
commits into
microsoft:main
Choose a base branch
from
muellerj2:regex-fix-capture-group-reset
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
<regex>
: Correctly reset capture groups for ECMAScript grammar
#5456
muellerj2
wants to merge
13
commits into
microsoft:main
from
muellerj2:regex-fix-capture-group-reset
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…its uses by `_St._Cur`
…re groups before each repetition
b86b415
to
90f84c2
Compare
<regex>
: Correctly reset capture groups for ECMAScript grammars<regex>
: Correctly reset capture groups for ECMAScript grammar
muellerj2
commented
May 1, 2025
StephanTLavavej
approved these changes
May 8, 2025
Thanks, this is great! 😻 I pushed a conflict-free merge with |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
StephanTLavavej
added a commit
to StephanTLavavej/STL
that referenced
this pull request
May 9, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #5365 and does some preparatory work for #174.
ECMAScript requires that all capture groups inside a loop are reset to "unmatched" at the start of each repetition. To do this correctly, we have to know the first capture group that appears inside the loop. We actually do not have to know the last one to reset the capture groups correctly: If a loop is repeated, then all capture groups between the first capture group inside the loop and the last one in the regex appear inside this loop or are already unmatched. (Knowing the last capture group inside the loop would avoid some unnecessary work, but given that the number of capture groups is usually low in practice it's not clear that this additional work is worth it when traded against more NFA traversal at the time of matching. This would require some benchmarking.)
To find the first capture group, the matcher now traverses the NFA until it encounters a capture group or has visited all of the nodes inside a loop. For POSIX regexes, the traversal is skipped; instead the first capture group is set to the number of capture groups in the whole expression, which means that no capture groups will be reset when processing loops as required by POSIX semantics.
To avoid this NFA traversal for each repetition, the result of this traversal is now cached in the loop state by extending the type
_Loop_vals_t
. But this means the type_Loop_vals_t
and_Matcher
have to be renamed for binary compatibility. At this opportunity, I also make some minor cleanups.I split this PR into several commits to make reviewing easier:
_Matcher
to_Matcher2
and add a template parameter for an allocator to prepare for <regex>: Should regex_match() use the match_results allocator? #174 as suggested by STL. For now, I'm passingvoid
as the allocator argument because it's obviously not an allocator, so we don't have to worry about ABI when allocator support is actually added._Ncap
member variable: It should beunsigned int
, notint
._Get_ncap()
: It has become superfluous because the type of_Ncap
has been corrected._First
in favor of_Begin
(to match_End
)._Cur_iter
in_Do_rep
to minimally reduce stack consumption.<regex>
: Implementation divergence for capture group behavior #5365:_Loop_vals_t
and rename it to_Loop_vals_v2_t
._Find_first_inner_capture_group
that recursively traverses the NFA to find the first capture group. The recursive traversal is protected against stack overflow like all other recursive calls in the matcher._Do_rep
into_Do_rep
(doing the main work for any repetition) and_Do_rep_first
(doing the preparation before the first repetition)._Do_rep
to reset capture groups._Do_rep0
to reset capture groups in one for loop. (They already are reset correctly in the other for loop.)_Do_rep_first
to perform the NFA traversal if necessary and update the loop state accordingly.