Skip to content

<regex>: Correctly reset capture groups for ECMAScript grammar #5456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented May 1, 2025

Fixes #5365 and does some preparatory work for #174.

ECMAScript requires that all capture groups inside a loop are reset to "unmatched" at the start of each repetition. To do this correctly, we have to know the first capture group that appears inside the loop. We actually do not have to know the last one to reset the capture groups correctly: If a loop is repeated, then all capture groups between the first capture group inside the loop and the last one in the regex appear inside this loop or are already unmatched. (Knowing the last capture group inside the loop would avoid some unnecessary work, but given that the number of capture groups is usually low in practice it's not clear that this additional work is worth it when traded against more NFA traversal at the time of matching. This would require some benchmarking.)

To find the first capture group, the matcher now traverses the NFA until it encounters a capture group or has visited all of the nodes inside a loop. For POSIX regexes, the traversal is skipped; instead the first capture group is set to the number of capture groups in the whole expression, which means that no capture groups will be reset when processing loops as required by POSIX semantics.

To avoid this NFA traversal for each repetition, the result of this traversal is now cached in the loop state by extending the type _Loop_vals_t. But this means the type _Loop_vals_t and _Matcher have to be renamed for binary compatibility. At this opportunity, I also make some minor cleanups.

I split this PR into several commits to make reviewing easier:

  • Rename _Matcher to _Matcher2 and add a template parameter for an allocator to prepare for <regex>: Should regex_match() use the match_results allocator? #174 as suggested by STL. For now, I'm passing void as the allocator argument because it's obviously not an allocator, so we don't have to worry about ABI when allocator support is actually added.
  • Fix the type of the _Ncap member variable: It should be unsigned int, not int.
  • Remove the member function _Get_ncap(): It has become superfluous because the type of _Ncap has been corrected.
  • Remove the matcher's duplicate member variable _First in favor of _Begin (to match _End).
  • Remove the unnecessary local variable _Cur_iter in _Do_rep to minimally reduce stack consumption.
  • Actual fix of <regex>: Implementation divergence for capture group behavior #5365:
    • Extend _Loop_vals_t and rename it to _Loop_vals_v2_t.
    • Add function _Find_first_inner_capture_group that recursively traverses the NFA to find the first capture group. The recursive traversal is protected against stack overflow like all other recursive calls in the matcher.
    • Split original _Do_rep into _Do_rep (doing the main work for any repetition) and _Do_rep_first (doing the preparation before the first repetition).
    • Update _Do_rep to reset capture groups.
    • Update _Do_rep0 to reset capture groups in one for loop. (They already are reset correctly in the other for loop.)
    • Implement _Do_rep_first to perform the NFA traversal if necessary and update the loop state accordingly.

@muellerj2 muellerj2 requested a review from a team as a code owner May 1, 2025 17:16
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 1, 2025
@muellerj2 muellerj2 force-pushed the regex-fix-capture-group-reset branch from b86b415 to 90f84c2 Compare May 1, 2025 17:19
@muellerj2 muellerj2 changed the title <regex>: Correctly reset capture groups for ECMAScript grammars <regex>: Correctly reset capture groups for ECMAScript grammar May 1, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels May 1, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 1, 2025
@StephanTLavavej
Copy link
Member

Thanks, this is great! 😻 I pushed a conflict-free merge with main and minor stylistic cleanups.

@StephanTLavavej StephanTLavavej removed their assignment May 8, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 8, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 9, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request May 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
Status: Merging
Development

Successfully merging this pull request may close these issues.

<regex>: Implementation divergence for capture group behavior
2 participants