Fix #269 and rewrite DFAMin #270

nicuveo · 2025-04-03T22:51:47Z

Overview

This PR fixes a long-standing bug in the minimizer. As not all states are in the queue Q (or W as it is known in the reference implementation of Hopcroft's algorithm), it is possible in some rare cases for some sets to not be properly subdivided. This can result in the wrong rule being selected at runtime.

This PR adds two new tests: one that demonstrates the issue with older versions of the minimizer (pre 4f0b51b) and one that demonstrates the issue with current versions of the minimizer.

Implementation details

The gist of this change is simple: we now initialize R to the empty set, and W (renamed from its previous nonstandard name of Q) to a set of subsets of states. I've additionally used this PR as an opportunity to do several small cleanups.
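For readers unfamiliar with the algorithm, here is a minimal sketch of Hopcroft-style partition refinement. It is not the code in src/DFAMin.hs: the `Reverse` type, the `refine` function and the example values are made up for illustration, and it uses the textbook formulation with a partition `p` and a worklist `w` rather than the R/W pair described above. What it shares with the fix is that the worklist starts out containing every initial block, so no block is skipped as a potential splitter.

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet
import qualified Data.Set as Set
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)
import Data.List (foldl')
import Data.Set (Set)

-- | Hypothetical toy representation: symbol -> target state -> source states.
type Reverse = IntMap (IntMap IntSet)

-- | Refine an initial partition until no block can be split any further.
-- The worklist is seeded with *all* initial blocks.
refine :: Reverse -> [IntSet] -> Set IntSet
refine rev initial = go start start
  where
    start = Set.fromList (filter (not . IntSet.null) initial)

    go p w = case Set.minView w of
      Nothing      -> p
      Just (a, w') -> uncurry go (foldl' splitAll (p, w') (preimages a))

    -- One candidate splitter per symbol: every state with a transition into a.
    preimages a =
      [ x
      | byTarget <- IntMap.elems rev
      , let x = IntSet.unions
                  [ srcs | (tgt, srcs) <- IntMap.toList byTarget, tgt `IntSet.member` a ]
      , not (IntSet.null x)
      ]

    -- Try to split every block of the current partition against x.
    splitAll (p, w) x = foldl' (splitOne x) (p, w) (Set.toList p)

    splitOne x (p, w) y
      | IntSet.null inX || IntSet.null outX = (p, w)
      -- y is still waiting in the worklist: both halves must wait instead.
      | y `Set.member` w = (p', Set.insert inX (Set.insert outX (Set.delete y w)))
      -- y was already processed: queueing the smaller half is enough.
      | otherwise = (p', Set.insert smaller w)
      where
        inX     = IntSet.intersection y x
        outX    = IntSet.difference y x
        p'      = Set.insert inX (Set.insert outX (Set.delete y p))
        smaller = if IntSet.size inX <= IntSet.size outX then inX else outX

-- | Toy DFA over one symbol (97, i.e. 'a'): 0 -> 2, 1 -> 2, 2 -> 2; state 2 accepts.
exampleReverse :: Reverse
exampleReverse =
  IntMap.fromList [(97, IntMap.fromList [(2, IntSet.fromList [0, 1, 2])])]

-- | States 0 and 1 behave identically, so they stay in one class.
exampleClasses :: Set IntSet
exampleClasses = refine exampleReverse [IntSet.fromList [2], IntSet.fromList [0, 1]]
```

In this toy example `exampleClasses` contains the two blocks {0,1} and {2}. Seeding the worklist with every block is the simplest choice that is always safe; a production implementation may use a tighter initial worklist.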

First, this PR rewrites the Note that describes the algorithm. It properly introduces some notation that was left unexplained, fixes the error regarding the initial sets, and fixes some typos.

As far as groupEquivalentStates is concerned, beyond the aforementioned core change, this PR extracts several helper functions to the top level, renames them, and adds some basic documentation. For instance, what was bigmap is now created by a top-level function called generateReverseTransitionCache.
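To illustrate what a reverse transition cache is, here is a small sketch under assumed types. The real generateReverseTransitionCache works on alex's own DFA representation, which is not shown in this thread; the `Transitions` alias, the `reverseTransitions` name and the example below are hypothetical.

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)

-- | Hypothetical toy DFA: state -> (input symbol -> target state).
type Transitions = IntMap (IntMap Int)

-- | Invert the transition table once, up front:
-- symbol -> (target state -> set of source states).
reverseTransitions :: Transitions -> IntMap (IntMap IntSet)
reverseTransitions trans =
  IntMap.fromListWith (IntMap.unionWith IntSet.union)
    [ (sym, IntMap.singleton tgt (IntSet.singleton src))
    | (src, outs) <- IntMap.toList trans
    , (sym, tgt)  <- IntMap.toList outs
    ]

-- | Three states that all move to state 2 on symbol 97 ('a').
exampleTransitions :: Transitions
exampleTransitions =
  IntMap.fromList [ (s, IntMap.fromList [(97, 2)]) | s <- [0, 1, 2] ]
```

With such a cache, "which states reach set A on symbol c" becomes a lookup followed by a union, instead of a scan over every transition on each iteration of the refinement loop.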

Beyond this, it also makes use of list functions to reduce the amount of manual list construction / deconstruction, and makes some minor changes to minimizeDFA for readability, and to avoid needing to silence shadowing warnings.

Open questions

Does this PR need a Changelog entry, or are those only for releases?
Why were some boot files listed in the Cabal file? (I had to remove them to compile and run tests.)

Bodigrim · 2025-04-03T22:58:23Z

Why were some boot files listed in the Cabal file? (I had to remove them to compile and run tests.)

They are populated by alex/Makefile, lines 23 to 24 in eb6c78d:

    mv src/Parser.y src/Parser.y.boot
    mv src/Scan.x src/Scan.x.boot

andreasabel · 2025-04-04T19:34:12Z

Excellent work, @nicuveo !
So you fixed a bug that existed likely in all versions of alex 3 so far! (We'll have to deprecate them all.)

I took some hours working through the new code for DFAMin, and I added some commits that resulted from my review:

- lots of comments
- use aliases OldSNum and NewSNum for state numbers rather than Int where possible
- optimize the numbers computation by observing that a start state can be only in one equivalence class
- optimize the computation of the transition tables of the new states by observing that a single old state out of an equivalence class already contains all the information we need for the new state (I mean, after all, all old states in a class are equivalent)
- replace restrictKeys by a plain filter

The last point is something we should/could discuss. My intuition is that restrictKeys has some overhead since it needs to return a balanced IntMap---only to have it crushed by the fold. So aren't we better off by turning it into a list, filtering it, and then fold together the remaining elements?
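The two alternatives being compared can be sketched side by side as follows. The function names and the map layout (target state to set of source states) are assumptions made for this illustration, not the code from the commit. Both variants compute the same set; which one is faster would need a benchmark.

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)

-- | Variant 1: build an intermediate IntMap restricted to the keys in a,
-- then union its values.
sourcesViaRestrictKeys :: IntSet -> IntMap IntSet -> IntSet
sourcesViaRestrictKeys a m =
  IntSet.unions (IntMap.elems (IntMap.restrictKeys m a))

-- | Variant 2: turn the map into a list, keep the entries whose key is in a,
-- and union what remains. No intermediate IntMap is built.
sourcesViaFilter :: IntSet -> IntMap IntSet -> IntSet
sourcesViaFilter a m =
  IntSet.unions
    (map snd (filter (\(tgt, _) -> tgt `IntSet.member` a) (IntMap.toList m)))

-- | Hypothetical sample data: target state -> source states.
samplePrevious :: IntMap IntSet
samplePrevious = IntMap.fromList
  [ (1, IntSet.fromList [0])
  , (2, IntSet.fromList [1, 4])
  , (3, IntSet.fromList [2, 3])
  ]
```

For example, with `a = IntSet.fromList [2, 3]` both functions return the set {1, 2, 3, 4} on `samplePrevious`.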

If you have any comments on my other changes, let me know as well.

My plan is to squash all commits into one.

P.S.: In general, I wonder whether the testsuite of alex is not too skimpy. After all, it did not discover the bug you found.

Bodigrim · 2025-04-04T20:27:44Z

To fix CI failures try bumping v2 to v3 in alex/.github/workflows/emulated.yml, line 28 in eb6c78d:

    - uses: uraimo/run-on-arch-action@v2

nicuveo · 2025-04-04T20:58:26Z

Thank you @andreasabel!

Regarding your changes to minimizeDFA: looks good to me! I had purposefully not made any deep change to it to avoid scope creep (beyond fixing obvious surface-level stuff), with the idea of doing further cleanups in a subsequent PR: separating the actual bug fix from the more cosmetic changes. I don't know what the policy / habit is for this project, though? But beyond that, yeah, again, your changes look sensible to me!

Regarding restrictKeys: yeah, i think your version is better! And it allows us to get rid of CPP, which is great. I'd prefer a do block however, like so; what do you think?

let x = IntSet.unions $ do
      (target, sources) <- IntMap.toList allPreviousStates
      guard $ target `IntSet.member` a
      pure sources

(An extremely pedantic nitpick, please feel free to ignore: 4522b8d introduces a nested where, and IME those are better avoided, do you think we could redo error handling there to avoid that?)

Regarding the test suite: i agree. A big challenge i faced with this work was managing to find a broken example expressed by rules, rather than as a quickcheck property / unit test that tests the invariants of groupEquivalentStates. Having a haskell test suite to validate some internals might help.

In general, i think this project could use a bit of TLC: it really lacks standardization, the naming conventions and comment styles are a bit all over the place... it looks, i'm sorry to say, like the project has been a bit neglected. I'd be happy to spend some time making cleanup / harmonization / modernization PRs; is that something that'd be welcome?

Thanks!

nicuveo · 2025-04-04T21:00:39Z

To fix CI failures try bumping v2 to v3 [...]

Ah, thanks! I've noticed the failure seems random, the emulated check has passed at least once... I'll open a separate PR to avoid crowding this one with an unrelated CI change.

nicuveo · 2025-04-04T22:03:38Z

I've pushed the changes to restrictKeys, and a small tentative change: i rewrote the processing of one X set to use foldl' instead of concat, concatMap, unzip, and ++. The hope is that, by only constructing the resulting list with :, we avoid paying a performance cost. I haven't benchmarked this however, and i am not sure how much list fusion would impact the original code.
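As an illustration of the idea (not the actual commit), here is a sketch that splits every block of a list against one set X in two ways: once with map, unzip and concat, and once with a single foldl' that only ever conses onto its accumulators. All names are hypothetical, and the rule "queue the smaller half" is an assumption borrowed from the textbook algorithm.

```haskell
import qualified Data.IntSet as IntSet
import Data.IntSet (IntSet)
import Data.List (foldl', sort)

-- | Split one block y against x: the blocks to keep, and the blocks to queue.
splitOne :: IntSet -> IntSet -> ([IntSet], [IntSet])
splitOne x y
  | IntSet.null inX || IntSet.null outX = ([y], [])
  | otherwise                           = ([inX, outX], [smaller])
  where
    inX     = IntSet.intersection y x
    outX    = IntSet.difference y x
    smaller = if IntSet.size inX <= IntSet.size outX then inX else outX

-- | Version built from map, unzip and concat: many small intermediate lists.
viaConcat :: IntSet -> [IntSet] -> ([IntSet], [IntSet])
viaConcat x ys = (concat kept, concat queued)
  where
    (kept, queued) = unzip (map (splitOne x) ys)

-- | Version built from a single left fold that only uses (:).
viaFold :: IntSet -> [IntSet] -> ([IntSet], [IntSet])
viaFold x = foldl' step ([], [])
  where
    step (kept, queued) y = case splitOne x y of
      ([y'], _)         -> (y' : kept, queued)
      (halves, smaller) -> (foldr (:) kept halves, foldr (:) queued smaller)

-- | The two versions agree up to the order of the resulting lists.
agree :: IntSet -> [IntSet] -> Bool
agree x ys = normalise (viaConcat x ys) == normalise (viaFold x ys)
  where
    normalise (kept, queued) = (sort kept, sort queued)
```

Note that the fold reverses the order of the blocks, which is harmless when the lists are treated as sets. Whether it is measurably faster depends on how well GHC fuses the original pipeline, which, as said above, has not been benchmarked.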

andreasabel · 2025-04-05T07:28:55Z

Regarding your changes to minimizeDFA: looks good to me! I had purposefully not made any deep change to it to avoid scope creep (beyond fixing obvious surface-level stuff), with the idea of doing further cleanups in a subsequent PR: separating the actual bug fix from the more cosmetic changes. I don't know what the policy / habit is for this project, though? But beyond that, yeah, again, your changes look sensible to me!

In general I try to keep bug fixes separate from refactorings, but in this case most of the lines of DFAMin had already been modified, so I allowed myself to go all-in and freely rewrite some code that asked for improvement.

Regarding restrictKeys: yeah, i think your version is better! And it allows us to get rid of CPP, which is great. I'd prefer a do block however, like so; what do you think?
let x = IntSet.unions $ do
      (target, sources) <- IntMap.toList allPreviousStates
      guard $ target `IntSet.member` a
      pure sources

Yes, sure!

(An extremely pedantic nitpick, please feel free to ignore: 4522b8d introduces a nested where, and IME those are better avoided, do you think we could redo error handling there to avoid that?)

I agree with the general style of not nesting where deeply. Rather, have top-level functions that can independently be given clear semantics!
In the case of a panic string, however, I would rather see this as a distraction. Unless the code is incorrect (fingers crossed), these panics are dead code anyway, so hiding them in some local where seems appropriate.

Regarding the test suite: i agree. A big challenge i faced with this work was managing to find a broken example expressed by rules, rather than as a quickcheck property / unit test that tests the invariants of groupEquivalentStates. Having a haskell test suite to validate some internals might help.

In general, i think this project could use a bit of TLC: it really lacks standardization, the naming conventions and comment styles are a bit all over the place... it looks, i'm sorry to say, like the project has been a bit neglected. I'd be happy to spend some time making cleanup / harmonization / modernization PRs; is that something that'd be welcome?

Refactorings and clean-ups are a bit of a two-edged sword. On the one hand, they improve the quality of the code; on the other hand, there is the risk of introducing regressions. After all, the current code has passed the "test of time" (which is some correctness guarantee, yet not enough, as the present bug shows).
The risk of introducing regressions can however be substantially reduced by a large test suite. So, I think work should be done in this order:

1. Enhance the testsuite to a level that gives us confidence to risk refactorings/clean ups.
2. Do refactorings/clean ups.

(The risk of regressions is real; e.g. see the "Hall of fame" for the make-over of the sister project happy: https://github.com/haskell/happy/issues?q=is%3Aissue%20state%3Aclosed%20label%3A%22regression%20in%20happy-2.*%22)

andreasabel · 2025-04-05T07:29:33Z

I'd be happy to merge the PR in the current state, if you agree @nicuveo .

nicuveo · 2025-04-05T10:33:26Z

Refactorings and clean-ups are a bit of a two-edged sword. On the one hand, they improve the quality of the code; on the other hand, there is the risk of introducing regressions. After all, the current code has passed the "test of time" (which is some correctness guarantee, yet not enough, as the present bug shows). The risk of introducing regressions can however be substantially reduced by a large test suite. So, I think work should be done in this order:

1. Enhance the testsuite to a level that gives us confidence to risk refactorings/clean ups.
2. Do refactorings/clean ups.

Sounds good! What's the proper way to get involved in such future plans? I have a bit of bandwidth and would be happy to help.

nicuveo · 2025-04-05T10:35:52Z

I'd be happy to merge the PR in the current state, if you agree @nicuveo .

I rebased this PR to include #271, but sadly it didn't seem to reliably fix the transient broken test. But if that's not a blocker, yep, i'm okay with merging this!

nicuveo · 2025-04-05T10:36:45Z

Oh wait no the new test failure is something we introduced. I'll fix that ASAP.

nicuveo mentioned this pull request Apr 3, 2025

Alex selects the wrong rule. #269

Closed

Bodigrim reviewed Apr 3, 2025

src/DFAMin.hs (outdated, resolved)

Bodigrim reviewed Apr 3, 2025

src/DFAMin.hs (outdated, resolved)

andreasabel added the "pr: squash" label (PR should be squashed upon merge) Apr 4, 2025

andreasabel self-assigned this Apr 4, 2025

andreasabel added this to the 3.5.3.0 milestone Apr 4, 2025

nicuveo force-pushed the issue269 branch from 5ab6ca2 to b7daf05 April 4, 2025 21:49

nicuveo mentioned this pull request Apr 4, 2025

Fix transient CI failures #271

Merged

andreasabel approved these changes Apr 5, 2025

nicuveo added 13 commits April 5, 2025 11:31

add test that demonstrates the issue

5d1ff9d

remove incorrect extra-source-files entries

ad1dd62

rewrite DFAMin

4db93cd

add test for older versions of the code

bfba770

improve comments

aabfad2

fix call of restrictKeys for older GHC versions

529e2c6

Revert "remove incorrect extra-source-files entries"

e821227

This reverts commit e0126a8.

rename tests

d8b594b

add new tests to extra-source-files

ce8fb61

make use of fold

60e796e

revert formatter changes

ca3d85e

improve comments, fix typos

c3877e9

fix typos in tests

4e38c0a

andreasabel and others added 6 commits April 5, 2025 11:31

Optimize 'number' by removing already assigned start states

baf71bd

A start state cannot belong to several equivalence classes (they are disjoint), so we can remove already assigned start states from the list of start states we want to assign.

Optimize the construction of new states from old states

afe58f5

Since all old states in an equivalence class are equivalent, we only need the data from one of them to construct the new state.

Cosmetics: use OldSNum / NewSNum instead of Int; etc.

e090a9a

Suggestion: Use a plain filter instead of restrictKeys

e88bd3c

remove restrictKeys in favour of filter

e7090e2

use folds to avoid concat

240397c
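The two optimization commits in this list (baf71bd and afe58f5) can be illustrated with a small sketch. It reuses the alias names OldSNum and NewSNum mentioned in the review, but everything else (function names, the toy transition table, the representation of classes as numbered IntSets) is an assumption made for the example and not the code from those commits.

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)
import Data.List (foldl', partition)

type OldSNum = Int
type NewSNum = Int

-- | Map each start state to the number of its equivalence class.
-- Classes are disjoint, so a start state found in one class is dropped
-- from the list that is checked against the remaining classes.
renumberStarts :: [OldSNum] -> [(NewSNum, IntSet)] -> IntMap NewSNum
renumberStarts starts classes = fst (foldl' step (IntMap.empty, starts) classes)
  where
    step (assigned, remaining) (new, members) =
      let (here, rest) = partition (`IntSet.member` members) remaining
      in  (foldl' (\m old -> IntMap.insert old new m) assigned here, rest)

-- | Build the transitions of each new state from one representative of its
-- class: all members are equivalent, so any one of them carries the data.
-- Assumes every transition target belongs to some class.
newTransitions
  :: IntMap (IntMap OldSNum) -> [(NewSNum, IntSet)] -> IntMap (IntMap NewSNum)
newTransitions old classes = IntMap.fromList
  [ (new, IntMap.map (oldToNew IntMap.!) outs)
  | (new, members) <- classes
  , Just (rep, _) <- [IntSet.minView members]  -- skip empty classes
  , let outs = IntMap.findWithDefault IntMap.empty rep old
  ]
  where
    oldToNew = IntMap.fromList
      [ (o, n) | (n, members) <- classes, o <- IntSet.toList members ]

-- | Toy input: states 0, 1 and 2 all move to state 2 on symbol 97 ('a').
exampleOld :: IntMap (IntMap OldSNum)
exampleOld = IntMap.fromList [ (s, IntMap.fromList [(97, 2)]) | s <- [0, 1, 2] ]

-- | Classes found by the minimizer for the toy input: {0,1} and {2}.
exampleClassList :: [(NewSNum, IntSet)]
exampleClassList = [(0, IntSet.fromList [0, 1]), (1, IntSet.fromList [2])]
```

On the toy input, `renumberStarts [0] exampleClassList` maps old state 0 to new state 0, and `newTransitions exampleOld exampleClassList` gives both new states a single transition on 97 to new state 1.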

nicuveo force-pushed the issue269 branch from 5ba5a09 to 240397c April 5, 2025 10:31

add -XDeriveFunctor to emulated workflow

d2b6a9e

andreasabel approved these changes Apr 5, 2025

andreasabel merged commit aa6c57b into haskell:master Apr 5, 2025
19 checks passed

nicuveo deleted the issue269 branch April 5, 2025 11:52

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Apr 30, 2025

devel/alex: update to alex-3.5.3.0

fd9c291

## Changes in 3.5.3.0

* Fix critical bug in automaton minimizer ([PR #270](haskell/alex#270)), thanks Antoine Leblanc!
* Tested with GHC 8.0 - 9.12.2.
