Fix anchor bugs in pythonRegexToCore; add differential test suite#517
Fix anchor bugs in pythonRegexToCore; add differential test suite#517
Conversation
Adds DiffTestCore (a lake exe) that reads (regex, string, mode) triples from stdin, converts each regex via pythonRegexToCore, constructs a Core program asserting str.in.re, and runs the SMT verifier. Results are printed as tab-separated lines for comparison. Adds diff_test.py, a Python driver with a 93-case corpus covering literals, anchors, quantifiers, character classes, alternation, groups, anchor-in-group interactions, search/match/fullmatch modes, invalid regexes, and unimplemented features (lookahead/lookbehind). The script runs both Python's re module and DiffTestCore in one subprocess call, classifies each pair as agree/bug/known_gap/investigate, and exits non-zero if any bugs are found. Also fixes ASTToCST → ASTtoCST import casing in ReToCoreTests.lean to prevent macOS case-insensitive filesystem from corrupting the cached .c.o.export symbol and breaking the DiffTestCore linker step. Initial run: 85 agree, 4 known gaps (lookahead/lookbehind), 4 bugs found (anchor semantics in match/search modes and [^A-Z]+ complement). By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DiffTestCore: add optional --log-dir <path> flag; when set, each generated .core.st program is written to <path>/<n>_<mode>.core.st with a header comment showing the original regex, string, and mode - diff_test.py: wire through --log-dir to the DiffTestCore subprocess - ASTtoCST: fix typo "Str.ToRegeEx" → "Str.ToRegEx" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs fixed in ReToCore.lean: 1. Negated character classes ([^...]) were encoded as re.comp(X), which is complement over all strings in SMT-LIB. This incorrectly matched multi-character strings (e.g. [^A-Z]+ matched "Hello" because "Hello" is not a single A-Z char). Fix: use re.inter(re.allchar(), re.comp(X)) to restrict to single characters. 2. In match/search modes, ^ and $ anchors were being swallowed by the prepended/appended dotStar. The old approach wrapped the AST with .* then called toCore once with (atStart=true, atEnd=true); since dotStar can match empty, anchors saw atEnd/atStart=true and became emptyRegex, but dotStar could still consume extra content. Fix: compute toCore for each relevant (atStart, atEnd) combination and union the results — anchor-bearing variants where the flag is false yield re.none() and vanish from the union naturally. Outcome: diff test now shows 89 agree, 0 bugs, 4 known gaps (lookahead/ lookbehind only). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…st.py Extends the differential test corpus from 199 to 255 cases: - 56 new known-gap tests covering non-greedy quantifiers (*?, +?, ??), shorthand classes (\d, \w, \s and their complements), word boundaries (\b, \B), absolute anchors (\A, \Z), non-capturing groups (?:...), named groups (?P<name>...)), inline flags (?i, ?m, ?s), and backreferences (\1). Adds runtime 1-based case numbering ([#N]) to bug/gap/investigate report lines, matching the filenames written by --log-dir so a reported case can be located immediately as <N>_<mode>.core.st. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Patterns like `a$.*` (fullmatch/match/search) and `.*^a` (match) were
incorrectly reported as matching by Strata when Python says no match.
Root cause: in three of the four alwaysConsume-based concat dispatch
branches, the atStart/atEnd flags were passed naively when one side
could match the empty string, allowing anchors to fire at wrong positions.
Fix: for each affected branch, when the relevant anchor is present and the
potentially-empty side has non-anchor content, split into:
Case A — empty side matches "", anchor fires at its correct position
Case B — empty side matches non-empty, anchor must not fire
Branches fixed:
| true, false => $ in r1 fires beyond true end when r2 matches non-empty
| false, true => ^ in r2 fires after r1 consumed content
| false, false => ^ in r2 fires after r1 consumed content (e.g. .*^a
parses as concat(concat(.*,^),a), hitting this branch)
Also adds regression test cases to diff_test.py and removes debug #eval
calls from ReToCoreTests.lean.
Result: 214 agree, 0 bugs, 60 known gaps (was 6 bugs before this fix).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Extends the diff_test.py corpus with 27 new cases targeting the three
concat branches fixed in the previous commit:
| true, false => multi-char r1 (ab$.*), range r1 ([a-z]$.*),
grouped r1 ((ab)$.*), including search mode
| false, true => a?(^b) — optional r1 with always-consuming r2
that contains ^; exercises the split-case fix where
r1 being non-empty must suppress ^ in r2
| false, false => a?^b — optional r1 / anchor r2 (both non-consuming);
.*^a in search mode (^ only fires at position 0)
All 27 new cases agree between Python and Strata (241 agree total, 0 bugs).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
…ranch tests
The | false => branches of .star and .loop(0,m) used re.*(toCore r1 atStart
atEnd) and re.loop(r1b, 0, m) respectively, passing both anchor flags to the
inner regex for every iteration. When r1 contains ^ or $ and can also match
the empty string, this is wrong: anchors fire on all iterations instead of
only the first or last.
Example: (^a?)* fullmatch "aa" — translated to re.*(union("","a")), which
matches "aa", but Python says no match (^ fails at position 1 for the 2nd
iteration).
Fix: when r1.containsAnchorStart || r1.containsAnchorEnd, apply the same
first/middle/last split already used by the | true => branches:
first iteration: toCore r1 atStart false
middle iterations: toCore r1 false false (anchors become unmatchable)
last iteration: toCore r1 false atEnd
This limits the anchor to firing only on the first (for ^) or last (for $)
non-empty iteration. Middle iterations correctly become unmatchable, capping
the star at one useful iteration for anchored non-consuming patterns.
Also adds 42 new test cases exercising all alwaysConsume branches of .star
and .loop with anchor-containing inner regexes (| true => and | false =>)
as well as the n≥1 loop branch that recurses via .concat.
Result: 280 agree, 0 bugs, 60 known gaps.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Both the alwaysConsume=true branch and the alwaysConsume=false+anchors branch used identical first/middle/last split code. Replace the match/if nesting with a single condition: alwaysConsume r1 || r1.containsAnchorStart || r1.containsAnchorEnd When true, use the split; otherwise use the simple re.*/re.loop form. No behaviour change. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
- Reorganize CORPUS into clearer sections with a ground-truth answer comment on every entry; fix one incorrect comment for (a$)* in match mode - Add 7 new test cases for a?^b? to cover the concat:ff:split branch - Add regex differential test step to CI (build_and_test_lean job) - Add stdin usage example to DiffTestCore.lean comment By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MikaelMayer
left a comment
There was a problem hiding this comment.
This PR delivers a comprehensive fix for anchor handling bugs in Python regex translation and adds a robust differential testing framework. The implementation is well-documented with clear explanations of the anchor semantics and case-splitting logic. The test corpus is thorough with 347 cases covering edge cases systematically. The typo fix, complement fix (intersecting with re.allchar()), and all anchor-splitting logic are sound. CI passes on all checks.
One improvement opportunity: The false,true and false,false concat cases have nearly identical anchor-splitting logic that could be extracted into a helper function to reduce duplication.
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
…elCase When both sides of a concat may be empty and r1 contains $ and r2 has non-anchor content, forwarding atEnd to r1 is wrong if r2 matches non-empty. Add the symmetric split (mirroring the true,false case): Case 1 (r2="") lets $ in r1 fire; Case 2 (r2 non-empty) blocks it. Also renames snake_case locals (r1b_start, r1b_mid, etc.) to camelCase (r1bStart, r1bMid, etc.) throughout RegexAST.toCore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the three hand-rolled tree walks (containsAnchorEnd, containsAnchorStart, hasNonAnchorContent) with a private anyNode fold. Each predicate now explicitly handles .complement to document that its child is a character-set description that cannot contain positional nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
FYI: I'm going to refactor the code in ReToCore since it's getting unwieldy. Too many case splits... |
If the subprocess fails totally or partially, surface it as a hard error (exit 2) rather than silently classifying missing results as 'investigate'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove "BUG: Strata says match" annotations from a?(^b) search tests (the bug is now fixed) - Add ^?(^b) search tests: exercises the false,true simple path where r1 has no non-anchor content; the atStart fix matters here too - Add a?$b? match and search tests: exercises the false,false $-split under all atStart/atEnd combinations from pythonRegexToCore - Add (a?$)b? fullmatch tests: exercises the $-split at the outer concat level (vs. a?$b? which fires it at the inner level) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier 1: Add private abbrev builders (mkReFromStr, mkReRange, mkReAllChar,
mkReComp, mkReUnion, mkReConcat, mkReInter, mkReStar, mkReLoop) to replace
repetitive mkApp () (.op () ...Func.name none) [...] patterns throughout
toCore and pythonRegexToCore.
Tier 2: Extract three structural helpers:
- toCoreStarBody: shared body for .star and .loop {0,m}, parameterized by
`simple` and `center` wrappers (eliminates ~20-line duplication)
- endSplit: $-in-r1 concat split pattern
- startSplit: ^-in-r2 concat split pattern
toCore shrinks from ~115 to ~50 lines with no semantic change.
All 497 differential tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
This is great! I really like the encoding. One nitpick: desugaring One simple test case to add that shows this off would be This is a nitpick because it is probably unlikely to come up in practice very frequently. |
The reason we do this is anchors, yet again. But you're right: I can add a fast path that uses |
- Inline .optional and loop{0,1} (structural in r1)
- Extract toCoreConcat abbrev (shared by .concat and .plus)
- Extract toCoreLoop def (recurses on n, structural in r1)
- Remove partial: all recursive calls are now structural
- Add anchor-free fast paths: .plus emits re.+ directly when r1 has
no anchors; loop{n+1,m} emits re.loop directly likewise
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 20 near-identical deep-nesting cases for ((((((a+)+)+)+)+)+)+ with 4 representative depth cases plus 28 cases covering structurally distinct patterns: (a+)*, (a*)+ , (a+b+)+, and (a+|b+)+. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joehendrix
left a comment
There was a problem hiding this comment.
Thanks for fixing the issues found previously.
Summary
DiffTestCore.lean,diff_test.py): pipes(regex, string, mode)triples through Strata's SMT backend and compares results against Python'sremodule; classifies outcomes as agree / bug / known_gap / investigateReToCore.lean:partialre.allchar()to match[^...]semantics)concatcases (true,false/false,true/false,false) — anchors were incorrectly forwarded across potentially-empty siblingsstar | false =>andloop(0,m) | false =>branchesstarandloop(0,m)to unify thealwaysConsumeand anchor-split paths under a single conditiondiff_test.py): Cases covering literals, character classes, quantifiers, alternation, all anchor/mode combinations, and unimplemented features; every entry annotated with its Python ground-truth answer; corpus organized into labeled sectionsRun regex differential testsstep tobuild_and_test_leanjobBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.