Fix anchor bugs in pythonRegexToCore; add differential test suite by shigoel · Pull Request #517 · strata-org/Strata

shigoel · 2026-03-05T01:50:08Z

Summary

Differential testing framework (DiffTestCore.lean, diff_test.py): pipes (regex, string, mode) triples through Strata's SMT backend and compares results against Python's re module; classifies outcomes as agree / bug / known_gap / investigate
Fixes in ReToCore.lean:
- Remove partial
- Fix complement translation (intersect with re.allchar() to match [^...] semantics)
- Fix anchor-interaction bugs in all four concat cases (true,false / false,true / false,false) — anchors were incorrectly forwarded across potentially-empty siblings
- Fix the same anchor bug in star | false => and loop(0,m) | false => branches
- Refactor star and loop(0,m) to unify the alwaysConsume and anchor-split paths under a single condition
Test corpus (diff_test.py): Cases covering literals, character classes, quantifiers, alternation, all anchor/mode combinations, and unimplemented features; every entry annotated with its Python ground-truth answer; corpus organized into labeled sections
CI: add Run regex differential tests step to build_and_test_lean job

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Adds DiffTestCore (a lake exe) that reads (regex, string, mode) triples from stdin, converts each regex via pythonRegexToCore, constructs a Core program asserting str.in.re, and runs the SMT verifier. Results are printed as tab-separated lines for comparison. Adds diff_test.py, a Python driver with a 93-case corpus covering literals, anchors, quantifiers, character classes, alternation, groups, anchor-in-group interactions, search/match/fullmatch modes, invalid regexes, and unimplemented features (lookahead/lookbehind). The script runs both Python's re module and DiffTestCore in one subprocess call, classifies each pair as agree/bug/known_gap/investigate, and exits non-zero if any bugs are found. Also fixes ASTToCST → ASTtoCST import casing in ReToCoreTests.lean to prevent macOS case-insensitive filesystem from corrupting the cached .c.o.export symbol and breaking the DiffTestCore linker step. Initial run: 85 agree, 4 known gaps (lookahead/lookbehind), 4 bugs found (anchor semantics in match/search modes and [^A-Z]+ complement). By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- DiffTestCore: add optional --log-dir <path> flag; when set, each generated .core.st program is written to <path>/<n>_<mode>.core.st with a header comment showing the original regex, string, and mode - diff_test.py: wire through --log-dir to the DiffTestCore subprocess - ASTtoCST: fix typo "Str.ToRegeEx" → "Str.ToRegEx" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two bugs fixed in ReToCore.lean: 1. Negated character classes ([^...]) were encoded as re.comp(X), which is complement over all strings in SMT-LIB. This incorrectly matched multi-character strings (e.g. [^A-Z]+ matched "Hello" because "Hello" is not a single A-Z char). Fix: use re.inter(re.allchar(), re.comp(X)) to restrict to single characters. 2. In match/search modes, ^ and $ anchors were being swallowed by the prepended/appended dotStar. The old approach wrapped the AST with .* then called toCore once with (atStart=true, atEnd=true); since dotStar can match empty, anchors saw atEnd/atStart=true and became emptyRegex, but dotStar could still consume extra content. Fix: compute toCore for each relevant (atStart, atEnd) combination and union the results — anchor-bearing variants where the flag is false yield re.none() and vanish from the union naturally. Outcome: diff test now shows 89 agree, 0 bugs, 4 known gaps (lookahead/ lookbehind only). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…st.py Extends the differential test corpus from 199 to 255 cases: - 56 new known-gap tests covering non-greedy quantifiers (*?, +?, ??), shorthand classes (\d, \w, \s and their complements), word boundaries (\b, \B), absolute anchors (\A, \Z), non-capturing groups (?:...), named groups (?P<name>...)), inline flags (?i, ?m, ?s), and backreferences (\1). Adds runtime 1-based case numbering ([#N]) to bug/gap/investigate report lines, matching the filenames written by --log-dir so a reported case can be located immediately as <N>_<mode>.core.st. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Patterns like `a$.*` (fullmatch/match/search) and `.*^a` (match) were incorrectly reported as matching by Strata when Python says no match. Root cause: in three of the four alwaysConsume-based concat dispatch branches, the atStart/atEnd flags were passed naively when one side could match the empty string, allowing anchors to fire at wrong positions. Fix: for each affected branch, when the relevant anchor is present and the potentially-empty side has non-anchor content, split into: Case A — empty side matches "", anchor fires at its correct position Case B — empty side matches non-empty, anchor must not fire Branches fixed: | true, false => $ in r1 fires beyond true end when r2 matches non-empty | false, true => ^ in r2 fires after r1 consumed content | false, false => ^ in r2 fires after r1 consumed content (e.g. .*^a parses as concat(concat(.*,^),a), hitting this branch) Also adds regression test cases to diff_test.py and removes debug #eval calls from ReToCoreTests.lean. Result: 214 agree, 0 bugs, 60 known gaps (was 6 bugs before this fix). By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Extends the diff_test.py corpus with 27 new cases targeting the three concat branches fixed in the previous commit: | true, false => multi-char r1 (ab$.*), range r1 ([a-z]$.*), grouped r1 ((ab)$.*), including search mode | false, true => a?(^b) — optional r1 with always-consuming r2 that contains ^; exercises the split-case fix where r1 being non-empty must suppress ^ in r2 | false, false => a?^b — optional r1 / anchor r2 (both non-consuming); .*^a in search mode (^ only fires at position 0) All 27 new cases agree between Python and Strata (241 agree total, 0 bugs). By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…ranch tests The | false => branches of .star and .loop(0,m) used re.*(toCore r1 atStart atEnd) and re.loop(r1b, 0, m) respectively, passing both anchor flags to the inner regex for every iteration. When r1 contains ^ or $ and can also match the empty string, this is wrong: anchors fire on all iterations instead of only the first or last. Example: (^a?)* fullmatch "aa" — translated to re.*(union("","a")), which matches "aa", but Python says no match (^ fails at position 1 for the 2nd iteration). Fix: when r1.containsAnchorStart || r1.containsAnchorEnd, apply the same first/middle/last split already used by the | true => branches: first iteration: toCore r1 atStart false middle iterations: toCore r1 false false (anchors become unmatchable) last iteration: toCore r1 false atEnd This limits the anchor to firing only on the first (for ^) or last (for $) non-empty iteration. Middle iterations correctly become unmatchable, capping the star at one useful iteration for anchored non-consuming patterns. Also adds 42 new test cases exercising all alwaysConsume branches of .star and .loop with anchor-containing inner regexes (| true => and | false =>) as well as the n≥1 loop branch that recurses via .concat. Result: 280 agree, 0 bugs, 60 known gaps. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Both the alwaysConsume=true branch and the alwaysConsume=false+anchors branch used identical first/middle/last split code. Replace the match/if nesting with a single condition: alwaysConsume r1 || r1.containsAnchorStart || r1.containsAnchorEnd When true, use the split; otherwise use the simple re.*/re.loop form. No behaviour change. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Reorganize CORPUS into clearer sections with a ground-truth answer comment on every entry; fix one incorrect comment for (a$)* in match mode - Add 7 new test cases for a?^b? to cover the concat:ff:split branch - Add regex differential test step to CI (build_and_test_lean job) - Add stdin usage example to DiffTestCore.lean comment By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

MikaelMayer

This PR delivers a comprehensive fix for anchor handling bugs in Python regex translation and adds a robust differential testing framework. The implementation is well-documented with clear explanations of the anchor semantics and case-splitting logic. The test corpus is thorough with 347 cases covering edge cases systematically. The typo fix, complement fix (intersecting with re.allchar()), and all anchor-splitting logic are sound. CI passes on all checks.

One improvement opportunity: The false,true and false,false concat cases have nearly identical anchor-splitting logic that could be extracted into a helper function to reduce duplication.

Strata/Languages/Python/Regex/ReToCore.lean

StrataTest/Languages/Python/Regex/diff_test.py

Strata/Languages/Python/Regex/ReToCore.lean

StrataTest/Languages/Python/Regex/diff_test.py

Strata/Languages/Python/Regex/ReToCore.lean

StrataTest/Languages/Python/Regex/diff_test.py

Co-authored-by: Joe Hendrix <joehx@amazon.com>

…elCase When both sides of a concat may be empty and r1 contains $ and r2 has non-anchor content, forwarding atEnd to r1 is wrong if r2 matches non-empty. Add the symmetric split (mirroring the true,false case): Case 1 (r2="") lets $ in r1 fire; Case 2 (r2 non-empty) blocks it. Also renames snake_case locals (r1b_start, r1b_mid, etc.) to camelCase (r1bStart, r1bMid, etc.) throughout RegexAST.toCore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the three hand-rolled tree walks (containsAnchorEnd, containsAnchorStart, hasNonAnchorContent) with a private anyNode fold. Each predicate now explicitly handles .complement to document that its child is a character-set description that cannot contain positional nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

shigoel · 2026-03-06T15:25:22Z

FYI: I'm going to refactor the code in ReToCore since it's getting unwieldy. Too many case splits...

If the subprocess fails totally or partially, surface it as a hard error (exit 2) rather than silently classifying missing results as 'investigate'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove "BUG: Strata says match" annotations from a?(^b) search tests (the bug is now fixed) - Add ^?(^b) search tests: exercises the false,true simple path where r1 has no non-anchor content; the atStart fix matters here too - Add a?$b? match and search tests: exercises the false,false $-split under all atStart/atEnd combinations from pythonRegexToCore - Add (a?$)b? fullmatch tests: exercises the $-split at the outer concat level (vs. a?$b? which fires it at the inner level) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Tier 1: Add private abbrev builders (mkReFromStr, mkReRange, mkReAllChar, mkReComp, mkReUnion, mkReConcat, mkReInter, mkReStar, mkReLoop) to replace repetitive mkApp () (.op () ...Func.name none) [...] patterns throughout toCore and pythonRegexToCore. Tier 2: Extract three structural helpers: - toCoreStarBody: shared body for .star and .loop {0,m}, parameterized by `simple` and `center` wrappers (eliminates ~20-line duplication) - endSplit: $-in-r1 concat split pattern - startSplit: ^-in-r2 concat split pattern toCore shrinks from ~115 to ~50 lines with no semantic change. All 497 differential tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

FedericoAureliano · 2026-03-06T19:40:38Z

This is great! I really like the encoding.

One nitpick: desugaring r+ to rr* in RegexAST.toCore causes an exponential blowup when you have nested + constructors. Do we really need to do that? The core language includes a + operator, why not just encode it directly?

One simple test case to add that shows this off would be #eval Std.format $ pythonRegexToCore "((((((a+)+)+)+)+)+)+" .fullmatch.

This is a nitpick because it is probably unlikely to come up in practice very frequently.

shigoel · 2026-03-06T20:20:25Z

This is great! I really like the encoding.

One nitpick: desugaring r+ to rr* in RegexAST.toCore causes an exponential blowup when you have nested + constructors. Do we really need to do that? The core language includes a + operator, why not just encode it directly?

One simple test case to add that shows this off would be #eval Std.format $ pythonRegexToCore "((((((a+)+)+)+)+)+)+" .fullmatch.

This is a nitpick because it is probably unlikely to come up in practice very frequently.

The reason we do this is anchors, yet again. But you're right: I can add a fast path that uses re.+ when the regex doesn't have an anchor.

- Inline .optional and loop{0,1} (structural in r1) - Extract toCoreConcat abbrev (shared by .concat and .plus) - Extract toCoreLoop def (recurses on n, structural in r1) - Remove partial: all recursive calls are now structural - Add anchor-free fast paths: .plus emits re.+ directly when r1 has no anchors; loop{n+1,m} emits re.loop directly likewise Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace 20 near-identical deep-nesting cases for ((((((a+)+)+)+)+)+)+ with 4 representative depth cases plus 28 cases covering structurally distinct patterns: (a+)*, (a*)+ , (a+b+)+, and (a+|b+)+. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

joehendrix

Thanks for fixing the issues found previously.

shigoel and others added 13 commits March 3, 2026 19:06

comment fixes

d700b8d

more minor comment fixes

fb4f027

Add more tests

9a67364

Merge remote-tracking branch 'origin/main' into shilpi/regex-fix

0e2a1cf

shigoel marked this pull request as ready for review March 5, 2026 01:51

shigoel requested a review from a team March 5, 2026 01:51

shigoel requested review from fabiomadge and keyboardDrummer as code owners March 5, 2026 01:51

shigoel enabled auto-merge March 5, 2026 01:53

shigoel requested a review from FedericoAureliano March 5, 2026 02:47

shigoel added 2 commits March 5, 2026 11:04

Merge branch 'main' into shilpi/regex-fix

1f02798

Merge branch 'main' into shilpi/regex-fix

557e2e4

MikaelMayer reviewed Mar 5, 2026

View reviewed changes

Strata/Languages/Python/Regex/ReToCore.lean Outdated Show resolved Hide resolved

shigoel added 3 commits March 5, 2026 14:58

Merge branch 'main' into shilpi/regex-fix

3359e50

Merge branch 'main' into shilpi/regex-fix

7e1ab62

Merge branch 'main' into shilpi/regex-fix

8698b8c