Skip to content

Fix anchor bugs in pythonRegexToCore; add differential test suite#517

Merged
shigoel merged 36 commits intomainfrom
shilpi/regex-fix
Mar 9, 2026
Merged

Fix anchor bugs in pythonRegexToCore; add differential test suite#517
shigoel merged 36 commits intomainfrom
shilpi/regex-fix

Conversation

@shigoel
Copy link
Contributor

@shigoel shigoel commented Mar 5, 2026

Summary

  • Differential testing framework (DiffTestCore.lean, diff_test.py): pipes (regex, string, mode) triples through Strata's SMT backend and compares results against Python's re module; classifies outcomes as agree / bug / known_gap / investigate
  • Fixes in ReToCore.lean:
    • Remove partial
    • Fix complement translation (intersect with re.allchar() to match [^...] semantics)
    • Fix anchor-interaction bugs in all four concat cases (true,false / false,true / false,false) — anchors were incorrectly forwarded across potentially-empty siblings
    • Fix the same anchor bug in star | false => and loop(0,m) | false => branches
    • Refactor star and loop(0,m) to unify the alwaysConsume and anchor-split paths under a single condition
  • Test corpus (diff_test.py): Cases covering literals, character classes, quantifiers, alternation, all anchor/mode combinations, and unimplemented features; every entry annotated with its Python ground-truth answer; corpus organized into labeled sections
  • CI: add Run regex differential tests step to build_and_test_lean job

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

shigoel and others added 13 commits March 3, 2026 19:06
Adds DiffTestCore (a lake exe) that reads (regex, string, mode) triples
from stdin, converts each regex via pythonRegexToCore, constructs a
Core program asserting str.in.re, and runs the SMT verifier. Results
are printed as tab-separated lines for comparison.

Adds diff_test.py, a Python driver with a 93-case corpus covering
literals, anchors, quantifiers, character classes, alternation, groups,
anchor-in-group interactions, search/match/fullmatch modes, invalid
regexes, and unimplemented features (lookahead/lookbehind). The script
runs both Python's re module and DiffTestCore in one subprocess call,
classifies each pair as agree/bug/known_gap/investigate, and exits
non-zero if any bugs are found.

Also fixes ASTToCST → ASTtoCST import casing in ReToCoreTests.lean to
prevent macOS case-insensitive filesystem from corrupting the cached
.c.o.export symbol and breaking the DiffTestCore linker step.

Initial run: 85 agree, 4 known gaps (lookahead/lookbehind), 4 bugs
found (anchor semantics in match/search modes and [^A-Z]+ complement).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DiffTestCore: add optional --log-dir <path> flag; when set, each
  generated .core.st program is written to <path>/<n>_<mode>.core.st
  with a header comment showing the original regex, string, and mode
- diff_test.py: wire through --log-dir to the DiffTestCore subprocess
- ASTtoCST: fix typo "Str.ToRegeEx" → "Str.ToRegEx"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs fixed in ReToCore.lean:

1. Negated character classes ([^...]) were encoded as re.comp(X), which
   is complement over all strings in SMT-LIB. This incorrectly matched
   multi-character strings (e.g. [^A-Z]+ matched "Hello" because "Hello"
   is not a single A-Z char). Fix: use re.inter(re.allchar(), re.comp(X))
   to restrict to single characters.

2. In match/search modes, ^ and $ anchors were being swallowed by the
   prepended/appended dotStar. The old approach wrapped the AST with .*
   then called toCore once with (atStart=true, atEnd=true); since dotStar
   can match empty, anchors saw atEnd/atStart=true and became emptyRegex,
   but dotStar could still consume extra content. Fix: compute toCore for
   each relevant (atStart, atEnd) combination and union the results —
   anchor-bearing variants where the flag is false yield re.none() and
   vanish from the union naturally.

Outcome: diff test now shows 89 agree, 0 bugs, 4 known gaps (lookahead/
lookbehind only).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…st.py

Extends the differential test corpus from 199 to 255 cases:
- 56 new known-gap tests covering non-greedy quantifiers (*?, +?, ??),
  shorthand classes (\d, \w, \s and their complements), word boundaries
  (\b, \B), absolute anchors (\A, \Z), non-capturing groups (?:...),
  named groups (?P<name>...)), inline flags (?i, ?m, ?s), and
  backreferences (\1).

Adds runtime 1-based case numbering ([#N]) to bug/gap/investigate
report lines, matching the filenames written by --log-dir so a reported
case can be located immediately as <N>_<mode>.core.st.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Patterns like `a$.*` (fullmatch/match/search) and `.*^a` (match) were
incorrectly reported as matching by Strata when Python says no match.

Root cause: in three of the four alwaysConsume-based concat dispatch
branches, the atStart/atEnd flags were passed naively when one side
could match the empty string, allowing anchors to fire at wrong positions.

Fix: for each affected branch, when the relevant anchor is present and the
potentially-empty side has non-anchor content, split into:
  Case A — empty side matches "", anchor fires at its correct position
  Case B — empty side matches non-empty, anchor must not fire

Branches fixed:
  | true,  false => $ in r1 fires beyond true end when r2 matches non-empty
  | false, true  => ^ in r2 fires after r1 consumed content
  | false, false => ^ in r2 fires after r1 consumed content (e.g. .*^a
                    parses as concat(concat(.*,^),a), hitting this branch)

Also adds regression test cases to diff_test.py and removes debug #eval
calls from ReToCoreTests.lean.

Result: 214 agree, 0 bugs, 60 known gaps (was 6 bugs before this fix).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Extends the diff_test.py corpus with 27 new cases targeting the three
concat branches fixed in the previous commit:

  | true, false =>  multi-char r1 (ab$.*), range r1 ([a-z]$.*),
                    grouped r1 ((ab)$.*), including search mode

  | false, true =>  a?(^b) — optional r1 with always-consuming r2
                    that contains ^; exercises the split-case fix where
                    r1 being non-empty must suppress ^ in r2

  | false, false => a?^b — optional r1 / anchor r2 (both non-consuming);
                    .*^a in search mode (^ only fires at position 0)

All 27 new cases agree between Python and Strata (241 agree total, 0 bugs).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
…ranch tests

The | false => branches of .star and .loop(0,m) used re.*(toCore r1 atStart
atEnd) and re.loop(r1b, 0, m) respectively, passing both anchor flags to the
inner regex for every iteration. When r1 contains ^ or $ and can also match
the empty string, this is wrong: anchors fire on all iterations instead of
only the first or last.

Example: (^a?)* fullmatch "aa" — translated to re.*(union("","a")), which
matches "aa", but Python says no match (^ fails at position 1 for the 2nd
iteration).

Fix: when r1.containsAnchorStart || r1.containsAnchorEnd, apply the same
first/middle/last split already used by the | true => branches:
  first  iteration: toCore r1 atStart false
  middle iterations: toCore r1 false false  (anchors become unmatchable)
  last   iteration: toCore r1 false atEnd

This limits the anchor to firing only on the first (for ^) or last (for $)
non-empty iteration. Middle iterations correctly become unmatchable, capping
the star at one useful iteration for anchored non-consuming patterns.

Also adds 42 new test cases exercising all alwaysConsume branches of .star
and .loop with anchor-containing inner regexes (| true => and | false =>)
as well as the n≥1 loop branch that recurses via .concat.

Result: 280 agree, 0 bugs, 60 known gaps.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Both the alwaysConsume=true branch and the alwaysConsume=false+anchors
branch used identical first/middle/last split code. Replace the
match/if nesting with a single condition:

  alwaysConsume r1 || r1.containsAnchorStart || r1.containsAnchorEnd

When true, use the split; otherwise use the simple re.*/re.loop form.
No behaviour change.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
- Reorganize CORPUS into clearer sections with a ground-truth answer
  comment on every entry; fix one incorrect comment for (a$)* in match mode
- Add 7 new test cases for a?^b? to cover the concat:ff:split branch
- Add regex differential test step to CI (build_and_test_lean job)
- Add stdin usage example to DiffTestCore.lean comment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shigoel shigoel marked this pull request as ready for review March 5, 2026 01:51
@shigoel shigoel requested a review from a team March 5, 2026 01:51
@shigoel shigoel enabled auto-merge March 5, 2026 01:53
Copy link
Contributor

@MikaelMayer MikaelMayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR delivers a comprehensive fix for anchor handling bugs in Python regex translation and adds a robust differential testing framework. The implementation is well-documented with clear explanations of the anchor semantics and case-splitting logic. The test corpus is thorough with 347 cases covering edge cases systematically. The typo fix, complement fix (intersecting with re.allchar()), and all anchor-splitting logic are sound. CI passes on all checks.

One improvement opportunity: The false,true and false,false concat cases have nearly identical anchor-splitting logic that could be extracted into a helper function to reduce duplication.

shigoel and others added 7 commits March 6, 2026 08:39
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
Co-authored-by: Joe Hendrix <joehx@amazon.com>
…elCase

When both sides of a concat may be empty and r1 contains $ and r2 has
non-anchor content, forwarding atEnd to r1 is wrong if r2 matches
non-empty. Add the symmetric split (mirroring the true,false case):
Case 1 (r2="") lets $ in r1 fire; Case 2 (r2 non-empty) blocks it.

Also renames snake_case locals (r1b_start, r1b_mid, etc.) to camelCase
(r1bStart, r1bMid, etc.) throughout RegexAST.toCore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the three hand-rolled tree walks (containsAnchorEnd,
containsAnchorStart, hasNonAnchorContent) with a private anyNode
fold. Each predicate now explicitly handles .complement to document
that its child is a character-set description that cannot contain
positional nodes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shigoel
Copy link
Contributor Author

shigoel commented Mar 6, 2026

FYI: I'm going to refactor the code in ReToCore since it's getting unwieldy. Too many case splits...

shigoel and others added 4 commits March 6, 2026 09:28
If the subprocess fails totally or partially, surface it as a hard
error (exit 2) rather than silently classifying missing results as
'investigate'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove "BUG: Strata says match" annotations from a?(^b) search tests
  (the bug is now fixed)
- Add ^?(^b) search tests: exercises the false,true simple path where
  r1 has no non-anchor content; the atStart fix matters here too
- Add a?$b? match and search tests: exercises the false,false $-split
  under all atStart/atEnd combinations from pythonRegexToCore
- Add (a?$)b? fullmatch tests: exercises the $-split at the outer
  concat level (vs. a?$b? which fires it at the inner level)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier 1: Add private abbrev builders (mkReFromStr, mkReRange, mkReAllChar,
mkReComp, mkReUnion, mkReConcat, mkReInter, mkReStar, mkReLoop) to replace
repetitive mkApp () (.op () ...Func.name none) [...] patterns throughout
toCore and pythonRegexToCore.

Tier 2: Extract three structural helpers:
- toCoreStarBody: shared body for .star and .loop {0,m}, parameterized by
  `simple` and `center` wrappers (eliminates ~20-line duplication)
- endSplit: $-in-r1 concat split pattern
- startSplit: ^-in-r2 concat split pattern

toCore shrinks from ~115 to ~50 lines with no semantic change.
All 497 differential tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@FedericoAureliano
Copy link

This is great! I really like the encoding.

One nitpick: desugaring r+ to rr* in RegexAST.toCore causes an exponential blowup when you have nested + constructors. Do we really need to do that? The core language includes a + operator, why not just encode it directly?

One simple test case to add that shows this off would be #eval Std.format $ pythonRegexToCore "((((((a+)+)+)+)+)+)+" .fullmatch.

This is a nitpick because it is probably unlikely to come up in practice very frequently.

@shigoel
Copy link
Contributor Author

shigoel commented Mar 6, 2026

This is great! I really like the encoding.

One nitpick: desugaring r+ to rr* in RegexAST.toCore causes an exponential blowup when you have nested + constructors. Do we really need to do that? The core language includes a + operator, why not just encode it directly?

One simple test case to add that shows this off would be #eval Std.format $ pythonRegexToCore "((((((a+)+)+)+)+)+)+" .fullmatch.

This is a nitpick because it is probably unlikely to come up in practice very frequently.

The reason we do this is anchors, yet again. But you're right: I can add a fast path that uses re.+ when the regex doesn't have an anchor.

shigoel and others added 3 commits March 6, 2026 14:22
- Inline .optional and loop{0,1} (structural in r1)
- Extract toCoreConcat abbrev (shared by .concat and .plus)
- Extract toCoreLoop def (recurses on n, structural in r1)
- Remove partial: all recursive calls are now structural
- Add anchor-free fast paths: .plus emits re.+ directly when r1 has
  no anchors; loop{n+1,m} emits re.loop directly likewise

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 20 near-identical deep-nesting cases for ((((((a+)+)+)+)+)+)+
with 4 representative depth cases plus 28 cases covering structurally
distinct patterns: (a+)*, (a*)+ , (a+b+)+, and (a+|b+)+.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shigoel shigoel requested a review from MikaelMayer March 6, 2026 22:27
Copy link
Contributor

@joehendrix joehendrix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the issues found previously.

@shigoel shigoel added this pull request to the merge queue Mar 9, 2026
Merged via the queue into main with commit e080354 Mar 9, 2026
15 checks passed
@shigoel shigoel deleted the shilpi/regex-fix branch March 9, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants