Fix BaseChunkedParser consistency and error handling (Closes #926) #929

Merged
JSv4 merged 3 commits into main from claude/resolve-issue-926-7rmxg
Feb 22, 2026
Conversation

JSv4 (Collaborator) commented Feb 22, 2026

Summary

This PR fixes three robustness and consistency issues in BaseChunkedParser:

  1. Config validation errors are now properly wrapped – ValueError exceptions from invalid max_pages_per_chunk or min_pages_for_chunking are caught and re-raised as DocumentParsingError(is_transient=False) for consistent error handling.

  2. Single-chunk documents now receive consistent chunk-prefixed IDs – Previously, documents below the chunking threshold were returned directly without passing through _reassemble_chunk_results, resulting in unprefixed annotation IDs. Now all results consistently receive c0_ prefixed IDs, ensuring downstream consumers see a uniform format.

  3. Improved test coverage and reliability – Added tests for the backoff cap behavior and fixed a potential race condition in concurrent chunk tests.

Key Changes

  • opencontractserver/pipeline/base/chunked_parser.py:

    • Wrapped calculate_page_chunks() call in try-except to catch ValueError and re-raise as DocumentParsingError(is_transient=False)
    • Route single-chunk results through _reassemble_chunk_results([result], [0]) to ensure consistent c0_ prefixed IDs
    • Updated class docstring to document the uniform chunk-prefixed ID format
  • opencontractserver/tests/test_chunked_parser.py:

    • Updated test_small_doc_no_chunking docstring and added assertion verifying c0_ prefixed IDs
    • Added test_retry_backoff_capped_at_max to verify backoff values are capped at MAX_CHUNK_RETRY_BACKOFF_SECONDS (30s)
    • Added test_invalid_max_pages_per_chunk_raises_document_parsing_error to verify config validation
    • Added test_invalid_min_pages_for_chunking_raises_document_parsing_error to verify config validation
    • Fixed potential race condition in concurrent chunk test by adding slow_chunks_started.wait(timeout=2) before assertion

Implementation Details

The fix ensures that whether a document is chunked or not, all annotation and relationship IDs follow the same c{chunk_index}_ prefix convention. This is achieved by routing single-chunk results through the existing _reassemble_chunk_results function, which was previously only used for multi-chunk documents.

Config validation errors are now consistently handled as non-transient DocumentParsingError exceptions, allowing callers to distinguish between transient parsing failures (which may be retried) and configuration issues (which require intervention).
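The uniform prefixing contract described above can be illustrated with a minimal sketch. The function name and result-dict shape mirror the PR's description, but they are illustrative assumptions, not the actual OpenContracts implementation:

```python
# Hedged sketch of the c{chunk_index}_ prefixing behaviour described above.
# Names and dict shape are assumptions for illustration only.
def reassemble_chunk_results(chunk_results, chunk_indices):
    merged = {"labelled_text": [], "relationships": []}
    for result, idx in zip(chunk_results, chunk_indices):
        prefix = f"c{idx}_"  # c{chunk_index}_ prefix convention
        for ann in result.get("labelled_text", []):
            merged["labelled_text"].append({**ann, "id": prefix + ann["id"]})
        for rel in result.get("relationships", []):
            merged["relationships"].append({**rel, "id": prefix + rel["id"]})
    return merged

# Single-chunk documents now take the same path as multi-chunk ones:
single = {"labelled_text": [{"id": "a1"}], "relationships": []}
merged = reassemble_chunk_results([single], [0])
print(merged["labelled_text"][0]["id"])  # c0_a1
```

Because the single-chunk path calls the same function with `[result], [0]`, the `c0_` prefix falls out of the existing logic with no duplicated code.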

https://claude.ai/code/session_01RsC25ZupJSYVdY2gmBNpQx

- Wrap calculate_page_chunks ValueError as DocumentParsingError so
  callers catching DocumentParsingError don't miss config errors
- Route single-chunk documents through _reassemble_chunk_results for
  consistent c0_-prefixed annotation/relationship IDs
- Add test exercising MAX_CHUNK_RETRY_BACKOFF_SECONDS cap (attempt 4+
  where 5*2^3=40 > 30)
- Fix theoretical race in concurrent cancellation test by adding
  slow_chunks_started.wait(timeout=2) before assertion
- Add tests for invalid max_pages_per_chunk and min_pages_for_chunking
  config propagating as DocumentParsingError

Closes #926


claude bot commented Feb 22, 2026

PR Review: Fix BaseChunkedParser consistency and error handling

Overall this is a solid, well-targeted fix with good test coverage. The changes are small, focused, and properly documented. A few things worth discussing before merge:


Issues

1. Missing document context in ValueError wrapping (chunked_parser.py:194)

The new try/except around calculate_page_chunks drops the doc_id from the error message:

# New code (line 194)
raise DocumentParsingError(str(e), is_transient=False)

# All other error wraps in this same file include doc_id for debuggability:
raise DocumentParsingError(
    f"Cannot determine page count for document {doc_id}: {e}",
    is_transient=False,
)

Suggest:

raise DocumentParsingError(f"Config error for document {doc_id}: {e}", is_transient=False)

2. Potential breaking change: c0_ prefix on single-chunk annotation IDs

The PR routes single-chunk results through _reassemble_chunk_results([result], [0]), which prefixes all annotation and relationship IDs with c0_. This is the correct behavior for uniformity going forward, but it is a behavior change for any previously-parsed small documents.

If annotation IDs generated by a BaseChunkedParser subclass are persisted (e.g. as part of a document's OpenContractDocExport snapshot, embedded in relationship source/target IDs, or used as lookup keys), re-parsing the same document would now produce different IDs. The CHANGELOG notes this as a fix, but it may warrant an explicit callout as a breaking change for anyone who has stored parser output from small documents.

Questions to consider:

  • Are chunk-parser results ever diffed against previously-stored annotation IDs?
  • Do any callers (e.g. save_parsed_data) match on the annotation ID strings produced here?

If those paths exist, a migration note or version bump may be appropriate.


Minor nits

3. test_retry_backoff_capped_at_max relies on implicit DEFAULT_MIN_PAGES_FOR_CHUNKING

The test uses make_test_pdf(10) and leaves min_pages_for_chunking at its default, banking on the 10-page PDF being below the chunking threshold so only one chunk is dispatched. Adding a brief comment (or setting the threshold explicitly like other tests do) would make the intent self-documenting:

parser = PersistentRetryParser()
parser.chunk_retry_limit = 4
parser.min_pages_for_chunking = 75  # keep 10-page PDF as single chunk

4. test_concurrent_failure_cancels_remaining race window is still theoretical

The slow_chunks_started.wait(timeout=2) fix is correct and improves reliability. One remaining edge: if chunk 0 fails before ANY other chunk thread even starts executing (possible on a single-core runner at high load), slow_chunks_started will never be set and the assertion will fail after a 2s wait. This is the same fundamental constraint the original test had — it's asserting a race outcome. A comment acknowledging this or restructuring the assertion to use assertFalse would make the intent clearer (i.e. "verify that threads were dispatched, not that they ran"). Not blocking.


Positives

  • The ValueError → DocumentParsingError(is_transient=False) wrapping is the right pattern and consistent with how other config errors are handled in this file.
  • Routing single-chunk results through _reassemble_chunk_results is a clean, zero-duplication approach — reuses the existing function rather than duplicating the prefix logic.
  • The backoff cap test correctly pins [5, 10, 20, 30] and the math is right (5*2^3 = 40, capped at 30).
  • CHANGELOG entry is clear and well-structured.
  • Test isolation is good — each new test class is self-contained with a fresh ConcreteChunkedParser instance.

The breaking change question (point 2) is the main thing I'd want confirmed before merging. If annotation IDs from single-chunk parses aren't stored/compared downstream, this is ready to go.


codecov bot commented Feb 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



claude bot commented Feb 22, 2026

PR Review: Fix BaseChunkedParser consistency and error handling

Overall this is a clean, well-reasoned PR. The three issues it addresses are real and the fixes are correct. A few observations:


Correctness

Single-chunk ID prefixing (core change): The fix is sound. Routing single-chunk results through _reassemble_chunk_results([result], [0]) applies c0_ prefixes with zero page offset, which is exactly right. The import_annotations utility in importing.py uses id fields only as local-to-import keys for parent-child mapping, so prefixed or unprefixed IDs both work as long as they're internally consistent.

ValueError wrapping: Correct. calculate_page_chunks validates max_pages_per_chunk and min_pages_for_chunking before inspecting page count, so the new try-except fires regardless of document size.


Issues

1. Missing exception chain in ValueError wrapping (chunked_parser.py:194)

raise DocumentParsingError(str(e), is_transient=False)

This should be raise DocumentParsingError(str(e), is_transient=False) from e to preserve the traceback chain. Without from e, the original ValueError context is silently dropped from the exception chain, making debugging harder in production. (Line 176 in the same method has the same gap — both could be fixed together.)
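The effect of `from e` can be shown with a self-contained sketch. The exception class here is a stand-in for the project's `DocumentParsingError` (the `is_transient` flag is assumed from the PR description):

```python
class DocumentParsingError(Exception):
    """Stand-in for the project's exception; is_transient flag assumed."""
    def __init__(self, message, is_transient=True):
        super().__init__(message)
        self.is_transient = is_transient

def parse(doc_id):
    try:
        raise ValueError("max_pages_per_chunk must be positive")
    except ValueError as e:
        # `from e` records the ValueError as __cause__, so tracebacks show
        # "The above exception was the direct cause of ..." in logs.
        raise DocumentParsingError(
            f"Invalid chunking config for document {doc_id}: {e}",
            is_transient=False,
        ) from e

try:
    parse(42)
except DocumentParsingError as err:
    print(type(err.__cause__).__name__)  # ValueError
```

Without `from e` inside an `except` block, Python still sets `__context__`, but `__cause__` stays `None` and the chain is easy to lose when exceptions are re-wrapped further up the stack.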

2. Missing doc_id context in the new error message (chunked_parser.py:194)

Every other DocumentParsingError raised in _parse_document_impl includes the document ID for log correlation (e.g. f"Cannot determine page count for document {doc_id}: {e}"). The new raise just passes str(e) through. Suggest: f"Invalid chunking config for document {doc_id}: {e}".

3. No test that _post_reassemble_hook is still called for single-chunk documents

The existing test_post_reassemble_hook_called only exercises the multi-chunk path. After this PR, the single-chunk path also routes through _reassemble_chunk_results before the hook, so the hook still fires — but that behaviour is untested. A small test analogous to test_post_reassemble_hook_called but with a small doc and min_pages_for_chunking set above the page count would close this gap.


Minor

Potential breaking change for subclass overrides of _post_reassemble_hook

Before this PR, the hook received raw (un-prefixed) annotation IDs for single-chunk documents. Now it receives c0_-prefixed IDs. Any existing subclass overriding the hook that inspects or compares annotation IDs will silently change behaviour. The CHANGELOG entry covers the public contract, but the class docstring (and ideally the hook docstring) could note that IDs are always prefixed before the hook is called.

Race in test_concurrent_failure_cancels_remaining

The wait(timeout=2) improvement is the right fix. One small follow-up: if the wait times out and is_set() is still False, the test failure message will just say False is not True. Adding a message to assertTrue (msg="slow_chunks_started never set") makes CI failures self-describing.


What's good

  • Backoff test is precise: expected values [5, 10, 20, 30] are explicitly verified, not just a call count check.
  • The two config-validation tests check both the exception type and the message substring and is_transient=False flag — complete assertions.
  • Changelog entries are thorough and include file paths.
  • The docstring update accurately describes the new invariant.

Summary: Three small things worth addressing before merge — exception chain (from e), missing doc_id in error message, and a hook test for the single-chunk path. The core logic is correct.

JSv4 merged commit 2938c39 into main Feb 22, 2026
3 checks passed

claude bot commented Feb 22, 2026

PR Review: Fix BaseChunkedParser consistency and error handling

Note: This PR is already merged. The following is a post-merge review for tracking purposes and to surface follow-up work.

The three core fixes are correct and the test coverage additions are valuable. Two prior reviews covered the main issues well. I'll focus on a synthesis and any gaps.


Issues that warrant a follow-up PR

1. Exception chain dropped in ValueError wrapping (chunked_parser.py:194)

# Current
raise DocumentParsingError(str(e), is_transient=False)

# Should be
raise DocumentParsingError(f"Invalid chunking config for document {doc_id}: {e}", is_transient=False) from e

Two problems in one line:

  • Missing from e drops the original ValueError from the exception chain, making tracebacks in production logs harder to follow. Every other exception site in the same method re-raises directly and so keeps its chain; this wrap is the one that loses it. It should be raise ... from e.
  • Missing doc_id context: every other DocumentParsingError in _parse_document_impl includes the document ID (e.g. f"Cannot determine page count for document {doc_id}: {e}"). The new raise omits it, making log correlation harder.

Line 176 (get_pdf_page_count ValueError) has the same missing-from e gap and could be fixed at the same time.

2. _post_reassemble_hook is untested on the single-chunk path

test_post_reassemble_hook_called (line 435) only exercises the multi-chunk path (100-page PDF, min_pages_for_chunking=75). After this PR, single-chunk documents also pass through _reassemble_chunk_results before the hook, but that path has no corresponding test. A small-doc variant of the test would close the gap:

parser.min_pages_for_chunking = 75  # keep 10-page doc as single chunk
result = parser._parse_document_impl(...)
self.assertTrue(hook_called["value"])
self.assertEqual(result["labelled_text"][0]["id"], "c0_...")  # IDs pre-prefixed before hook

This also validates that subclasses overriding the hook receive pre-prefixed IDs on the single-chunk path — a behaviour change from before this PR.

3. test_retry_backoff_capped_at_max relies on implicit constant

PersistentRetryParser doesn't set min_pages_for_chunking, so the test silently depends on DEFAULT_MIN_PAGES_FOR_CHUNKING = 75 being greater than the 10-page test PDF. If that constant ever changes, the test would exercise multi-chunk retry logic instead. Adding parser.min_pages_for_chunking = 75 makes the intent explicit and matches the pattern used by other tests in the same class.
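The backoff schedule the test pins can be reproduced with a minimal sketch. The constant names and 5-second base delay are assumptions drawn from the review's arithmetic (5*2^3 = 40, capped at 30):

```python
# Hedged sketch of the capped exponential backoff the test verifies.
# Constant names and base delay are assumptions, not the project's code.
BASE_BACKOFF_SECONDS = 5
MAX_CHUNK_RETRY_BACKOFF_SECONDS = 30

def retry_backoff(attempt):
    # Attempts 0-2 double cleanly (5, 10, 20); attempt 3 would be
    # 5 * 2**3 = 40, which the cap clamps to 30.
    return min(BASE_BACKOFF_SECONDS * 2 ** attempt,
               MAX_CHUNK_RETRY_BACKOFF_SECONDS)

print([retry_backoff(a) for a in range(4)])  # [5, 10, 20, 30]
```

Setting `chunk_retry_limit = 4` in the test is what pushes the schedule into attempt 3, the first value where the cap actually bites.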


Minor

assertTrue failure message in test_concurrent_failure_cancels_remaining: After the wait(timeout=2), if the event times out, the test fails with just False is not True. Adding a message helps CI diagnosis:

self.assertTrue(slow_chunks_started.is_set(), "slow_chunks_started never set within timeout")

_post_reassemble_hook docstring could note that annotation/relationship IDs are always c{n}_-prefixed when the hook is invoked. Subclasses that inspect IDs in the hook now see prefixed IDs for single-chunk documents too.


What's solid

  • Routing single-chunk results through _reassemble_chunk_results([result], [0]) is the right approach — reuses existing logic with zero offset and c0_ prefix, no duplication.
  • Config validation tests check exception type, message substring, and is_transient=False — complete assertions.
  • Backoff math verification ([5, 10, 20, 30]) is precise rather than just checking call count.
  • CHANGELOG entries are specific (include file paths and line numbers).
  • Race fix (wait(timeout=2)) is correct and improves CI reliability.

The main follow-up items are the missing from e / doc_id in the error message and the missing single-chunk hook test.
