Fix BaseChunkedParser consistency and error handling (Closes #926)#929
Conversation
- Wrap `calculate_page_chunks` `ValueError` as `DocumentParsingError` so callers catching `DocumentParsingError` don't miss config errors
- Route single-chunk documents through `_reassemble_chunk_results` for consistent `c0_`-prefixed annotation/relationship IDs
- Add test exercising the `MAX_CHUNK_RETRY_BACKOFF_SECONDS` cap (attempt 4+, where 5 * 2^3 = 40 > 30)
- Fix theoretical race in concurrent cancellation test by adding `slow_chunks_started.wait(timeout=2)` before the assertion
- Add tests for invalid `max_pages_per_chunk` and `min_pages_for_chunking` config propagating as `DocumentParsingError`

Closes #926

https://claude.ai/code/session_01RsC25ZupJSYVdY2gmBNpQx
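For context on the backoff-cap bullet, the schedule implied by the `5 * 2^3 = 40 > 30` arithmetic is a base-5-seconds exponential backoff capped at 30s. A minimal sketch (the base constant's name is an assumption; only `MAX_CHUNK_RETRY_BACKOFF_SECONDS` appears in the PR):

```python
MAX_CHUNK_RETRY_BACKOFF_SECONDS = 30  # cap referenced in the PR
BASE_BACKOFF_SECONDS = 5              # assumed base, implied by 5 * 2^3 = 40


def chunk_retry_backoff(attempt: int) -> int:
    """Exponential backoff for chunk retries, capped at the max (attempt is 1-based)."""
    return min(BASE_BACKOFF_SECONDS * 2 ** (attempt - 1), MAX_CHUNK_RETRY_BACKOFF_SECONDS)


# Attempts 1-3 grow geometrically (5, 10, 20); attempt 4 would be 40 and is capped to 30.
```

The new test exercises exactly the attempt-4 case, where the uncapped value first exceeds the cap.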
PR Review: Fix BaseChunkedParser consistency and error handling

Overall this is a solid, well-targeted fix with good test coverage. The changes are small, focused, and properly documented. A few things worth discussing before merge:

Issues

1. Missing document context in ValueError wrapping (chunked_parser.py:194)

The new try/except around `calculate_page_chunks()` raises:

```python
# New code (line 194)
raise DocumentParsingError(str(e), is_transient=False)
```

All other error wraps in this same file include doc_id for debuggability:

```python
raise DocumentParsingError(
    f"Cannot determine page count for document {doc_id}: {e}",
    is_transient=False,
)
```

Suggest: `raise DocumentParsingError(f"Config error for document {doc_id}: {e}", is_transient=False)`

2. Potential breaking change: c0_ prefix on single-chunk annotation IDs

The PR routes single-chunk results through `_reassemble_chunk_results`, so annotation IDs that were previously returned un-prefixed now carry a `c0_` prefix. If annotation IDs generated by a single-chunk parse are stored or compared anywhere downstream, this changes their format. Questions to consider: do any downstream consumers persist or match on these IDs?
If those paths exist, a migration note or version bump may be appropriate.

Minor nits

3. test_retry_backoff_capped_at_max relies on implicit DEFAULT_MIN_PAGES_FOR_CHUNKING

The test relies on the default `min_pages_for_chunking` keeping the 10-page PDF as a single chunk. Making that explicit would be clearer:

```python
parser = PersistentRetryParser()
parser.chunk_retry_limit = 4
parser.min_pages_for_chunking = 75  # keep 10-page PDF as single chunk
```

4. test_concurrent_failure_cancels_remaining race window is still theoretical

The race this guards against was theoretical, and the added `slow_chunks_started.wait(timeout=2)` makes the test deterministic.

Positives
The breaking change question (point 2) is the main thing I'd want confirmed before merging. If annotation IDs from single-chunk parses aren't stored/compared downstream, this is ready to go.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
PR Review: Fix BaseChunkedParser consistency and error handling

Overall this is a clean, well-reasoned PR. The three issues it addresses are real and the fixes are correct. A few observations:

Correctness

Single-chunk ID prefixing (core change): The fix is sound. Routing single-chunk results through `_reassemble_chunk_results` gives single-chunk and multi-chunk documents the same `c0_`-prefixed ID format.

ValueError wrapping: Correct.

Issues

1. Missing exception chain in ValueError wrapping:

```python
raise DocumentParsingError(str(e), is_transient=False)
```

This should be raised with `from e` so the original `ValueError` traceback is preserved.

2. Missing doc_id context in the new error message: Every other `DocumentParsingError` raised in this file includes the document ID in its message; the new one should too.

3. No test that … The existing …

Minor

Potential breaking change for subclass overrides: Before this PR, the hook received raw (un-prefixed) annotation IDs for single-chunk documents. Now it receives `c0_`-prefixed IDs.

Race in the concurrent cancellation test: The …

What's good
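Putting the exception-chain and doc_id points together, the wrapping site could look like the sketch below. `DocumentParsingError` and `calculate_page_chunks` are names from the PR; their bodies here are illustrative stand-ins, not the project's real implementations:

```python
class DocumentParsingError(Exception):
    """Minimal stand-in for the project's error type (assumed signature)."""

    def __init__(self, message: str, is_transient: bool = True):
        super().__init__(message)
        self.is_transient = is_transient


def calculate_page_chunks(page_count: int, max_pages_per_chunk: int):
    # Illustrative stand-in: the real function lives in chunked_parser.py.
    if max_pages_per_chunk <= 0:
        raise ValueError(f"max_pages_per_chunk must be positive, got {max_pages_per_chunk}")
    return [
        (start, min(start + max_pages_per_chunk, page_count))
        for start in range(0, page_count, max_pages_per_chunk)
    ]


def parse_with_wrapping(doc_id: int, page_count: int, max_pages_per_chunk: int):
    try:
        return calculate_page_chunks(page_count, max_pages_per_chunk)
    except ValueError as e:
        # `from e` preserves the original traceback (reviewer's point 1);
        # including doc_id matches the file's other error messages (point 2).
        raise DocumentParsingError(
            f"Config error for document {doc_id}: {e}",
            is_transient=False,
        ) from e
```

With `from e`, callers can still inspect the underlying `ValueError` via `err.__cause__`.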
Summary: Three small things are worth addressing before merge: the exception chain (issue 1), the doc_id context (issue 2), and the missing test (issue 3).
PR Review: Fix BaseChunkedParser consistency and error handling
The three core fixes are correct and the test coverage additions are valuable. Two prior reviews covered the main issues well. I'll focus on a synthesis and any gaps.

Issues that warrant a follow-up PR

1. Exception chain dropped in ValueError wrapping (…)
Summary
This PR fixes three robustness and consistency issues in `BaseChunkedParser`:

1. Config validation errors are now properly wrapped – `ValueError` exceptions from invalid `max_pages_per_chunk` or `min_pages_for_chunking` are caught and re-raised as `DocumentParsingError(is_transient=False)` for consistent error handling.

2. Single-chunk documents now receive consistent chunk-prefixed IDs – Previously, documents below the chunking threshold were returned directly without passing through `_reassemble_chunk_results`, resulting in unprefixed annotation IDs. Now all results consistently receive `c0_`-prefixed IDs, ensuring downstream consumers see a uniform format.

3. Improved test coverage and reliability – Added tests for the backoff cap behavior and fixed a potential race condition in concurrent chunk tests.
Key Changes

`opencontractserver/pipeline/base/chunked_parser.py`:

- Wrapped the `calculate_page_chunks()` call in try/except to catch `ValueError` and re-raise as `DocumentParsingError(is_transient=False)`
- Route single-chunk results through `_reassemble_chunk_results([result], [0])` to ensure consistent `c0_`-prefixed IDs

`opencontractserver/tests/test_chunked_parser.py`:

- Updated the `test_small_doc_no_chunking` docstring and added an assertion verifying `c0_`-prefixed IDs
- Added `test_retry_backoff_capped_at_max` to verify backoff values are capped at `MAX_CHUNK_RETRY_BACKOFF_SECONDS` (30s)
- Added `test_invalid_max_pages_per_chunk_raises_document_parsing_error` to verify config validation
- Added `test_invalid_min_pages_for_chunking_raises_document_parsing_error` to verify config validation
- Added `slow_chunks_started.wait(timeout=2)` before the assertion in the concurrent cancellation test
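The `_reassemble_chunk_results([result], [0])` routing listed above can be sketched as follows. This is an illustrative reduction, not the project's real method (which presumably also handles page offsets and other bookkeeping); it shows only the `c{chunk_index}_` ID rewrite that single-chunk documents now share with multi-chunk ones:

```python
def reassemble_chunk_results(chunk_results: list[dict], chunk_indices: list[int]) -> dict:
    """Merge per-chunk parse results, prefixing IDs with c{chunk_index}_ (illustrative)."""
    merged = {"annotations": [], "relationships": []}
    for result, idx in zip(chunk_results, chunk_indices):
        for ann in result.get("annotations", []):
            merged["annotations"].append({**ann, "id": f"c{idx}_{ann['id']}"})
        for rel in result.get("relationships", []):
            # Relationship endpoints must be remapped with the same prefix
            # so they still point at the rewritten annotation IDs.
            merged["relationships"].append({
                **rel,
                "id": f"c{idx}_{rel['id']}",
                "source_id": f"c{idx}_{rel['source_id']}",
                "target_id": f"c{idx}_{rel['target_id']}",
            })
    return merged


# Single-chunk documents now take the same path as multi-chunk ones:
single = reassemble_chunk_results(
    [{"annotations": [{"id": "a1"}], "relationships": []}], [0]
)
```

Prefixing also prevents ID collisions when two chunks independently emit the same local ID.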
Implementation Details

The fix ensures that whether a document is chunked or not, all annotation and relationship IDs follow the same `c{chunk_index}_` prefix convention. This is achieved by routing single-chunk results through the existing `_reassemble_chunk_results` function, which was previously only used for multi-chunk documents.

Config validation errors are now consistently handled as non-transient `DocumentParsingError` exceptions, allowing callers to distinguish between transient parsing failures (which may be retried) and configuration issues (which require intervention).

https://claude.ai/code/session_01RsC25ZupJSYVdY2gmBNpQx