Add BaseChunkedParser for automatic large PDF splitting and reassembly #909

Merged
JSv4 merged 11 commits into main from claude/chunked-doc-processing-nYJRg
Feb 22, 2026

Conversation

JSv4 (Collaborator) commented Feb 21, 2026

Summary

Introduces a new BaseChunkedParser abstract base class that automatically handles splitting large PDF documents into smaller page-range chunks, parsing each chunk independently (optionally in parallel), and reassembling the results into a single coherent document export. This enables parsers to transparently handle documents of any size without modifying their core parsing logic.
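As a rough sketch of the contract (illustrative stand-ins only — the real base class in opencontractserver/pipeline/base/chunked_parser.py also handles PDF splitting, concurrency, retries, and PAWLS offsetting), a subclass only supplies the per-chunk parsing step:

```python
from abc import ABC, abstractmethod

# Hypothetical minimal stand-in for BaseChunkedParser's contract, not the
# shipped code: subclasses implement _parse_single_chunk_impl() and the base
# decides whether to split, then merges the per-chunk results.
class ChunkedParserSketch(ABC):
    max_pages_per_chunk = 50
    min_pages_for_chunking = 75

    @abstractmethod
    def _parse_single_chunk_impl(self, chunk_bytes: bytes, chunk_index: int) -> dict:
        """Parse one page-range chunk; return a per-chunk export dict."""

    def parse(self, pdf_bytes: bytes, total_pages: int) -> dict:
        # Below the threshold: a single chunk, no splitting overhead.
        if total_pages < self.min_pages_for_chunking:
            return self._parse_single_chunk_impl(pdf_bytes, 0)
        n_chunks = -(-total_pages // self.max_pages_per_chunk)  # ceil division
        results = [self._parse_single_chunk_impl(pdf_bytes, i) for i in range(n_chunks)]
        return {"content": "\n".join(r["content"] for r in results)}

class EchoParser(ChunkedParserSketch):
    def _parse_single_chunk_impl(self, chunk_bytes, chunk_index):
        return {"content": f"chunk-{chunk_index}"}

merged = EchoParser().parse(b"%PDF", total_pages=120)  # three chunks merged
```

The point of the design is visible even in this toy: the subclass's parsing logic never sees more than one chunk at a time.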

Key Changes

New Components

  • BaseChunkedParser (opencontractserver/pipeline/base/chunked_parser.py)

    • Abstract base class extending BaseParser with automatic chunking support
    • Subclasses implement _parse_single_chunk_impl() instead of _parse_document_impl()
    • Handles PDF splitting, chunk dispatch (sequential or concurrent), and result reassembly
    • Per-chunk retry logic with exponential back-off for transient failures
    • Optional _post_reassemble_hook() for document-wide post-processing on the full PDF
    • Configurable thresholds: max_pages_per_chunk, min_pages_for_chunking, max_concurrent_chunks, chunk_retry_limit
  • PDF Splitting Utilities (opencontractserver/utils/pdf_splitting.py)

    • get_pdf_page_count(): Extract page count from PDF bytes
    • split_pdf_by_page_range(): Extract a contiguous page range as new PDF bytes
    • calculate_page_chunks(): Compute chunk boundaries based on page count and thresholds
  • Reassembly Logic (opencontractserver/pipeline/base/chunked_parser.py)

    • _reassemble_chunk_results(): Merge per-chunk results with correct global page offsets
    • _offset_annotation(): Adjust annotation page indices and IDs
    • _offset_relationship(): Adjust relationship annotation references
    • Handles PAWLS page indices, annotation JSON keys, token page references, and ID prefixing
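The boundary math can be sketched as a pure function, consistent with the behaviour described elsewhere in this thread (strict less-than threshold check, half-open page ranges, guard against non-positive chunk sizes) but not the shipped code:

```python
def calculate_page_chunks(total_pages: int, max_pages_per_chunk: int,
                          min_pages_for_chunking: int) -> list[tuple[int, int]]:
    """Sketch of the boundary computation: half-open (start, end) page ranges."""
    if max_pages_per_chunk <= 0:
        raise ValueError(f"max_pages_per_chunk must be > 0, got {max_pages_per_chunk}")
    if total_pages <= 0:
        return []
    if total_pages < min_pages_for_chunking:
        return [(0, total_pages)]  # single chunk: no splitting overhead
    chunks, start = [], 0
    while start < total_pages:
        end = min(start + max_pages_per_chunk, total_pages)
        chunks.append((start, end))
        start = end
    return chunks

ranges = calculate_page_chunks(120, 50, 75)  # [(0, 50), (50, 100), (100, 120)]
```

Note the strict `<` on the threshold: a document with exactly min_pages_for_chunking pages is chunked.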

Modified Components

  • DoclingParser (opencontractserver/pipeline/parsers/docling_parser_rest.py)

    • Now extends BaseChunkedParser instead of BaseParser
    • Implements _parse_single_chunk_impl() to send individual chunks to the Docling microservice
    • Implements _post_reassemble_hook() for image extraction on the full PDF after reassembly
    • Added chunking configuration settings: max_pages_per_chunk, min_pages_for_chunking, max_concurrent_chunks
  • Constants (opencontractserver/constants/document_processing.py)

    • Added chunking defaults: DEFAULT_MAX_PAGES_PER_CHUNK, DEFAULT_MIN_PAGES_FOR_CHUNKING, DEFAULT_MAX_CONCURRENT_CHUNKS, DEFAULT_CHUNK_RETRY_LIMIT

Tests

  • test_chunked_parser.py: Comprehensive tests for reassembly logic, offset helpers, and integration with BaseChunkedParser
  • test_pdf_splitting.py: Tests for PDF page counting, splitting, and chunk calculation
  • test_doc_parser_docling_rest.py: Updated mocks to reflect chunked parser architecture

Implementation Details

  • Transparent Chunking: Documents below the min_pages_for_chunking threshold are processed as a single chunk with no splitting overhead
  • Concurrent Dispatch: Uses ThreadPoolExecutor for parallel chunk processing when max_concurrent_chunks > 1
  • Result Reassembly: Correctly offsets all page-relative indices and prefixes IDs to maintain uniqueness across chunks
  • Retry Strategy: Per-chunk retries with exponential back-off (5s base) for transient errors; permanent errors fail immediately
  • Post-Processing Hook: Allows subclasses to run document-wide operations (e.g., image extraction) on the full PDF after reassembly
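The per-annotation fixups during reassembly can be illustrated with a simplified stand-in (the field names here are illustrative; the real _offset_annotation also rewrites PAWLS token page references and annotation JSON keys):

```python
def offset_annotation(annotation: dict, page_offset: int, id_prefix: str) -> dict:
    """Sketch of the per-chunk fixups: shift the page index by the chunk's
    global page offset and prefix the ID so it stays unique across chunks."""
    out = dict(annotation)  # shallow copy; avoid mutating the chunk result
    out["id"] = f"{id_prefix}{annotation['id']}"
    out["page"] = annotation["page"] + page_offset
    return out

# An annotation on local page 2 of chunk 1 (which starts at global page 50)
# lands on global page 52, with the chunk-scoped ID prefix applied.
fixed = offset_annotation({"id": "ann-1", "page": 2}, page_offset=50, id_prefix="c1_")
```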

Backward Compatibility

The public API (process_document, parse_document, save_parsed_data) remains unchanged. Existing BaseParser subclasses continue to work without modification.

https://claude.ai/code/session_016xZEFaSfqGzQuFa2xrCpT4

Large documents (70+ pages) that exceed Cloud Run timeout and response-size
limits are now automatically split into page-range chunks for
independent parsing and reassembly.

New BaseChunkedParser abstract class extends BaseParser with:
- Automatic PDF splitting via pypdf with configurable thresholds
- Optional concurrent chunk dispatch via ThreadPoolExecutor
- Per-chunk retry with exponential back-off
- Correct reassembly of PAWLS page indices, annotation references,
  and parent-child relationships across chunk boundaries
- Post-reassembly hook for document-wide processing (image extraction)

DoclingParser now extends BaseChunkedParser. Documents below the
chunking threshold (default: 75 pages) are processed unchanged.
Configurable via PipelineSettings (DOCLING_MAX_PAGES_PER_CHUNK,
DOCLING_MIN_PAGES_FOR_CHUNKING, DOCLING_MAX_CONCURRENT_CHUNKS).

https://claude.ai/code/session_016xZEFaSfqGzQuFa2xrCpT4
claude bot commented Feb 21, 2026

Code Review: BaseChunkedParser for Large PDF Splitting

Overall this is a well-designed and thoughtfully implemented feature. The abstraction layer is clean, the reassembly logic is clearly separated into pure functions (making it independently testable), and the test coverage is solid. A few issues worth addressing:


🔴 Bug: Thread pool doesn't cancel in-flight futures on failure

In _dispatch_concurrent, when one chunk raises an exception inside the for future in as_completed(...) loop, Python's with ThreadPoolExecutor(...) as executor: context manager calls executor.shutdown(wait=True), which waits for all submitted futures to finish before propagating the exception. For a 10-chunk document where chunk 0 fails immediately, you silently wait for chunks 1–9 to complete before the error surfaces.

# chunked_parser.py - _dispatch_concurrent
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    ...
    for future in as_completed(future_to_index):
        exc = future.exception()
        if exc is not None:
            raise exc  # executor.__exit__ still waits for all running futures!

Python 3.9+ supports executor.shutdown(wait=False, cancel_futures=True). One approach is to manage the executor manually instead of via the context manager:

executor = ThreadPoolExecutor(max_workers=max_workers)
try:
    future_to_index = {executor.submit(...): idx for ...}
    for future in as_completed(future_to_index):
        if (exc := future.exception()) is not None:
            executor.shutdown(wait=False, cancel_futures=True)
            raise DocumentParsingError(...) from exc
        results_by_index[future_to_index[future]] = future.result()
finally:
    executor.shutdown(wait=False, cancel_futures=True)

🟡 Issue: All chunk PDFs held in memory simultaneously

In _parse_document_impl, all chunk byte arrays are precomputed into chunk_data before any parsing begins:

chunk_data: list[tuple[int, bytes, int]] = []
for idx, (start, end) in enumerate(chunks):
    chunk_bytes = split_pdf_by_page_range(pdf_bytes, start, end)
    chunk_data.append((idx, chunk_bytes, start))

For a 300-page scanned PDF (say 150 MB), this means the original PDF plus all chunk PDFs are held in memory simultaneously. For sequential dispatch (max_concurrent_chunks=1) this is unnecessary — chunks can be generated lazily one at a time. At minimum, the sequential path should yield/generate chunks on demand.
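A lazy sequential path could look roughly like the following (the splitter is stubbed with a byte-slice so the sketch stays self-contained; the real split_pdf_by_page_range uses pypdf):

```python
from collections.abc import Iterator

def split_pdf_by_page_range(pdf_bytes: bytes, start: int, end: int) -> bytes:
    # Stub standing in for the real pypdf-based splitter in pdf_splitting.py.
    return pdf_bytes[start:end]

def iter_chunks(pdf_bytes: bytes,
                chunks: list[tuple[int, int]]) -> Iterator[tuple[int, bytes, int]]:
    """Yield (index, chunk_bytes, start_page) one at a time, so the sequential
    dispatch path only ever materializes a single chunk PDF in memory."""
    for idx, (start, end) in enumerate(chunks):
        yield idx, split_pdf_by_page_range(pdf_bytes, start, end), start

first = next(iter_chunks(b"0123456789", [(0, 4), (4, 8)]))  # (0, b"0123", 0)
```

Because the generator is consumed one item at a time, each chunk's bytes become garbage-collectable as soon as the parser is done with them.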


🟡 Issue: Cross-chunk parent-child relationships are silently broken

_offset_annotation prefixes parent_id with the chunk's own prefix:

parent_id = annotation.get("parent_id")
if parent_id is not None:
    annotation["parent_id"] = f"{id_prefix}{parent_id}"

Since each chunk is parsed independently, chunk 1 has no knowledge of chunk 0's annotation IDs. A paragraph on page 51 whose section header is on page 49 (in chunk 0) will have parent_id = "c1_section-id" while the actual parent's prefixed ID is "c0_section-id". The reference is silently broken.

This is a fundamental limitation of per-chunk parsing (the microservice can't link across split boundaries), but it should be documented in the class docstring and ideally emitted as a warning during reassembly. The PR description says "correct reassembly of parent-child relationships across chunk boundaries" which is only true within a single chunk.


🟡 Issue: split_pdf_by_page_range creates a new PdfReader per call

def split_pdf_by_page_range(pdf_bytes: bytes, start_page: int, end_page: int) -> bytes:
    reader = PdfReader(io.BytesIO(pdf_bytes))  # re-parsed on every call

For a 10-chunk document this parses the full PDF 10 times. Consider accepting a PdfReader (or pre-read page list) as an optional parameter, or having _parse_document_impl create a single shared reader and pass it down.


🟡 Test: test_concurrent_dispatch doesn't verify result ordering

The concurrent dispatch test only checks result["page_count"] == 4 but not that PAWLS pages are in the correct global order (as test_large_doc_is_chunked does for the sequential case). Since as_completed returns futures in completion order, a bug in the ordering logic (results_by_index[i] for i in range(...)) would not be caught:

def test_concurrent_dispatch(self, mock_open):
    ...
    self.assertEqual(result["page_count"], 4)
    # Missing: check indices == [0, 1, 50, 51]

🟡 DRY: _make_pdf helper is duplicated across test files

The function appears identically in both test_chunked_parser.py and test_pdf_splitting.py. Per project conventions, shared test helpers should live in a common location (e.g., opencontractserver/tests/utils.py).


🟢 Nit: results_by_index dict can be a pre-sized list

Chunk indices are contiguous 0..N-1, so a pre-sized list is simpler:

results: list[Optional[OpenContractDocExport]] = [None] * len(chunk_data)
...
results[chunk_index] = result
return results  # already ordered

🟢 Nit: Add a comment on the two-tier retry strategy

chunk_retry_limit = DEFAULT_CHUNK_RETRY_LIMIT (value: 1) is intentionally low because the outer Celery task provides a second tier of retries. A brief comment here would prevent future maintainers from increasing this without understanding the tradeoff.


Summary

The core architecture is sound, the pure-function reassembly logic is correct for the within-chunk case, and the test suite is comprehensive. Main actionable items:

  1. Cancel in-flight futures on chunk failure — correctness/performance issue
  2. Document the cross-chunk parent-child limitation — the current behaviour is a silent data integrity issue for deeply nested documents
  3. Lazy chunk splitting for sequential dispatch — reduces peak memory
  4. Add ordering assertion to test_concurrent_dispatch — test coverage gap
  5. Deduplicate _make_pdf test helper — DRY

JSv4 added 8 commits February 21, 2026 10:18
Moves the identical _make_pdf helper from test_pdf_splitting.py and
test_chunked_parser.py into a shared opencontractserver/tests/helpers.py
module.

Previously, when one chunk raised an exception inside the
ThreadPoolExecutor context manager, executor.__exit__ called
shutdown(wait=True), blocking until all remaining futures completed.
Now the executor is managed manually with shutdown(wait=False,
cancel_futures=True) in a finally block, so failures surface
immediately.

The concurrent test previously only checked page_count but not that
PAWLS pages were in correct global order. Since as_completed returns
futures in completion order, this assertion exercises the ordering
logic in results_by_index.

Each chunk is parsed independently, so parent-child annotation
references spanning chunk boundaries become orphaned after reassembly.
This adds a warning during reassembly when orphaned parent_ids are
detected and documents the limitation in the class docstring.

Previously all chunk PDFs were pre-split into memory before any
parsing began. Now the sequential path splits each chunk on demand,
so only one chunk PDF is in memory at a time. The concurrent path
still pre-splits since all futures are submitted upfront.

Avoids re-parsing the full PDF for every chunk when splitting
concurrently. The concurrent dispatch path now creates a single
shared PdfReader and passes it to all split calls.
claude bot commented Feb 21, 2026

Code Review: BaseChunkedParser for Large PDF Splitting

Overall this is a well-structured PR. The design is clean, the documentation is thorough, and the test coverage is solid. A few items worth addressing before merging:


Bugs / Correctness

1. PAWLS page mutation via mutable default sentinel (chunked_parser.py, _reassemble_chunk_results)

page_info = page_data.get("page", {})
page_info["index"] = page_info.get("index", 0) + offset

If page_data has no "page" key (which shouldn't happen in practice, but the code pattern is fragile), page_data.get("page", {}) returns a new empty dict, the mutation is performed on that throwaway object, and the original page_data is left untouched. This is a silent no-op rather than an error. The write is not reflected back into page_data. Consider:

if "page" in page_data:
    page_data["page"]["index"] = page_data["page"].get("index", 0) + offset

2. Single-chunk shortcut returns the original dict without copying

if len(chunk_results) == 1 and page_offsets[0] == 0:
    return chunk_results[0]

The caller (_parse_document_impl) may then pass this to _post_reassemble_hook, which is permitted to mutate it. That's fine for the current callers, but it means the returned object is the same object as what's in chunk_results[0]. If a caller ever holds a reference to both the chunk result and the final document, unexpected aliasing occurs. Consider a shallow copy: return dict(chunk_results[0]).

3. Duplicate PdfReader construction in test (test_pdf_splitting.py, test_accepts_prebuilt_reader)

reader = PdfReader(io.BytesIO(pdf_bytes))
reader = PdfReader(io.BytesIO(pdf_bytes))  # dead assignment

The first assignment is immediately overwritten. One line should be removed.


Performance / Resource Management

4. Concurrent mode pre-splits all chunks into memory simultaneously

In _parse_document_impl, when max_concurrent_chunks > 1, all chunk PDFs are materialized into chunk_data before any thread starts:

chunk_data: list[tuple[int, bytes, int]] = []
for idx, (start, end) in enumerate(chunks):
    chunk_bytes = split_pdf_by_page_range(...)
    chunk_data.append((idx, chunk_bytes, start))

For a 500-page document split into 10 × 50-page chunks, this holds all 10 chunk PDFs in memory at once (potentially several hundred MB if each chunk is ~20 MB). This trade-off is intentional for concurrent dispatch, but it's worth documenting explicitly in the class docstring alongside the existing "Limitation" note, so operators can make an informed decision when tuning max_concurrent_chunks.

5. Cancelled concurrent threads can still block until their network timeout

executor.shutdown(wait=False, cancel_futures=True) cancels pending futures but cannot interrupt threads that have already started. If chunk N fails and other chunks are mid-request to the Docling service, those threads will block until their request_timeout expires (default: apparently per service config). The test test_concurrent_failure_cancels_remaining verifies the fast-fail behaviour by using time.sleep(5) in a non-started future — but a real production failure where multiple threads have already launched HTTP requests will still hold the thread pool threads until their timeouts. This is an inherent Python limitation, but callers should be aware the effective wall-clock time on failure can be up to request_timeout seconds even after the error is surfaced.


Code Quality

6. import time inside function bodies in tests (test_chunked_parser.py, SlowFailParser._parse_single_chunk_impl)

import time
time.sleep(5)

Top-level imports are preferred. The outer test file already imports time at the module level (line 74 of the diff), so this inner import is redundant.

7. Timing-sensitive assertion in test_concurrent_failure_cancels_remaining

self.assertLess(elapsed, 3.0, "Failure should not wait for remaining chunks")

A 3-second wall-clock assertion is fragile on slow/loaded CI runners. The slow chunk sleeps for 5 seconds, so the guard is testing that the cancelled-but-not-yet-started futures don't run. That's the right thing to test, but the assertion margin is tight. Consider raising the threshold (e.g. 4.5s) or restructuring the test to mock time.sleep in the child thread, which avoids any real sleep and eliminates the flakiness.

8. cancel_futures=True requires Python ≥ 3.9

ThreadPoolExecutor.shutdown(cancel_futures=True) was added in Python 3.9. Django 4.x supports Python 3.8+. If the project supports 3.8, this will raise TypeError. Worth confirming the minimum Python version or guarding with a version check.


Minor / Nits

9. _offset_annotation continue path is correct but confusing

The flow for non-integer page keys is:

except (ValueError, TypeError):
    new_key = page_key
    new_json[new_key] = page_data  # added here
    continue                        # skip tokensJsons mutation and the normal append below

This is functionally correct, but the assignment and continue together are easy to misread as a no-op. A small inline comment noting "already added above" would help reviewers.

10. Log message alignment with actual threshold semantics

The log message in _parse_document_impl says "below chunking threshold" for the single-chunk path, but the actual condition is len(chunks) <= 1, which triggers when total_pages < min_pages_for_chunking or when total_pages == 0. A zero-page document would therefore log the same "below threshold" message, which is misleading. Edge case, but worth tightening.


What's working well

  • Transparent API: the public process_document / parse_document / save_parsed_data surface is completely unchanged — existing subclasses need zero modification.
  • The sequential vs. concurrent dispatch split is clean and the memory-efficiency trade-off is well-reasoned.
  • Per-chunk retry with exponential back-off is a good layering decision against the outer Celery retry.
  • The cross-chunk parent-child limitation is clearly documented in the class docstring and has a dedicated test asserting the known behaviour — this is exactly the right approach.
  • The _post_reassemble_hook design for image extraction ensures globally-consistent page indices regardless of chunking.
  • Test helpers (make_test_pdf, _make_chunk_result) are well-factored and reusable.
  • Constant extraction to document_processing.py follows existing project conventions.

Items 1 and 8 should be confirmed before merge; the rest can be addressed at your discretion.

Tests were failing because DoclingParser now extends BaseChunkedParser,
which introduced a new code path: read PDF from storage -> count pages
via pypdf -> decide chunking. Tests needed three fixes:

1. Mock get_pdf_page_count (returns 1) so invalid mock PDF bytes don't
   cause ValueError in pypdf's PdfReader
2. Fix default_storage mock path in test_pipeline_hardening.py from
   docling_parser_rest to chunked_parser (where storage is now used)
3. Set chunk_retry_limit=0 to disable per-chunk retries in unit tests
   (prevents duplicate requests.post calls and 5s backoff delays)
claude bot commented Feb 22, 2026

PR Review: Add BaseChunkedParser for automatic large PDF splitting and reassembly

Overall this is well-designed and clearly thought through. The layered architecture (utility functions → abstract base → concrete parser), the post-reassembly hook, and the per-chunk retry logic are all solid. Test coverage is good. I have a mix of bugs, design concerns, and minor nits below.


Bugs

1. Duplicate line in test (test_pdf_splitting.py)

def test_accepts_prebuilt_reader(self):
    pdf_bytes = make_test_pdf(10)
    reader = PdfReader(io.BytesIO(pdf_bytes))
    reader = PdfReader(io.BytesIO(pdf_bytes))  # ← duplicate, dead assignment
    chunk = split_pdf_by_page_range(pdf_bytes, 2, 5, reader=reader)

Harmless but should be removed.

2. Infinite loop if max_pages_per_chunk <= 0 (pdf_splitting.py)

while start < total_pages:
    end = min(start + max_pages_per_chunk, total_pages)  # end == start if max == 0
    chunks.append((start, end))
    start = end  # never advances

If a misconfigured or malicious setting supplies max_pages_per_chunk = 0, this is an infinite loop. Add a guard at the top of calculate_page_chunks:

if max_pages_per_chunk <= 0:
    raise ValueError(f"max_pages_per_chunk must be > 0, got {max_pages_per_chunk}")

Similarly, ThreadPoolExecutor(max_workers=0) raises ValueError — a guard on max_concurrent_chunks would make the error message clearer.


Design Concerns

3. Single-chunk fast-path in _reassemble_chunk_results is dead code and creates an ID inconsistency

if len(chunk_results) == 1 and page_offsets[0] == 0:
    return chunk_results[0]  # IDs are NOT prefixed (e.g. "ann-1")

In the production flow, _reassemble_chunk_results is only reached when len(chunks) > 1 — single-chunk documents are handled before the call site. So this branch is never exercised in production.

When it is exercised (tests calling the function directly), it creates a subtle inconsistency: a single chunk at offset 0 returns unprefixed IDs (ann-1), but two chunks produce prefixed IDs (c0_ann-1, c1_ann-1). This will matter if calling code ever caches or cross-references IDs.

Recommendation: either remove the short-circuit (always prefix, even for single chunk), or document explicitly that this is a test-only path and add a comment to prevent future surprise.

4. Cross-chunk parent-child is a significant data quality limitation — consider filing a follow-up

The orphaned-parent warning is appreciated, but for structure-heavy contracts this will be frequent. Consider whether the chunk boundary could be shifted to avoid splitting in the middle of a detected section: e.g., after a first pass that reads only page structure metadata, snap chunk boundaries to section headers. Even a heuristic (never split on a page that starts a new heading) would reduce orphan rate significantly. This could be a follow-up issue rather than blocking this PR, but worth filing.

5. time.sleep in a Celery worker thread

backoff = 5 * (2 ** (attempt - 1))
time.sleep(backoff)

With the default chunk_retry_limit = 1 the max sleep is 5 s. For concurrent dispatch this blocks one thread in the pool, not the worker process, so it's acceptable. But if someone raises chunk_retry_limit this could block a worker for 5 + 10 + 20 + ... seconds. Consider adding a comment capping the max reasonable value, or add a max-backoff constant (MAX_CHUNK_RETRY_BACKOFF_SECONDS).
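A capped schedule might look like this (the constant name follows the reviewer's suggestion; the 30 s value matches the cap later adopted in the follow-up commit for #914):

```python
MAX_CHUNK_RETRY_BACKOFF_SECONDS = 30  # upper bound on any single back-off sleep

def capped_backoff(attempt: int, base_seconds: float = 5.0) -> float:
    """Exponential back-off (5s, 10s, 20s, ...) capped at a fixed maximum,
    so raising chunk_retry_limit cannot block a worker thread indefinitely."""
    return min(base_seconds * (2 ** (attempt - 1)), MAX_CHUNK_RETRY_BACKOFF_SECONDS)

schedule = [capped_backoff(a) for a in (1, 2, 3, 4, 5)]  # 5, 10, 20, 30, 30
```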


Potential Flakiness

6. Timing assertion in test_concurrent_failure_cancels_remaining

elapsed = time.monotonic() - start
self.assertLess(elapsed, 3.0, "Failure should not wait for remaining chunks")

The sleeping chunk uses time.sleep(5) and the test asserts completion in < 3 s. On a heavily loaded CI runner this can be flaky. A better approach is to use an Event to signal when chunk 0 has failed and check that slow chunks were cancelled rather than relying on wall-clock time. Alternatively, lower the sleep on the slow chunks to 0.5 s and keep the assertion at 0.3 s.


Minor / Style

7. Missing boundary test for min_pages_for_chunking exact threshold

test_at_threshold_returns_single tests 74 pages (below 75 threshold). There is no test for exactly 75 pages. calculate_page_chunks(75, 50, 75) would return [(0, 50), (50, 75)] because the check is strict less-than (< min_pages_for_chunking). The name "min pages for chunking" implies that having exactly that many pages triggers chunking, which is correct, but worth a test and a docstring clarification to prevent future confusion.

8. type: ignore in _dispatch_concurrent

return results  # type: ignore[return-value]  # None slots would have raised above

The reasoning is correct, but returning list[Optional[OpenContractDocExport]] as list[OpenContractDocExport] would be clearer with cast from typing instead of suppressing the type check entirely:

from typing import cast
return cast(list[OpenContractDocExport], results)

9. Memory note for concurrent dispatch

The concurrent path holds all chunk bytes in memory simultaneously before the thread pool starts. For a 500-page document with 10 chunks this is significant. A comment noting the trade-off (vs lazy sequential splitting) would help future maintainers understand why the two paths exist.

10. Orphaned-parent log level

logger.warning for orphaned cross-chunk parent refs will appear on virtually every large document that has a hierarchical structure spanning chunk boundaries. Consider logger.debug or only warning once per document (not per orphaned annotation count).


What's Done Well

  • Pure utility functions (get_pdf_page_count, split_pdf_by_page_range, calculate_page_chunks) are easy to unit-test and the tests cover the important edge cases.
  • _post_reassemble_hook cleanly separates image extraction from chunked parsing — moving it from _parse_single_chunk_impl to after reassembly is the right call.
  • cancel_futures=True in the executor shutdown on failure is correct; combined with as_completed early exit it prevents hanging when one chunk fails.
  • The existing test_doc_parser_docling_rest.py patch updates are correct — moving patches from docling_parser_rest.default_storage to chunked_parser.default_storage reflects where the code actually lives now.
  • Changelog entry is clear and follows the project format.

codecov bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 96.91834% with 20 lines in your changes missing coverage. Please review.

Files with missing lines                           | Patch % | Lines
opencontractserver/pipeline/base/chunked_parser.py | 90.21%  | 18 Missing ⚠️
opencontractserver/tests/test_chunked_parser.py    | 99.23%  | 2 Missing ⚠️


JSv4 merged commit 474862a into main Feb 22, 2026
6 checks passed
JSv4 deleted the claude/chunked-doc-processing-nYJRg branch February 22, 2026 01:27
claude bot commented Feb 22, 2026

Code Review: BaseChunkedParser for Large PDF Processing

Overall this is a well-structured addition. The abstraction is clean, the reassembly logic is isolated into a pure function (easy to test), and the public API of existing parsers is untouched. The test coverage is solid. Below are specific findings ranging from bugs to suggestions.


Bugs / Issues

1. Duplicate PdfReader construction in test (test_pdf_splitting.py:94-95)

reader = PdfReader(io.BytesIO(pdf_bytes))
reader = PdfReader(io.BytesIO(pdf_bytes))  # redundant - first reader is discarded

The first reader is immediately discarded. Harmless but noisy - remove one line.

2. Orphaned cross-chunk parent_id references left in output data

The reassembly code detects and warns about orphaned parent_id references but leaves the broken references in the returned data:

# chunked_parser.py ~line 462
if orphaned_count > 0:
    logger.warning(...)
    # no cleanup; consumers downstream will find dangling parent_ids

Downstream code that trusts parent_id to point to a valid annotation will encounter silent data corruption - a warning is not enough. Consider setting orphaned parent_id values to None during reassembly, which is the safe sentinel the rest of the codebase uses for root-level annotations. If you want to preserve the original reference for debugging, store it in a separate field.
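A sketch of that cleanup as a pure pass over the reassembled annotations (the dict shape is illustrative, not the shipped schema):

```python
def null_orphaned_parents(annotations: list[dict]) -> int:
    """After reassembly, replace any parent_id that does not resolve to a
    known annotation with None — the sentinel for root-level annotations —
    instead of leaving a dangling reference. Returns the orphan count."""
    known_ids = {a["id"] for a in annotations}
    orphans = 0
    for a in annotations:
        parent = a.get("parent_id")
        if parent is not None and parent not in known_ids:
            a["parent_id"] = None  # demote to root rather than dangle
            orphans += 1
    return orphans

# chunk 1's paragraph references a section that lives in chunk 0 under a
# different prefix, so its parent_id cannot resolve after reassembly:
anns = [{"id": "c0_s1", "parent_id": None},
        {"id": "c1_p1", "parent_id": "c1_s1"}]
count = null_orphaned_parents(anns)  # 1; c1_p1 becomes a root annotation
```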

3. cancel_futures=True does not interrupt running threads

In _dispatch_concurrent:

finally:
    executor.shutdown(wait=False, cancel_futures=True)

cancel_futures=True only cancels queued futures that have not started executing yet. Python threads that are already running (e.g., mid-HTTP-request or sleeping through retry back-off) will NOT be interrupted - they continue in the background until they complete naturally. On failure, you may have several lingering HTTP requests to the Docling service consuming connections.

The test_concurrent_failure_cancels_remaining test passes (< 3s) only because wait=False causes the main thread not to block - the background threads are still running their 5s sleep after the test returns. This is an acceptable trade-off, but the docstring for _dispatch_concurrent should document that background threads may outlive the raised exception.


Design / Architecture Concerns

4. Memory spike in concurrent mode - all chunk bytes held simultaneously

# chunked_parser.py ~line 207
chunk_data: list[tuple[int, bytes, int]] = []
for idx, (start, end) in enumerate(chunks):
    chunk_bytes = split_pdf_by_page_range(...)
    chunk_data.append((idx, chunk_bytes, start))
# All chunk bytes are in memory before any chunk is submitted
chunk_results = self._dispatch_concurrent(..., chunk_data=chunk_data, ...)

For concurrent dispatch, all chunk PDFs are pre-split and held in memory before any chunk is submitted. For a 500-page high-resolution PDF split into 10 chunks, this holds 10x the per-chunk PDF size simultaneously (plus the original pdf_bytes). Sequential dispatch correctly avoids this by splitting lazily.

A better approach would be to pre-compute only the page ranges and let each worker thread call split_pdf_by_page_range with the shared PdfReader (which is already being created), producing bytes only when needed.

5. _parse_single_chunk_impl thread-safety requirement not documented

When max_concurrent_chunks > 1, _parse_single_chunk_impl is called concurrently from multiple threads. The abstract method's docstring doesn't mention this. Subclass authors could unknowingly use shared mutable state and introduce race conditions.

Add a note: "When max_concurrent_chunks > 1, this method is called concurrently from multiple threads. Implementations must be thread-safe."

6. _reassemble_chunk_results short-circuit skips ID prefixing inconsistently

if len(chunk_results) == 1 and page_offsets[0] == 0:
    return chunk_results[0]  # no c0_ prefix applied

When this early return fires, annotation IDs are not prefixed, but when a single chunk goes through the normal loop path (offset > 0), they get the c0_ prefix. This is currently only reachable in tests (production single-chunk paths bypass _reassemble_chunk_results entirely), but it is an unexpected inconsistency for future callers of the pure function.


Minor / Nits

7. Content joining separator may lose paragraph boundaries

"content": "\n".join(combined_content_parts),

If each chunk's parser returns content with a trailing newline, this produces double newlines at boundaries. If content ends without a trailing newline, two chunks' text is joined with only a single newline - potentially breaking downstream consumers that rely on double-newline paragraph separation. Confirm this matches expectations, or use "\n\n" to be safe.
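One way to make the join robust either way (a sketch of the reviewer's suggestion, not the shipped code):

```python
def join_chunk_content(parts: list[str]) -> str:
    """Strip trailing newlines before joining so every chunk boundary gets
    exactly one blank line, whether or not a chunk's text ends with a
    newline. This preserves double-newline paragraph separation downstream."""
    return "\n\n".join(p.rstrip("\n") for p in parts)

joined = join_chunk_content(["Chunk one.\n", "Chunk two."])
```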

8. executor not used as a context manager

executor = ThreadPoolExecutor(max_workers=max_workers)
try:
    ...
finally:
    executor.shutdown(wait=False, cancel_futures=True)

Prefer "with ThreadPoolExecutor(...) as executor:" for idiomatic Python. Since you are explicitly controlling shutdown behavior (wait=False), the manual approach works - but a context manager makes the cleanup intent clearer.


Test Coverage Gaps

  • No test verifies that _post_reassemble_hook is also called for single-chunk (no-chunking) documents. The existing test_post_reassemble_hook_called only covers the chunked path (100-page doc). Add a test where min_pages_for_chunking is high and the PDF is small, confirming the hook still fires.
  • No test for _dispatch_concurrent with max_concurrent_chunks < len(chunks) to verify the worker pool is correctly capped.

Summary

Category       | Issue                                                         | Severity
Bug            | Duplicate PdfReader in test                                   | Low
Data integrity | Orphaned parent_id left in output                             | Medium
Resource       | Background threads not interruptible on failure               | Low-Medium
Performance    | All chunk bytes held in memory for concurrent mode            | Medium
Documentation  | Thread-safety of _parse_single_chunk_impl not stated          | Low
Design         | Single-chunk early-return in pure function skips ID prefixing | Low

The cross-chunk parent_id orphaning issue (#2) is the one I would prioritize - silent data corruption risk is real even if it is a documented limitation. Zeroing out orphaned references instead of logging and moving on would make the behavior safe and predictable.

JSv4 pushed a commit that referenced this pull request Feb 22, 2026
Address all 10 refinement items cataloged in issue #914:

- Remove duplicate PdfReader assignment in test_pdf_splitting.py
- Add input validation for max_pages_per_chunk, min_pages_for_chunking,
  and max_concurrent_chunks to prevent infinite loops
- Remove dead single-chunk fast-path in _reassemble_chunk_results that
  caused ID prefixing inconsistency between single and multi-chunk results
- Fix flaky timing assertion in test_concurrent_failure_cancels_remaining
  by reducing sleep duration instead of relying on wall-clock bounds
- Add boundary test for exact min_pages_for_chunking threshold (75 pages)
- Replace type: ignore with cast() in _dispatch_concurrent return
- Add comment documenting memory trade-off in concurrent chunk dispatch
- Downgrade orphaned parent-child log from warning to debug level
- Add MAX_CHUNK_RETRY_BACKOFF_SECONDS (30s) cap on exponential backoff
- Document cross-chunk parent-child limitation with follow-up suggestion

Closes #914

https://claude.ai/code/session_01Y6hT1BWp6TaZFS9tacGNYz