
Fix context guardrails token counting and summary merging#920

Merged
JSv4 merged 5 commits into main from claude/fix-issue-907-MSgfu on Feb 22, 2026
Conversation

JSv4 (Collaborator) commented Feb 22, 2026

Summary

This PR fixes several critical issues in the LLM conversation context management system, particularly around token counting during compaction cycles, truncation edge cases, and summary merging logic.

Key Changes

Token Counting & Compaction

  • Added stored_summary_tokens parameter to compact_message_history() to properly track previously stored summaries separately from the system prompt. This prevents double-counting the old summary in total_before (threshold calculation) while correctly accounting for it being replaced by the new summary in total_after.
  • Fixed token estimation in _get_message_history() to split system prompt and stored summary token counts, passing them separately to the compaction function.
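The accounting described above can be sketched as follows. This is a simplified illustration of the before/after totals, not the project's actual function (names like stored_summary_tokens follow the PR description; the helper itself is hypothetical):

```python
def compaction_totals(
    message_tokens: list[int],
    recent_tokens: int,
    new_summary_tokens: int,
    *,
    system_prompt_tokens: int = 0,
    stored_summary_tokens: int = 0,
) -> tuple[int, int]:
    # total_before counts the previously stored summary exactly once,
    # separately from the system prompt (no double-counting).
    total_before = system_prompt_tokens + stored_summary_tokens + sum(message_tokens)
    # total_after excludes stored_summary_tokens entirely: the new
    # summary replaces the old one, so only the new summary is counted.
    total_after = system_prompt_tokens + new_summary_tokens + recent_tokens
    return total_before, total_after
```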

Summary Merging

  • Prevented duplicate prefix accumulation in successive compaction cycles by stripping COMPACTION_SUMMARY_PREFIX from both old and new summaries before merging, then re-adding it once to the merged result.
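A minimal sketch of the strip-then-merge approach (the prefix value here is a placeholder, not the project's actual constant):

```python
COMPACTION_SUMMARY_PREFIX = "[Conversation summary]\n"  # placeholder value

def strip_prefix(text: str) -> str:
    """Remove the compaction prefix if present; otherwise return text unchanged."""
    if text.startswith(COMPACTION_SUMMARY_PREFIX):
        return text[len(COMPACTION_SUMMARY_PREFIX):]
    return text

def merge_summaries(old: str, new: str) -> str:
    # Strip the prefix from BOTH inputs, then re-add it exactly once,
    # so successive compaction cycles never accumulate duplicate headers.
    old_body = strip_prefix(old).rstrip()
    new_body = strip_prefix(new).strip()
    return COMPACTION_SUMMARY_PREFIX + old_body + "\n\n" + new_body
```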

Truncation Safety

  • Replaced fragile guard clause in truncate_tool_output() with explicit char_budget = max(0, max_chars - len(notice)) to prevent negative slice indices when max_chars is smaller than the truncation notice length. This ensures content is always taken from the beginning of the string, never the end.
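A sketch of the guarded truncation path described above (the notice text is a placeholder; the real TOOL_OUTPUT_TRUNCATION_NOTICE differs):

```python
TOOL_OUTPUT_TRUNCATION_NOTICE = "\n[output truncated]"  # placeholder text

def truncate_tool_output(output: str, max_chars: int) -> str:
    if len(output) <= max_chars:
        return output
    notice = TOOL_OUTPUT_TRUNCATION_NOTICE
    # max(0, ...) guarantees the slice index can never go negative,
    # even when max_chars is smaller than the notice itself.
    char_budget = max(0, max_chars - len(notice))
    if char_budget == 0:
        # Hard truncation: the notice does not fit, so keep only content,
        # always taken from the beginning of the string.
        return output[:max_chars]
    return output[:char_budget] + notice
```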

Sentence Extraction Robustness

  • Extended first-sentence regex in _deterministic_summary() to split on:

    • Double-newlines (paragraph boundaries)
    • Newlines before markdown list markers (-, *, •, and numbered lists)

    This prevents entire bullet-list responses from being incorrectly treated as a single sentence.
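An approximation of the extended splitter (the exact pattern in _deterministic_summary may differ; this mirrors the alternation quoted in the reviews below):

```python
import re

_SENTENCE_SPLIT = re.compile(
    r"(?<=\w{3}[.!?])\s+"        # sentence-ending punctuation after a word
    r"|\n{2,}"                   # paragraph boundary (double newline)
    r"|\n(?=[-*•]\s|\d+[.)]\s)"  # newline before a markdown list marker
)

def first_sentence(text: str) -> str:
    # Without the list-marker and paragraph alternatives, a bullet-list
    # response with no terminal punctuation would be treated as one sentence.
    return _SENTENCE_SPLIT.split(text.strip(), maxsplit=1)[0]
```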

Test Improvements

  • Converted async test pattern in TestPersistCompactionOptimisticLock from asyncio.run() wrapper calls to native async def test methods, removing the unused asyncio import.
  • Strengthened truncation assertions to verify exact upper-bound length (<= max_chars) and confirm content starts from the beginning of input.
  • Added new test cases for truncation from beginning (not end) and for sentence splitting on markdown lists and paragraph boundaries.
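The async test conversion follows the standard native pattern, sketched here with unittest's built-in async test case (the coroutine is a stand-in; the project uses Django test classes, which support async test methods the same way):

```python
import asyncio
import unittest

async def fetch_value() -> int:
    # Hypothetical stand-in for the coroutine under test.
    await asyncio.sleep(0)
    return 42

class TestFetch(unittest.IsolatedAsyncioTestCase):
    # Native async test method: no asyncio.run() wrapper needed,
    # and no module-level asyncio import just for test plumbing.
    async def test_fetch(self) -> None:
        self.assertEqual(await fetch_value(), 42)
```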

Documentation

  • Clarified CHARS_PER_TOKEN_ESTIMATE docstring to explain the intentional use of 3.5 (not 4) for conservative token over-counting.
  • Expanded compacted_before_message_id field documentation explaining why BigIntegerField (not ForeignKey) is appropriate — the id__gt filter remains correct even if the cutoff message is deleted.
  • Added clarifying comments on defensive code paths explaining their purpose and when they're reachable.

Implementation Details

The core issue was that during successive compaction cycles, the system was:

  1. Double-counting the stored summary in token calculations
  2. Accumulating duplicate prefix headers in merged summaries
  3. Treating multi-line responses (especially bullet lists) as single sentences

These fixes ensure accurate token budgeting for compaction decisions and cleaner, more predictable summary text across multiple compaction cycles.

https://claude.ai/code/session_01XMfrtouuRV6wHwL9HB4iKL

Address all ten items from the context guardrails review:

1. Harden truncate_tool_output against negative slice indices by using
   explicit char_budget = max(0, max_chars - len(notice))
2. messages_to_proxies already in __all__ (no change needed)
3. Fix token double-counting by separating stored_summary_tokens from
   system_prompt_tokens in compact_message_history
4. Document why compacted_before_message_id uses BigIntegerField
5. Strip duplicate COMPACTION_SUMMARY_PREFIX during summary merges
6. Add clarifying comment on the unreachable recent_count < 1 guard
7. Convert async test from asyncio.run() to native async def methods
8. Extend sentence extraction regex to handle markdown bullets and
   paragraph boundaries
9. Strengthen truncation test assertions with exact length bounds and
   beginning-of-string checks
10. Add TestCompactionBookmarkDatabaseFilter verifying id__gt filtering

https://claude.ai/code/session_01XMfrtouuRV6wHwL9HB4iKL
claude bot commented Feb 22, 2026

PR Review: Fix context guardrails token counting and summary merging

Overall the fixes are sound and address real bugs. The core logic is correct. Here are my findings:


Bugs / Issues

1. should_compact() not updated to accept stored_summary_tokens (inconsistency)

compact_message_history() now correctly accounts for stored_summary_tokens in total_before, but should_compact() - the public pre-check helper - was not updated with the same parameter. Any caller that uses should_compact() as a guard before calling compact_message_history() will compute a different (lower) token total than compact_message_history() does internally. This means a context window inflated by a large stored summary could slip past the should_compact() guard. Either add stored_summary_tokens to should_compact() for parity, or add a docstring caveat.
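A minimal sketch of the parity fix being requested (signature and threshold semantics are assumptions for illustration, not the project's actual API):

```python
def should_compact(
    message_tokens: list[int],
    context_limit: int,
    *,
    system_prompt_tokens: int = 0,
    stored_summary_tokens: int = 0,
    threshold_ratio: float = 0.8,
) -> bool:
    # Count the stored summary toward the pre-compaction total, matching
    # compact_message_history, so a large stored summary cannot slip
    # past this guard while failing the internal check.
    total = system_prompt_tokens + stored_summary_tokens + sum(message_tokens)
    return total >= context_limit * threshold_ratio
```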


2. Nested _strip_prefix function redefined on every call

In pydantic_ai_agents.py:371, _strip_prefix is defined inside _get_message_history(). This is better placed as a private module-level function in context_guardrails.py (alongside COMPACTION_SUMMARY_PREFIX) or at least at class scope. As written it is recreated on every compaction event and is untestable in isolation.


Code Quality Observations

3. Truncation produces exactly max_chars - docstring says at most

When char_budget > 0, the result is always exactly max_chars chars. The function docstring says "at most max_chars characters", which is technically correct, but the new tests use assertLessEqual where assertEqual would be more precise for the positive-budget path. Minor - not a bug.

4. Regex new list-marker alternative could use a clarifying comment

The new pattern for list markers is correct but slightly non-obvious. The digit+punctuation alternative will match 1. or 2) at the start of a line and correctly requires trailing whitespace, so it won't split on mid-sentence decimals. A brief inline comment would help future readers understand why whitespace is required in the lookahead.


Test Quality

5. test_truncated_content_from_beginning_not_end splits on a format-specific string

The test splits on a hardcoded string literal that relies on TOOL_OUTPUT_TRUNCATION_NOTICE starting with a specific prefix. If the notice format changes, the test will either split incorrectly or become a no-op. Consider importing TOOL_OUTPUT_TRUNCATION_NOTICE directly in the test and splitting on the actual constant's prefix.

6. TestCompactionBookmarkDatabaseFilter mocks but doesn't validate filtering

The tests verify that filter(id__gt=50) is called on the queryset, but the mock always yields the pre-filtered expected_messages regardless. The length assertion therefore only validates the mock, not the filtering logic. This is acceptable for a unit test (integration tests cover the DB), but the test name/docstring could be clearer that it is verifying the call was made rather than the filter's correctness.


Minor Nits

  • The compacted_before_message_id field comment in models.py now appears both inline (code comment) and in help_text, which is slightly redundant.
  • test_markdown_bullet_list_split asserts that Liability is not in the summary but does not assert that Indemnification (the first list item) is also excluded. Adding that assertion would make the test more airtight.

Summary

  • Medium: should_compact() missing stored_summary_tokens - semantic inconsistency with compact_message_history()
  • Low: _strip_prefix should be module-level, not recreated inline
  • Low: Test splitting on notice format string literal instead of imported constant
  • Nit: Regex comment, docstring precision, redundant model field comment

The core logic changes (token split, prefix deduplication, negative-index defense, async test conversion) are all correct. The PR is in good shape - addressing the should_compact() inconsistency is the most important follow-up.

@JSv4 JSv4 linked an issue Feb 22, 2026 that may be closed by this pull request
claude bot commented Feb 22, 2026

Code Review

Good set of targeted fixes. The bugs addressed here are real and the changes are generally clean. A few issues worth addressing:


Medium: should_compact not updated to match compact_message_history

compact_message_history now correctly includes stored_summary_tokens in total_before, but the public should_compact function still ignores it:

# context_guardrails.py:209 — does NOT account for stored_summary_tokens
total_tokens = system_prompt_tokens + sum(m.token_estimate for m in messages)

Any caller using should_compact as a pre-check will get a different answer than compact_message_history's internal check (which is what actually gates compaction). Since should_compact is exported in __all__, it's part of the public API. It should accept and propagate stored_summary_tokens the same way compact_message_history now does.


Medium: estimated_tokens_after underestimates when merging old + new summaries

CompactionResult.estimated_tokens_after is computed in compact_message_history using only the new summary tokens:

# context_guardrails.py:315
total_after = system_prompt_tokens + summary_tokens + recent_tokens

But in _get_message_history the stored summary is merged (old body + new body), which can be substantially larger than result.summary until cap_summary_length kicks in. The token metrics reported upstream via _HistoryResult (and surfaced in context_status) will therefore undercount after a merge cycle. This is informational-only so it's not a correctness bug in the compaction logic, but it's worth a comment or a follow-up fix.


Minor: _strip_prefix closure defined inside an if block inside an async def

# pydantic_ai_agents.py:388
if stored_summary:
    def _strip_prefix(text: str) -> str:
        ...

A new function object is allocated every time compaction fires. This should be a module-level helper (or at minimum hoisted to method scope) to keep the hot path allocation-free and the function more discoverable/testable.


Minor: Hard-truncation path produces no debug log

When char_budget == 0 the function silently returns output[:max_chars] without logging. The debug log on the main truncation path would be helpful here too — it's the more surprising case and is harder to notice in production.


Minor: test_output_exceeding_limit_is_truncated assertion is still loose

# test_context_guardrails.py:124
self.assertLessEqual(len(result), MAX_TOOL_OUTPUT_CHARS + 200)  # +notice

The fix guarantees len(result) <= max_chars for custom limits and the notice is ~80 chars, but the default-limit test still allows 200 chars of slack. Given that the PR explicitly tightens the other truncation assertions (assertLessEqual(len(result), 100) etc.), this one stands out as inconsistent. The notice is a fixed-width format string, so the expected upper bound could be tightened to MAX_TOOL_OUTPUT_CHARS + len(notice_format) or at worst a small constant.


Minor: TestCompactionBookmarkDatabaseFilter — mock assertion verifies call, not effect

The test constructs expected_messages by manually filtering all_messages, then configures the mock to yield exactly those pre-filtered messages. The assertion then checks that filter(id__gt=50) was called:

self.assertTrue(
    any(call.kwargs.get("id__gt") == 50 for call in filter_calls), ...
)
self.assertEqual(len(result), 3)

The len(result) == 3 assertion is always satisfied by the mock setup regardless of whether the filter call happens. So the test only validates that the filter is called with the right argument, not that calling it without the filter would return a different result. This is acceptable for a unit test (you can't test real DB filtering with mocks), but a comment clarifying the test's scope would prevent confusion about what guarantee it actually provides.


Nit: Changelog date

The changelog entry was updated from 2026-02-21 to 2026-02-22 — fine, but if the date is meant to reflect the release date rather than commit date it's better left as [Unreleased] without a date until the actual release.


Overall: the core fixes (negative slice index, double-counting, duplicate prefix) are correct and the test improvements are solid. Addressing the should_compact inconsistency is the most important item before merge.

codecov bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 97.56098% with 5 lines in your changes missing coverage. Please review.

Files with missing lines:

  • ...encontractserver/llms/agents/pydantic_ai_agents.py: 50.00% patch coverage, 3 lines missing ⚠️
  • ...pencontractserver/tests/test_context_guardrails.py: 98.93% patch coverage, 2 lines missing ⚠️


- Resolve CHANGELOG.md merge conflict with main
- Add TestGetMessageHistoryCompactionTokenCounting: exercises the compaction
  eligibility check path in _get_message_history (pydantic_ai_agents.py lines
  359-373) covering system_prompt_tokens and stored_summary_tokens computation
- Remove dead code in TestPersistFailurePreservesContext that created 2
  uncovered lines (unreachable if-branch)
claude bot commented Feb 22, 2026

Code Review: Fix context guardrails token counting and summary merging

Overall this is a well-targeted set of fixes. The core bug fixes are correct and the changelog/docstring improvements are a nice addition. A few things worth discussing before merge.


✅ Correct fixes

truncate_tool_output (context_guardrails.py:138) — The char_budget = max(0, ...) refactor is cleaner than the old guard, though functionally both are equivalent (the original if max_chars <= len(notice) already prevented negative slice indices). No behavior change, but the new form is more obviously safe to readers.

Token double-counting — The split of system_prompt_tokens and stored_summary_tokens in _get_message_history is the right fix. Previously total_after = system_prompt_tokens + summary_tokens + recent_tokens double-counted the old stored summary because it was baked into system_prompt_tokens. The new approach correctly excludes it from total_after.

Duplicate prefix — Stripping COMPACTION_SUMMARY_PREFIX from both summaries before merging and re-adding it once is correct. The old stored_summary.rstrip() + "\n\n" + result.summary would accumulate the header on every successive compaction cycle.

Sentence extraction regex — Adding \n{2,}|\n(?=[-*•]\s|\d+[.)]\s) is a solid improvement. The lookahead pattern handles -/*/• and both 1. and 1) list styles correctly.

Async test pattern — Converting asyncio.run() wrappers to native async def test methods is the correct modern pattern for Django SimpleTestCase.


⚠️ Issues to address

1. should_compact is now inconsistent with compact_message_history (context_guardrails.py:195)

should_compact still doesn't accept stored_summary_tokens, but compact_message_history now does:

# should_compact — missing stored_summary_tokens
def should_compact(
    messages, model_name, *, system_prompt_tokens=0, threshold_ratio=...
)

Any caller using should_compact to pre-check whether to compact (rather than delegating to compact_message_history directly) will under-count tokens when a stored summary exists. Since should_compact is in __all__ it's part of the public API. This PR introduced the conceptual split between system-prompt tokens and stored-summary tokens, so it should keep should_compact consistent.

2. _strip_prefix as a one-off nested function (pydantic_ai_agents.py:383)

_strip_prefix is defined inside the if result.compacted: block. It has no closure dependencies — it's a pure function of its argument — so it would be more reusable and directly testable as a module-level private function in context_guardrails.py (alongside cap_summary_length). As a nested function it gets re-created on every compaction call and can't be exercised in isolation.

3. estimated_tokens_after underestimates post-merge token count

In compact_message_history:

summary_tokens = estimate_token_count(summary_with_prefix)   # only the NEW summary
total_after = system_prompt_tokens + summary_tokens + recent_tokens

But in _get_message_history, the actual value written to the DB is merged_summary = cap_summary_length(COMPACTION_SUMMARY_PREFIX + old_body + "\n\n" + new_body), which is larger than result.summary. So CompactionResult.estimated_tokens_after understates reality when a prior summary exists. This doesn't break compaction logic (it doesn't feed back into future thresholds), but it makes the status reporting / log output misleading:

Compacted conversation: removed N messages, ... (tokens 95000 → 12000)
# but actual tokens after merge are ~16000 due to merged summary

Worth noting in a follow-up if not fixing now.

4. New async mock pattern is fragile in TestCompactionBookmarkDatabaseFilter

mock_qs.__aiter__ = lambda self: aiter_messages()

Setting magic methods directly on a MagicMock instance relies on MagicMock's __setattr__ forwarding to the mock's synthetic class. It works in practice today, but the idiomatic pattern recommended by unittest.mock for async iteration is:

# More robust:
mock_qs.__aiter__ = MagicMock(return_value=aiter_messages())

Or use AsyncMock directly. The current form passes the lambda self parameter (receiving the mock object itself), which is unusual and will confuse readers.
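For reference, the idiom documented in unittest.mock for async iteration is even simpler than either form above: assign a plain iterable to __aiter__.return_value and MagicMock supplies the async-iterator protocol (queryset and message names here are illustrative):

```python
import asyncio
from unittest.mock import MagicMock

mock_qs = MagicMock()
# Documented unittest.mock pattern: MagicMock supports __aiter__ natively,
# so a plain list is enough to drive `async for`.
mock_qs.__aiter__.return_value = ["m1", "m2", "m3"]

async def collect(qs):
    return [m async for m in qs]

result = asyncio.run(collect(mock_qs))
```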

5. Missing integration-level test for two-cycle compaction

The new TestGetMessageHistoryCompactionTokenCounting tests verify that stored_summary_tokens is passed correctly, but there's no test that runs two successive compaction cycles and asserts the merged summary doesn't accumulate the prefix header or that token counts are correct in cycle 2. A round-trip test would catch regressions of all three bugs simultaneously.


Minor nit

The docstring update in constants/context_guardrails.py says 3.5 instead of 4 to intentionally over-count but 3.5 chars/token gives more tokens per character than 4 chars/token, so it over-counts tokens (i.e. under-estimates the text volume per token). The intent is correct but the phrasing could be clearer: "Using 3.5 instead of 4 chars-per-token means we over-estimate the token count for a given string, triggering compaction conservatively."
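To make that phrasing concrete, the over-counting arithmetic can be sketched as (constant name from the PR; the real implementation may round differently):

```python
import math

CHARS_PER_TOKEN_ESTIMATE = 3.5  # deliberately below the common ~4 chars/token

def estimate_token_count(text: str) -> int:
    # A smaller chars-per-token divisor yields MORE estimated tokens for
    # the same text, so compaction triggers conservatively (earlier,
    # never later, than a 4 chars/token estimate would).
    return math.ceil(len(text) / CHARS_PER_TOKEN_ESTIMATE)
```

For a 700-character string this estimates 200 tokens, versus 175 at 4 chars/token: the same text always looks "bigger", which is the intended safety margin.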


Summary

The core fixes are correct and necessary. Items 1 and 2 above are small enough to fix in this PR; items 3–5 can reasonably go on the backlog. Happy to approve once should_compact is updated for API consistency.

JSv4 added 2 commits February 22, 2026 01:21
- Add stored_summary_tokens param to should_compact() for API consistency
  with compact_message_history()
- Extract nested _strip_prefix into module-level strip_compaction_prefix()
  in context_guardrails.py for testability and reuse
- Fix log accuracy: compute actual post-merge token count instead of using
  CompactionResult.estimated_tokens_after which only reflects new summary
- Add two-cycle compaction integration test verifying prefix deduplication,
  token counting, and summary capping across successive cycles
- Clarify CHARS_PER_TOKEN_ESTIMATE and async mock pattern comments
claude bot commented Feb 22, 2026

PR Review: Fix context guardrails token counting and summary merging

Overall this is a solid, well-targeted PR. The core bugs are real, the fixes are correct, and the test coverage added is genuinely comprehensive. A few observations below.


Correctness

Summary merge deduplication (primary fix) — The merge logic fix is correct. The old code:

stored_summary.rstrip() + "\n\n" + result.summary

was concatenating two COMPACTION_SUMMARY_PREFIX-prefixed strings directly, producing a doubled header on every cycle. The new approach strips both before re-adding once. The three-cycle reasoning in the test test_two_cycles_no_duplicate_prefix proves the fix holds across repeated compactions.

Token double-counting — Splitting system_prompt_tokens and stored_summary_tokens and passing them separately is the right model. total_before correctly reflects the full context cost, and by not including stored_summary_tokens in total_after, the estimate correctly reflects the new summary replacing the old one.

truncate_tool_output negative-index guard — The refactoring to char_budget = max(0, max_chars - len(notice)) is cleaner and makes the intent explicit. However, the original guard if max_chars <= len(notice) was already functionally correct (it prevented negative indices). The PR description slightly overstates the severity here — this is a readability/defensiveness improvement, not a live bug fix. The new tests that assert result.startswith("x") are a good addition regardless.

Regex extension for bullet lists — The new alternation |\n{2,}|\n(?=[-*•]\s|\d+[.)]\s) is correct and handles the common markdown response formats. The existing lookbehind (?<=\w{3}[.!?]) was already a fixed-length construct valid in Python re, and the added alternates don't break anything.


Potential Issue: estimated_tokens_after in CompactionResult is now stale

The CompactionResult.estimated_tokens_after field is computed inside compact_message_history and reflects the new summary alone (not the merged summary). This PR correctly adds actual_tokens_after to the logging path in _get_message_history, but result.estimated_tokens_after is still returned in the CompactionResult and will be an under-estimate when a previous summary exists.

If any code outside the logging path consumes CompactionResult.estimated_tokens_after to make decisions (e.g., triggering another compaction, capacity checks, telemetry), it will see an optimistic token count. Worth either:

  • Documenting on CompactionResult.estimated_tokens_after that it reflects only the new summary, not the merged one, or
  • Adding a merged_tokens_after field to the result and computing it inside the function where the stored summary tokens are known.

Minor points

getattr(m, "content", "") in actual_tokens_after (pydantic_ai_agents.py):

estimate_token_count(getattr(m, "content", "") or "")

raw_messages are ChatMessage model instances which always have a content attribute. The defensive getattr is unnecessary — a plain m.content or "" is cleaner and doesn't obscure the type contract.

+10 slack in test_two_cycles_summary_stays_capped:

self.assertLessEqual(len(merged), max_chars + 10, ...)

cap_summary_length should produce max_chars + 1 at most (the ellipsis character). A tighter bound like max_chars + 2 (or max_chars + len("…")) would better constrain the cap and catch regressions more precisely.

test_persist_failure_keeps_full_message_list now just asserts result.summary is truthy and that the raw_messages list is still length 20 after swallowing the exception. The removed if not persist_failed branch was the conditional being tested. The simplified version still validates the failure path correctly, but a comment clarifying that state is intentionally not modified on failure would aid future readers.


Tests

The new test classes are well-structured and cover the key scenarios:

  • TestShouldCompactWithStoredSummary — verifies the token counting path
  • TestCompactionBookmarkDatabaseFilter — verifies ORM filter chaining
  • TestGetMessageHistoryCompactionTokenCounting — integration test for correct kwarg passing
  • TestTwoCycleCompactionIntegration — end-to-end multi-cycle verification
  • TestStripCompactionPrefix — edge cases for the new utility

The mock structure in TestCompactionBookmarkDatabaseFilter (using __aiter__ = lambda self: aiter_messages() with a comment explaining why self is required) is a nice touch for future maintainers.

Conversion from asyncio.run() to native async def in TestPersistCompactionOptimisticLock is correct — Django's SimpleTestCase supports async test methods natively since 4.1.


Summary

Approve with suggestions. The primary bugs are correctly fixed and the test suite is meaningfully improved. The main follow-up worth tracking is documenting or updating CompactionResult.estimated_tokens_after so consumers are not silently misled by the under-estimate when a prior summary exists.

@JSv4 JSv4 merged commit 60651ab into main Feb 22, 2026
7 checks passed
@JSv4 JSv4 deleted the claude/fix-issue-907-MSgfu branch February 22, 2026 14:47

Development

Successfully merging this pull request may close these issues.

Improve Context Guardrails for LLM Conversation Management

2 participants