Fix context guardrails token counting and summary merging #920
Conversation
Address all ten items from the context guardrails review:

1. Harden `truncate_tool_output` against negative slice indices by using an explicit `char_budget = max(0, max_chars - len(notice))`
2. `messages_to_proxies` already in `__all__` (no change needed)
3. Fix token double-counting by separating `stored_summary_tokens` from `system_prompt_tokens` in `compact_message_history`
4. Document why `compacted_before_message_id` uses `BigIntegerField`
5. Strip duplicate `COMPACTION_SUMMARY_PREFIX` during summary merges
6. Add clarifying comment on the unreachable `recent_count < 1` guard
7. Convert async test from `asyncio.run()` to native `async def` methods
8. Extend sentence extraction regex to handle markdown bullets and paragraph boundaries
9. Strengthen truncation test assertions with exact length bounds and beginning-of-string checks
10. Add `TestCompactionBookmarkDatabaseFilter` verifying `id__gt` filtering

https://claude.ai/code/session_01XMfrtouuRV6wHwL9HB4iKL
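The hardened guard from item 1 can be sketched as below. The `max(0, ...)` clamp and the function name come from the commit message; the notice text and exact signature are assumptions for illustration.

```python
# Assumed notice text; the real TOOL_OUTPUT_TRUNCATION_NOTICE may differ.
TOOL_OUTPUT_TRUNCATION_NOTICE = "\n\n[tool output truncated]"

def truncate_tool_output(content: str, max_chars: int) -> str:
    """Return at most roughly max_chars characters, taken from the beginning."""
    if len(content) <= max_chars:
        return content
    # Clamp to zero so a tiny max_chars never produces a negative slice
    # index, which would silently slice from the *end* of the string.
    char_budget = max(0, max_chars - len(TOOL_OUTPUT_TRUNCATION_NOTICE))
    return content[:char_budget] + TOOL_OUTPUT_TRUNCATION_NOTICE
```

With a zero budget the result degenerates to the notice alone, which is why the negative-index defense matters for very small `max_chars` values.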
PR Review: Fix context guardrails token counting and summary merging

Overall the fixes are sound and address real bugs. The core logic is correct. Here are my findings:

Bugs / Issues

1. **`should_compact()` not updated to accept `stored_summary_tokens` (inconsistency)** - `compact_message_history()` now correctly accounts for `stored_summary_tokens` in `total_before`, but `should_compact()` - the public pre-check helper - was not updated with the same parameter. Any caller that uses `should_compact()` as a guard before calling `compact_message_history()` will compute a different (lower) token total than `compact_message_history()` does internally. This means a context window inflated by a large stored summary could slip past the `should_compact()` guard. Either add `stored_summary_tokens` to `should_compact()` for parity, or add a docstring caveat.

2. **Nested `_strip_prefix` function redefined on every call** - In `pydantic_ai_agents.py:371`, `_strip_prefix` is defined inside `_get_message_history()`. This is better placed as a private module-level function in `context_guardrails.py` (alongside `COMPACTION_SUMMARY_PREFIX`) or at least at class scope. As written it is recreated on every compaction event and is untestable in isolation.

Code Quality Observations

3. **Truncation produces exactly `max_chars`; docstring says "at most"** - When `char_budget > 0`, the result is always exactly `max_chars` chars. The function docstring says "at most `max_chars` characters", which is technically correct, but the new tests use `assertLessEqual` when `assertEqual` would be more precise for the positive-budget path. Minor - not a bug.

4. **New list-marker regex alternative could use a clarifying comment** - The new pattern for list markers is correct but slightly non-obvious. The digit-plus-punctuation alternative will match `1.` or `2)` at the start of a line and correctly requires trailing whitespace, so it won't split on mid-sentence decimals. A brief inline comment would help future readers understand why whitespace is required in the lookahead.

Test Quality

5. **`test_truncated_content_from_beginning_not_end` splits on a format-specific string** - The test splits on a hardcoded string literal that relies on `TOOL_OUTPUT_TRUNCATION_NOTICE` starting with a specific prefix. If the notice format changes, the test will either split incorrectly or become a no-op. Consider importing `TOOL_OUTPUT_TRUNCATION_NOTICE` directly in the test and splitting on the actual constant's prefix.

6. **`TestCompactionBookmarkDatabaseFilter` mocks but doesn't validate filtering** - The tests verify that `filter(id__gt=50)` is called on the queryset, but the mock always yields the pre-filtered `expected_messages` regardless. The length assertion therefore only validates the mock, not the filtering logic. This is acceptable for a unit test (integration tests cover the DB), but the test name/docstring could be clearer that it is verifying the call was made rather than the filter's correctness.

Minor Nits
Summary
The core logic changes (token split, prefix deduplication, negative-index defense, async test conversion) are all correct. The PR is in good shape - addressing the `should_compact()` inconsistency is the most important follow-up.
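The `should_compact()` parity follow-up could look like the sketch below. The parameter names follow the review; the threshold constant and return logic are assumptions, since only the accounting inconsistency is described in the thread.

```python
# Assumed threshold; the real value lives in context_guardrails.py.
COMPACTION_THRESHOLD_TOKENS = 8000

def should_compact(
    message_tokens: int,
    system_prompt_tokens: int,
    stored_summary_tokens: int = 0,
) -> bool:
    """Pre-check mirroring compact_message_history's token accounting.

    Without stored_summary_tokens, a large stored summary could inflate
    the real context past the threshold while this guard still says no.
    """
    total_before = message_tokens + system_prompt_tokens + stored_summary_tokens
    return total_before > COMPACTION_THRESHOLD_TOKENS
```

Defaulting the new parameter to `0` keeps existing call sites working while letting careful callers pass the stored summary's size for parity.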
Code Review

Good set of targeted fixes. The bugs addressed here are real and the changes are generally clean. A few issues worth addressing:

Medium:
- Resolve CHANGELOG.md merge conflict with main
- Add `TestGetMessageHistoryCompactionTokenCounting`: exercises the compaction eligibility check path in `_get_message_history` (`pydantic_ai_agents.py` lines 359-373), covering `system_prompt_tokens` and `stored_summary_tokens` computation
- Remove dead code in `TestPersistFailurePreservesContext` that created 2 uncovered lines (unreachable if-branch)
Code Review: Fix context guardrails token counting and summary merging

Overall this is a well-targeted set of fixes. The core bug fixes are correct, and the changelog/docstring improvements are a nice addition. A few things worth discussing before merge.

✅ Correct fixes

- **Token double-counting** - The split of `stored_summary_tokens` out of `system_prompt_tokens` in `compact_message_history` is correct.
- **Duplicate prefix** - Stripping `COMPACTION_SUMMARY_PREFIX` from both summaries before merging prevents the prefix from accumulating across cycles.
- **Sentence extraction regex** - Adding markdown-bullet and paragraph-boundary alternatives is a sound extension.
- **Async test pattern** - Converting `asyncio.run()` wrappers to native `async def` test methods is the right move.
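The async test conversion approved above can be sketched with `unittest.IsolatedAsyncioTestCase` (Python 3.8+), which runs each `async def` test on its own event loop instead of wrapping coroutines in `asyncio.run()`. The class name mirrors the PR's test class; the fixture and test body are placeholders.

```python
import unittest

class TestPersistCompactionOptimisticLock(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        # Stand-in for the real async persistence fixture.
        self.store = {}

    async def test_persist_summary_roundtrip(self):
        # Hypothetical async operation under test; the real tests
        # exercise optimistic locking on compaction persistence.
        self.store["summary"] = "compacted"
        self.assertEqual(self.store["summary"], "compacted")
```

This removes the need for an `asyncio` import in the test module entirely, matching the cleanup described in the commits.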
- Add `stored_summary_tokens` param to `should_compact()` for API consistency with `compact_message_history()`
- Extract nested `_strip_prefix` into module-level `strip_compaction_prefix()` in `context_guardrails.py` for testability and reuse
- Fix log accuracy: compute actual post-merge token count instead of using `CompactionResult.estimated_tokens_after`, which only reflects the new summary
- Add two-cycle compaction integration test verifying prefix deduplication, token counting, and summary capping across successive cycles
- Clarify `CHARS_PER_TOKEN_ESTIMATE` and async mock pattern comments
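The extracted module-level helper might look like this; the prefix string itself is an assumption, but the name `strip_compaction_prefix` and its placement alongside `COMPACTION_SUMMARY_PREFIX` come from the commit message.

```python
# Assumed prefix value; the real constant lives in context_guardrails.py.
COMPACTION_SUMMARY_PREFIX = "[Conversation summary]\n"

def strip_compaction_prefix(summary: str) -> str:
    """Remove a leading COMPACTION_SUMMARY_PREFIX, if present."""
    if summary.startswith(COMPACTION_SUMMARY_PREFIX):
        return summary[len(COMPACTION_SUMMARY_PREFIX):]
    return summary
```

As a module-level function it can be imported and unit-tested directly, unlike the nested closure it replaces.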
PR Review: Fix context guardrails token counting and summary merging

Overall this is a solid, well-targeted PR. The core bugs are real, the fixes are correct, and the test coverage added is genuinely comprehensive. A few observations below.

Correctness

**Summary merge deduplication (primary fix)** - The merge logic fix is correct. The old code, `stored_summary.rstrip() + "\n\n" + result.summary`, was concatenating two already-prefixed summaries, so the compaction prefix was duplicated on every successive cycle.

**Token double-counting** - Splitting `stored_summary_tokens` out of `system_prompt_tokens` keeps the stored summary from being counted twice in the threshold calculation.
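The deduplicating merge described above can be sketched as: strip the prefix from both sides, join, then re-add the prefix exactly once. The prefix value and helper names are assumptions; the strip-then-rejoin shape follows the PR text.

```python
# Assumed prefix value; the real constant lives in context_guardrails.py.
COMPACTION_SUMMARY_PREFIX = "[Conversation summary]\n"

def _strip_prefix(text: str) -> str:
    if text.startswith(COMPACTION_SUMMARY_PREFIX):
        return text[len(COMPACTION_SUMMARY_PREFIX):]
    return text

def merge_summaries(stored_summary: str, new_summary: str) -> str:
    """Merge old and new summaries without duplicating the prefix."""
    old = _strip_prefix(stored_summary.rstrip())
    new = _strip_prefix(new_summary)
    return COMPACTION_SUMMARY_PREFIX + old + "\n\n" + new
```

Because both inputs are stripped first, the merged result carries the prefix exactly once no matter how many compaction cycles feed into it.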
**Regex extension for bullet lists** - The new alternation for list markers (`-`, `*`, `•`, numbered items) is correct and prevents bullet-list responses from being treated as a single sentence.

Potential Issue:
Summary
This PR fixes several critical issues in the LLM conversation context management system, particularly around token counting during compaction cycles, truncation edge cases, and summary merging logic.
Key Changes
Token Counting & Compaction
- Added a `stored_summary_tokens` parameter to `compact_message_history()` to properly track previously stored summaries separately from the system prompt. This prevents double-counting the old summary in `total_before` (threshold calculation) while correctly accounting for it being replaced by the new summary in `total_after`.
- Updated `_get_message_history()` to split system prompt and stored summary token counts, passing them separately to the compaction function.

Summary Merging
- Strips `COMPACTION_SUMMARY_PREFIX` from both old and new summaries before merging, then re-adds it once to the merged result.

Truncation Safety
- Hardened `truncate_tool_output()` with an explicit `char_budget = max(0, max_chars - len(notice))` to prevent negative slice indices when `max_chars` is smaller than the truncation notice length. This ensures content is always taken from the beginning of the string, never the end.

Sentence Extraction Robustness
- Extended the first-sentence regex in `_deterministic_summary()` to split on markdown list markers (`-`, `*`, `•`, numbered lists) and paragraph boundaries. This prevents entire bullet-list responses from being incorrectly treated as a single sentence.
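The exact pattern in `_deterministic_summary()` is not shown in this thread, but an illustrative equivalent of the described behavior - stop at sentence punctuation, a paragraph break, or a following list item - looks like this:

```python
import re

# Illustrative only; the real pattern in _deterministic_summary() may
# differ. Note the trailing \s in the list-marker lookahead: it is what
# keeps "1." in "Value 1.5" from being mistaken for a numbered item.
_FIRST_SENTENCE = re.compile(
    r"^(.*?)"                               # lazily capture the first sentence
    r"(?:"
    r"[.!?](?:\s|$)"                        # sentence-ending punctuation
    r"|\n\s*\n"                             # paragraph boundary
    r"|\n(?=\s*(?:[-*\u2022]|\d+[.)])\s)"   # next line starts a list item
    r")",
    re.DOTALL,
)

def first_sentence(text: str) -> str:
    m = _FIRST_SENTENCE.search(text)
    return m.group(1).strip() if m else text.strip()
```

Without the bullet and paragraph alternatives, a response consisting entirely of a bullet list would never hit sentence punctuation and would be swallowed whole.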
Test Improvements
- Converted `TestPersistCompactionOptimisticLock` from `asyncio.run()` wrapper calls to native `async def` test methods, removing the unused `asyncio` import.
- Strengthened truncation tests to assert exact length bounds (`<= max_chars`) and confirm content starts from the beginning of the input.

Documentation
- Updated the `CHARS_PER_TOKEN_ESTIMATE` docstring to explain the intentional use of 3.5 (not 4) for conservative token over-counting.
- Expanded the `compacted_before_message_id` field documentation explaining why `BigIntegerField` (not `ForeignKey`) is appropriate - the `id__gt` filter remains correct even if the cutoff message is deleted.
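The `BigIntegerField` rationale boils down to the bookmark being a plain integer comparison rather than a row reference. A Django-free sketch of the `id__gt` semantics (hypothetical helper name, messages modeled as dicts):

```python
# Pure-Python illustration of the id__gt bookmark filter documented for
# compacted_before_message_id: the cutoff is just a number, so the
# result is unchanged even if the message with that exact id is deleted
# (which is why no ForeignKey integrity constraint is needed).
def messages_after_bookmark(messages, compacted_before_message_id):
    """Return messages with id strictly greater than the bookmark."""
    return [m for m in messages if m["id"] > compacted_before_message_id]
```

A `ForeignKey` would force an on-delete policy for the cutoff row; a bare integer sidesteps that while preserving the filter's meaning.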
The core issue was that during successive compaction cycles, the system was:

- double-counting the stored summary's tokens in the compaction threshold calculation, and
- accumulating a duplicate `COMPACTION_SUMMARY_PREFIX` each time an old and a new summary were merged.

These fixes ensure accurate token budgeting for compaction decisions and cleaner, more predictable summary text across multiple compaction cycles.
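Under hypothetical names, the before/after accounting these fixes aim for can be illustrated as follows; the function and its arguments are assumptions, but the counting rules (stored summary counted once before, replaced by the new summary after) follow the PR description.

```python
def compaction_token_totals(
    message_tokens: int,
    system_prompt_tokens: int,
    stored_summary_tokens: int,
    new_summary_tokens: int,
    recent_tokens: int,
):
    # Before compaction: system prompt + stored summary + full history,
    # with the stored summary counted exactly once (no double-counting).
    total_before = system_prompt_tokens + stored_summary_tokens + message_tokens
    # After compaction: the stored summary is *replaced* by the new
    # summary, and the compacted history collapses into it, leaving only
    # the recent, uncompacted messages.
    total_after = system_prompt_tokens + new_summary_tokens + recent_tokens
    return total_before, total_after
```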
https://claude.ai/code/session_01XMfrtouuRV6wHwL9HB4iKL