LEADS-240: Token usage should be 0 for a re-run with successful cache#176

Open
xmican10 wants to merge 1 commit into lightspeed-core:main from xmican10:LEADS-240-0-token-usage-when-using-cache

Conversation

xmican10 (Contributor) commented Feb 27, 2026


Description

  • When the LLM cache is enabled, JudgeLLM token counts are not added for cached responses; a unit test covers this scenario.
  • When the API cache is enabled, API call token counts are zeroed when the response is loaded from the cache; a unit test covers this scenario.
  • DeepEval does not count JudgeLLM tokens at all, so this issue does not apply there; that gap is tracked in a separate ticket, LEADS-241.
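The API-cache half of the fix can be sketched as follows. This is a simplified illustration, not the project's actual code: `APIResponse` and the dict-based cache are hypothetical stand-ins for the real models in `src/lightspeed_evaluation/core/api/client.py`; only the idea of zeroing token counts on a cache hit is taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class APIResponse:
    """Simplified stand-in for the project's API response model."""
    text: str
    input_tokens: int
    output_tokens: int


def get_cached_response(cache: dict, key: str) -> Optional[APIResponse]:
    """Return the cached response with token counts zeroed.

    No API call is made on a cache hit, so a re-run should not
    re-count the tokens recorded when the entry was first stored.
    """
    cached = cache.get(key)
    if cached is None:
        return None
    cached.input_tokens = 0
    cached.output_tokens = 0
    return cached
```

A re-run that hits the cache then reports zero token usage while the response payload itself is unchanged.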

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc.; N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Bug Fixes
    • Improved accuracy of token usage tracking. Token counts from cached API responses are now properly zeroed to prevent double-counting. Token consumption is now accurately recorded only when responses are freshly fetched from the service, not when retrieved from cache.

coderabbitai bot (Contributor) commented Feb 27, 2026

Walkthrough

This PR modifies token count handling for cached API responses. The API client now zeros out token counts when retrieving cached responses, and the custom LLM module updates token tracking logic to check for cache hits and only count tokens when responses are not cached, preventing duplicate token accounting.

Changes

Cohort / File(s): Summary

  • API Client Cache Token Handling (src/lightspeed_evaluation/core/api/client.py): Modified _get_cached_response to zero out input_tokens and output_tokens on cached APIResponse objects before returning, ensuring cached responses contribute no token counts.
  • LLM Token Tracking Logic (src/lightspeed_evaluation/core/llm/custom.py): Restructured token tracking to occur in a finally block and added cache-hit detection via response._hidden_params["cache_hit"]; tokens are only recorded when the response exists, the tracker is active, and the response is not a cache hit.
  • API Client Tests (tests/unit/core/api/test_client.py): Added test_get_cached_response_zeros_token_counts to verify token counts are zeroed while other response fields remain intact.
  • LLM Tests (tests/unit/core/llm/test_custom.py): Added test_call_does_not_add_tokens_on_cache_hit to verify tokens are not tracked on cache hits; updated an existing test with _hidden_params = {} to exclude the cache-hit path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lightspeed-evaluation#75: Implements the caching mechanism that this PR modifies to handle token counts correctly on cached responses.

Suggested reviewers

  • asamal4
  • tisnik
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title clearly and specifically describes the main change, zeroing token usage when the cache is successfully hit, which aligns with the primary objectives of the changeset.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/client.py (1)

284-289: Avoid mutating cached response objects in place.

This works functionally, but mutating cached model instances can create hidden side effects. Prefer returning a copy with zeroed token fields.

♻️ Proposed refactor
-        # Zero out token counts for cached responses since no API call was made
-        if cached_response is not None:
-            cached_response.input_tokens = 0
-            cached_response.output_tokens = 0
-
-        return cached_response
+        # Return zero token usage for cache hits without mutating cached object state
+        if cached_response is None:
+            return None
+
+        return cached_response.model_copy(
+            update={"input_tokens": 0, "output_tokens": 0}
+        )
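The copy-based approach the reviewer suggests can be illustrated with a plain dataclass and `dataclasses.replace`; the project's Pydantic-style `model_copy(update=...)` behaves analogously. `APIResponse` here is a simplified hypothetical stand-in for the real model.

```python
from dataclasses import dataclass, replace


@dataclass
class APIResponse:
    """Hypothetical simplified response model."""
    text: str
    input_tokens: int
    output_tokens: int


def zero_tokens_copy(cached: APIResponse) -> APIResponse:
    """Return a copy with zeroed counts, leaving the cache entry intact."""
    return replace(cached, input_tokens=0, output_tokens=0)


original = APIResponse("cached answer", 12, 34)
fresh = zero_tokens_copy(original)
# fresh reports zero tokens; original keeps its stored counts
```

Because the cached object is never mutated, later reads of the cache entry (for example, for debugging or re-serialization) still see the original token counts.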
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/api/client.py` around lines 284 - 289, The
current code mutates the cached_response object in place by setting
cached_response.input_tokens/output_tokens to 0; instead create and return a
copy to avoid side effects—make a shallow copy (e.g., via copy.copy,
dataclass.replace, or the model's own clone/copy method) of cached_response, set
input_tokens and output_tokens to 0 on the copy, and return that copy while
leaving the original cached_response untouched.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e395459 and 691fb68.

📒 Files selected for processing (4)
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/llm/test_custom.py

The comment is anchored on this hunk in src/lightspeed_evaluation/core/llm/custom.py:

litellm.ssl_verify = False

-def call(
+def call(  # pylint: disable=too-many-locals

🛠️ Refactor suggestion | 🟠 Major

Remove newly introduced pylint suppression pragmas.

Line 111 and Line 190 add # pylint: disable comments. Please refactor to satisfy lint without inline suppression.

As per coding guidelines, **/*.py: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead". Based on learnings, the exception for too-many-locals is only acceptable for lazy-import-heavy functions, which does not apply here.

Also applies to: 190-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/llm/custom.py` at line 111, Remove the inline
pylint suppression on the newly modified functions (notably the method named
call) and refactor the implementations to satisfy lint rules instead: for the
call method, reduce the number of local variables by extracting helper
functions, grouping related values into small dataclasses/tuples or reusing
existing attributes, and simplifying complex expressions; apply the same
approach to the other location where a "# pylint: disable" was added (refactor
to smaller helper functions or combine variables) so no "# pylint: disable"
pragmas are required. Ensure all logic remains covered by tests and that helper
functions are private and colocated with the original function to preserve
readability.
