fix: unify RPC error response JSON key to "error" across server and s…#1019

Merged
garrett4wade merged 2 commits into inclusionAI:main from HT-Yuan:bugfix/unify-rpc-error-response-key
Mar 13, 2026

HT-Yuan (Contributor) commented Mar 11, 2026

Description

Unify RPC error response JSON key to "error" across server and schedulers.

rpc_server.py uses "error" as the JSON key in error responses (42 of 45 places), but 3 places in the /configure endpoint incorrectly use "detail". On the consumer side, local.py and slurm.py read the error message with .get("detail", "Unknown error") in most places. Because the server almost never sets that key, the lookup misses the actual error message and falls back to "Unknown error".

This mismatch makes it impossible to debug RPC failures — the real error message from the worker is silently lost.
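The mismatch can be reproduced without the real server. The sketch below is illustrative only: `server_error_body` stands in for the JSON body of an rpc_server.py error response, and the two reader functions are hypothetical stand-ins for the scheduler-side lookups before and after this change.

```python
# Hypothetical stand-in for the JSON body of an rpc_server.py error response.
server_error_body = {"error": "engine onload failed"}


def read_error_before(body: dict) -> str:
    """Old scheduler-side lookup: asks for a key the server does not set."""
    return body.get("detail", "Unknown error")


def read_error_after(body: dict) -> str:
    """Lookup after this change: uses the same key the server writes."""
    return body.get("error", "Unknown error")


print(read_error_before(server_error_body))  # Unknown error
print(read_error_after(server_error_body))   # engine onload failed
```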

Changes:

  • rpc_server.py: Changed 3 occurrences of "detail" → "error" in /configure endpoint (aligning with the other 42 uses of "error")
  • local.py: Changed 8 occurrences of .get("detail", ...) → .get("error", ...)
  • slurm.py: Changed 8 occurrences of .get("detail", ...) → .get("error", ...)
  • test_local_scheduler.py: Updated 3 mock response payloads to use "error" key

Note: proxy_gateway.py (FastAPI) intentionally uses "detail" following FastAPI/OpenAPI conventions and is not affected by this change.
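On the test side, a mocked worker response has to carry its message under "error" to exercise the real path. The following is a minimal sketch using `unittest.mock`; `extract_error` is a hypothetical helper rather than a function from the repository, and the payload text is invented for illustration.

```python
from unittest.mock import Mock


def extract_error(response) -> str:
    """Hypothetical helper mirroring the scheduler-side lookup."""
    return response.json().get("error", "Unknown error")


# Mocked worker response shaped like an rpc_server.py error reply.
mock_response = Mock()
mock_response.status_code = 500
mock_response.json.return_value = {"error": "Engine creation failed"}

# A mock that still uses the old key silently hits the fallback.
stale_response = Mock()
stale_response.status_code = 500
stale_response.json.return_value = {"detail": "Engine creation failed"}

print(extract_error(mock_response))
print(extract_error(stale_response))
```

A stale mock like the second one does not raise; it just yields the fallback string, which is why the mock payloads had to change along with the schedulers.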

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Additional Context

Before this fix, any RPC error (e.g., engine onload failure) would show as "Unknown error" in the traceback because local.py/slurm.py read .get("detail") while rpc_server.py returns {"error": "..."}. This made distributed debugging extremely difficult.

gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an inconsistency in how RPC error responses are formatted, specifically unifying the JSON key used for error messages to 'error'. This change ensures that error messages are correctly propagated and improves the debuggability of RPC failures.

Highlights

  • Error Response Unification: This PR unifies the JSON key used for error responses across the server and schedulers to consistently use 'error' instead of the inconsistent 'detail'.
  • Improved Debugging: The change resolves an issue where the actual error message from the worker was silently lost, making debugging RPC failures difficult.
  • Code Modifications: The PR modifies rpc_server.py, local.py, slurm.py, and test_local_scheduler.py to ensure consistent error key usage.

Changelog

  • local.py
    • Standardized error message extraction using the 'error' key.
  • rpc_server.py
    • Fixed inconsistent error key usage in /configure endpoint.
  • slurm.py
    • Standardized error message extraction using the 'error' key.
  • test_local_scheduler.py
    • Updated mock response payloads to use the 'error' key.

Activity

  • The PR author, HT-Yuan, has addressed an inconsistency in RPC error responses.
  • The changes involve modifications to rpc_server.py, local.py, slurm.py, and test_local_scheduler.py.
  • The PR includes unit tests and documentation updates.

gemini-code-assist bot left a comment

Code Review

The code changes replace the key 'detail' with 'error' in the JSON responses for error messages. The code review identified critical insecure deserialization vulnerabilities in the RPC server and in both the local and Slurm schedulers: responses from worker processes are deserialized using deserialize_value, which could lead to Remote Code Execution (RCE) if a compromised worker returns a malicious payload. The review also suggests combining the identical error handling for status codes 400 and 500 in local.py and slurm.py to reduce duplication, and consolidating the required-field checks in rpc_server.py into a loop for better maintainability.

Comment on lines 436 to +442
  config = data.get("config")
  if config is None:
-     return jsonify({"detail": "Missing 'config' field in request"}), 400
+     return jsonify({"error": "Missing 'config' field in request"}), 400

  rank = data.get("rank")
  if rank is None:
-     return jsonify({"detail": "Missing 'rank' field in request"}), 400
+     return jsonify({"error": "Missing 'rank' field in request"}), 400

medium

To improve maintainability and reduce repetition, you can consolidate the checks for required fields into a loop. This makes it easier to add more required fields in the future.

        for field in ("config", "rank"):
            if data.get(field) is None:
                return jsonify({"error": f"Missing '{field}' field in request"}), 400

        config = data["config"]
        rank = data["rank"]
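The loop can be tried outside Flask. The sketch below is an assumption-laden illustration, not code from the repository: `validate_configure_request` and `REQUIRED_FIELDS` are hypothetical names, and the function returns a plain (body, status) pair where the endpoint would call `jsonify`.

```python
from typing import Optional, Tuple

# Hypothetical constant listing the fields the endpoint requires.
REQUIRED_FIELDS = ("config", "rank")


def validate_configure_request(data: dict) -> Optional[Tuple[dict, int]]:
    """Return (error_body, status) for the first missing field, else None."""
    for field in REQUIRED_FIELDS:
        if data.get(field) is None:
            return {"error": f"Missing '{field}' field in request"}, 400
    return None


print(validate_configure_request({"config": {}}))
print(validate_configure_request({"config": {}, "rank": 0}))
```

Checking `is None` rather than truthiness matters here: a rank of 0 or an empty config is a present value and would be wrongly rejected by a plain `if not data.get(field)` test.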

Comment on lines 868 to 873
  elif response.status_code == 400:
-     error_detail = response.json().get("detail", "Unknown error")
+     error_detail = response.json().get("error", "Unknown error")
      raise WorkerConfigurationError(worker_id, error_detail, str(400))
  elif response.status_code == 500:
-     error_detail = response.json().get("detail", "Unknown error")
+     error_detail = response.json().get("error", "Unknown error")
      raise WorkerConfigurationError(worker_id, error_detail, str(500))

medium

The error handling logic for status codes 400 and 500 is identical. You can combine these elif blocks to reduce code duplication and improve readability.

Suggested change
- elif response.status_code == 400:
-     error_detail = response.json().get("error", "Unknown error")
-     raise WorkerConfigurationError(worker_id, error_detail, str(400))
- elif response.status_code == 500:
-     error_detail = response.json().get("error", "Unknown error")
-     raise WorkerConfigurationError(worker_id, error_detail, str(500))
+ elif response.status_code in (400, 500):
+     error_detail = response.json().get("error", "Unknown error")
+     raise WorkerConfigurationError(worker_id, error_detail, str(response.status_code))
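To see the combined branch behave the same for both status codes, here is a self-contained sketch. Everything in it other than the three suggested lines is assumed: `WorkerConfigurationError` is a stand-in whose real constructor may differ, and `FakeResponse` and `check_configure_response` are hypothetical names invented for the example.

```python
class WorkerConfigurationError(Exception):
    """Hypothetical stand-in; the real class's signature may differ."""

    def __init__(self, worker_id: str, message: str, status: str):
        super().__init__(f"worker {worker_id}: {message} (HTTP {status})")
        self.worker_id = worker_id
        self.message = message
        self.status = status


class FakeResponse:
    """Minimal object exposing status_code and json(), like an HTTP response."""

    def __init__(self, status_code: int, body: dict):
        self.status_code = status_code
        self._body = body

    def json(self) -> dict:
        return self._body


def check_configure_response(worker_id: str, response) -> None:
    if response.status_code == 200:
        return
    elif response.status_code in (400, 500):
        # One branch covers both codes; the status is taken from the response.
        error_detail = response.json().get("error", "Unknown error")
        raise WorkerConfigurationError(worker_id, error_detail, str(response.status_code))


try:
    check_configure_response("w0", FakeResponse(500, {"error": "engine onload failed"}))
except WorkerConfigurationError as exc:
    print(exc)
```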

Comment on lines 311 to 316
  elif response.status_code == 400:
-     error_detail = response.json().get("detail", "Unknown error")
+     error_detail = response.json().get("error", "Unknown error")
      raise WorkerConfigurationError(worker_id, error_detail, str(400))
  elif response.status_code == 500:
-     error_detail = response.json().get("detail", "Unknown error")
+     error_detail = response.json().get("error", "Unknown error")
      raise WorkerConfigurationError(worker_id, error_detail, str(500))

medium

The error handling for status codes 400 and 500 is identical. To improve conciseness and maintainability, you can combine these elif blocks.

Suggested change
- elif response.status_code == 400:
-     error_detail = response.json().get("error", "Unknown error")
-     raise WorkerConfigurationError(worker_id, error_detail, str(400))
- elif response.status_code == 500:
-     error_detail = response.json().get("error", "Unknown error")
-     raise WorkerConfigurationError(worker_id, error_detail, str(500))
+ elif response.status_code in (400, 500):
+     error_detail = response.json().get("error", "Unknown error")
+     raise WorkerConfigurationError(worker_id, error_detail, str(response.status_code))

HT-Yuan (Contributor, Author) commented Mar 11, 2026

Thanks for the review. These comments are about pre-existing code
(deserialize_value), not the changes in this PR. This PR only fixes
the JSON key mismatch ("detail" → "error") between the RPC server
and schedulers. The deserialization logic is unchanged.

Regarding the security concern: the RPC communication is internal
(scheduler ↔ workers it spawned), within a trusted boundary.

garrett4wade (Collaborator) left a comment

LGTM

@garrett4wade garrett4wade merged commit 7cad4da into inclusionAI:main Mar 13, 2026
6 checks passed