Add ACP agent support (Claude Code, Codex) for all benchmarks#440
Add ACP agent support (Claude Code, Codex) for all benchmarks#440simonrosenberg wants to merge 38 commits intomainfrom
Conversation
Add --agent-type CLI argument to swebench-infer. When agent_type='acp', uses ACPAgent (Claude Code via ACP protocol) instead of the default LLM-based Agent. Forwards ANTHROPIC_API_KEY to the runtime container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ACPAgent (Claude Code via ACP) completes the full task in a single conversation turn and doesn't emit FinishAction events. This caused two issues: 1. fake_user_response loop sent 10 "please continue" messages 2. AgentFinishedCritic marked the result as failed (no FinishAction) Fix: set max_fake_responses=0 for ACP mode and use empty_patch_critic which only requires a non-empty git patch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ACPAgent" This reverts commit 193dd94.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code ACP requires settings to allow tool permissions in headless mode. Without this, the Write tool hangs waiting for interactive permission approval. Create ~/.claude/settings.json on the runtime before starting the conversation to allow Edit, Read, and Bash tools. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 'acp-codex' to agent_type choices in models and args parser - Use codex-acp command for acp-codex agent type - Forward OPENAI_API_KEY for codex containers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns naming with 'acp-codex': both now follow the
acp-{provider} pattern, making the agent type self-documenting.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add acp_agent_name and acp_agent_version to test_result when using ACPAgent, enabling tracking of which ACP server and version produced each evaluation result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update acp_command in run_infer.py and vendor SDK to latest commit which fixes the bypass mode resolution and renames all references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Acceptable - Feature works but has several design issues and missing tests that should be addressed.
Linus-Style Verdict: The core idea is sound, but the implementation has too many magic strings and fragile shell operations. A benchmarking framework needs rock-solid reproducibility, and this PR introduces several potential failure points.
Move the ACP agent API key forwarding logic from swebench/run_infer.py into benchmarks/utils/acp.py so it can be reused by other benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests - Expand benchmarks/utils/acp.py with is_acp_agent(), get_acp_command(), setup_acp_workspace(), and env-var existence validation - Fix shell injection in Claude settings.json write (use file_upload) - Make ACP command mapping explicit (raises ValueError for unknown types) - Replace all inline ACP checks in swebench/run_infer.py with shared utils - Add 21 tests covering all ACP utility functions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Good taste - Clean abstraction, data-driven design, previous issues properly addressed.
Linus-Style Verdict: This is how you refactor. The centralized acp.py module with data-driven dicts (_ACP_COMMANDS, _ACP_ENV_VARS) eliminates magic strings and makes adding new providers trivial. Shell injection fixed properly with tempfile. The 21 tests cover edge cases including tuple inputs, missing env vars, and immutability guarantees. Backward compatible with agent_type="default".
The code does exactly what it needs to do without over-engineering. Approved.
The TODO comment about enabling condenser and security analyzer was accidentally removed during the ACP agent refactoring. Restoring it with a link to #463 for tracking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include LMNR_ENV_VARS (LMNR_PROJECT_API_KEY, LMNR_SPAN_CONTEXT) in the forward_env for ACP agents so Laminar tracing context propagates to the ACP subprocess. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace file_upload with execute_command using base64 encoding to work around 500 errors when uploading to paths with tilde expansion. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…benchmarks For each benchmark: - Import ACP utilities and ACPAgent - Forward ACP API key env vars in prepare_workspace - Conditionally create ACPAgent in evaluate_instance - Setup ACP workspace (Claude settings.json) - Capture ACP agent metadata in test results - Pass agent_type to EvalMetadata Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nation ACP agents (Claude Code, Codex) use their own built-in tools and don't make calls to the workspace while processing. The runtime management system considers the workspace idle after ~20 min and terminates it. This causes evaluation failures for long-running instances (gaia, commit0). Fix: Add workspace_keepalive context manager that periodically runs a no-op command on the workspace to prevent idle termination. Applied to all 5 benchmark evaluate_instance methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval_infer.py was overwriting the actual report_path
(e.g. OpenHands.{run_id}.json) with a hardcoded "report.json"
that doesn't exist, causing the evaluation to fail with FileNotFoundError.
Fix by using the actual report path from run_swebench_multimodal_evaluation()
and moving the copy inside the `if report_path:` guard.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ACP utilities to forward both *_API_KEY and *_BASE_URL env vars. This enables routing Claude Code and Codex through the LiteLLM proxy when ANTHROPIC_BASE_URL/OPENAI_BASE_URL are set. Changes: - acp.py: _ACP_ENV_VARS now maps to a list of env vars (key + base URL) - acp.py: get_acp_forward_env forwards both API key and base URL - tests: Updated to check for both env vars Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve merge conflicts: - benchmarks/swebench/run_infer.py: Combined ACP imports with main's new imports (IMAGE_TAG_PREFIX, add_prompt_path_argument, ensure_local_image, etc.) - benchmarks/swtbench/run_infer.py: Same import combination strategy - benchmarks/swebenchmultimodal/run_infer.py: Same import combination strategy - benchmarks/swebenchmultimodal/eval_infer.py: Dropped broken shutil.copy code (had missing import), used main's cleaner implementation - benchmarks/commit0/run_infer.py: Combined ACP imports with main's refactored code - vendor/software-agent-sdk: Used main's commit (bde715c)
Previously, workspace_keepalive waited for the full interval (300s) before sending the first ping. This caused ACP agents to hit runtime idle timeouts when: 1. The workspace was already partially idle when keepalive started 2. Network issues caused a single ping to fail Changes: - Send immediate ping on context entry to reset idle timer - Reduce default interval from 300s to 60s for more robust keepalive - Restructure loop to ping first, then wait (instead of wait first) - Add debug logging for ping success/failure to aid troubleshooting Fixes: #488 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Commit0 tasks are significantly more complex than other benchmarks (full repo implementations) and ACP agents regularly exceed the 30-min / 1-hour timeout. Increase the default from 3600s to 7200s (2 hours). Fixes #496 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fusals ACP agents (Claude Code, Codex) refuse prompts with coercive/threatening language like "will not be tolerated, success will be rewarded". Use a neutral variant for ACP agents that conveys the same intent without triggering safety filters. Regular agent prompts are unchanged. Fixes #495 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For ACP agents, the regular tool_calls count only shows OpenHands ActionEvents (typically just 1 "finish" action), which is misleading. This adds a separate count for ACPToolCallEvent entries that capture the actual Claude Code tool calls. Output now shows: tool_calls=1 (acp:47) for ACP agents Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add extract_acp_model_hint() to strip litellm_proxy and provider prefixes from model strings. All 5 benchmark runners now pass acp_model to ACPAgent so the selected model flows through to the ACP server session. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The function strips prefixes for any provider, not just Anthropic. Update docstring to match actual behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update vendor SDK to 524e918e (adds acp_model field, agent_name/version properties) - Rename test_result to test_cmd_result in commit0 to fix CommandResult/dict type collision - Fix test_setup_acp_workspace to match current base64 execute_command implementation - Add type narrowing asserts in test_acp.py for Pyright strict mode - Apply ruff format/lint fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Opus 4.6 with extended thinking can take >30 min for complex tasks, causing premature timeouts with the previous 1800s (30 min) default. Increase to 3600s (1 hour) across all ACP-compatible benchmarks. Changes: - Add ACP_PROMPT_TIMEOUT constant in benchmarks/utils/acp.py - Pass acp_prompt_timeout to ACPAgent in all 5 ACP-compatible benchmarks: gaia, commit0, swebench, swtbench, swebenchmultimodal Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve ACP benchmark runner conflicts by keeping ACP agent support on the ACP path, preserving condenser support on standard agents, and retaining the PR branch SDK pin required for ACPAgent features. Co-authored-by: openhands <openhands@all-hands.dev>
Add compatibility for older SDK image-build APIs while keeping benchmarks tests aligned with the current ACP SDK pin, and make timeout tests deterministic on CI runners. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands /codereview-roasted |
|
I'm on it! simonrosenberg can track my progress at all-hands.dev |
🔴 Needs improvement - Non-ACP Code Path ReviewI analyzed this PR for changes affecting the non-ACP (default agent) code path, as requested. Here are my findings: [CRITICAL ISSUES] (Must fix)[benchmarks/gaia/run_infer.py, Lines 320-345] 🐛 Condenser Support Removed The condenser feature is completely removed from the GAIA non-ACP agent path. The original code created a condenser when Original (main): condenser = None
if self.metadata.enable_condenser:
condenser = LLMSummarizingCondenser(...)
agent = Agent(
llm=self.metadata.llm,
tools=tools,
condenser=condenser, # ← Present
...
)New code (feat/acp-agent): else:
agent = Agent(
llm=self.metadata.llm,
tools=tools,
# ← NO condenser parameter!
mcp_config={...},
)This is a functional regression for non-ACP agents. The [benchmarks/commit0/run_infer.py, Line 365] The default # OLD: run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "3600"))
run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "7200"))PR description says this is for ACP agents with extended thinking, but this change affects the default agent code path too. If this is intentional, it should be documented. If it's meant only for ACP, the timeout increase should be conditional on [IMPROVEMENT OPPORTUNITIES] (Should consider)[benchmarks/utils/models.py, Line 105-112] 📦 Artifact Structure Change A new agent_type: Literal["default", "acp-claude", "acp-codex"] = Field(default="default", ...)This will add [VERIFIED: No Impact]The following changes were verified to NOT affect non-ACP behavior:
VERDICT:❌ Needs rework - The GAIA condenser regression must be fixed before merge. KEY INSIGHT: |
SummaryI performed a code review on PR #440 ("Add ACP agent support") to analyze whether it introduces functional/behavioral changes to the non-ACP agent code path. Findings:🔴 Critical Issues Found:
|
Summary
Add ACP (Agent Communication Protocol) agent support to all benchmarks, enabling evaluation with Claude Code and Codex CLI agents instead of the default LLM-based agent.
Core Changes
--agent-typeCLI argument: Added to all benchmark runners with optionsdefault,acp-claude,acp-codexbenchmarks/utils/acp.py) with shared helpers:is_acp_agent()- check if agent type is ACP-basedget_acp_command()- return the ACP command for agent typeget_acp_forward_env()- forward required API keys (+ BASE_URLs for LiteLLM proxy routing)extract_acp_model_hint()- strip LiteLLM proxy prefixes to get bare model IDssetup_acp_workspace()- configure Claude Code settings.json for headless modeworkspace_keepalive()- context manager to prevent runtime idle terminationagent_typefield added toEvalMetadataFixes & Improvements
workspace_keepalivecontext manager that pings every 60stool_calls=1 (acp:47))acp_agent_nameandacp_agent_versionincluded in eval output*_BASE_URLenv vars alongside API keysLMNR_*env vars to ACP subprocessacp_modelfieldTesting
tests/test_acp.pyacp_modelfield,agent_name/agent_versionproperties)Test Plan
--agent-type acp-claudeon 1 instance--agent-type acp-codexon 1 instance--agent-type acp-claudeon 1 instance--agent-type acp-claudeon 1 instance--agent-type defaultstill works (no regression)acp_agent_name,acp_agent_version) is captured in eval output🤖 Generated with Claude Code