Add ACP agent support (Claude Code, Codex) for all benchmarks by simonrosenberg · Pull Request #440 · OpenHands/benchmarks

simonrosenberg · 2026-02-23T21:19:35Z

Summary

Add ACP (Agent Communication Protocol) agent support to all benchmarks, enabling evaluation with Claude Code and Codex CLI agents instead of the default LLM-based agent.

Core Changes

New --agent-type CLI argument: Added to all benchmark runners with options default, acp-claude, acp-codex
New ACP utility module (benchmarks/utils/acp.py) with shared helpers:
- is_acp_agent() - check if agent type is ACP-based
- get_acp_command() - return the ACP command for agent type
- get_acp_forward_env() - forward required API keys (+ BASE_URLs for LiteLLM proxy routing)
- extract_acp_model_hint() - strip LiteLLM proxy prefixes to get bare model IDs
- setup_acp_workspace() - configure Claude Code settings.json for headless mode
- workspace_keepalive() - context manager to prevent runtime idle termination
ACP support in all 5 benchmarks: swebench, swtbench, gaia, commit0, swebenchmultimodal
agent_type field added to EvalMetadata

Fixes & Improvements

Workspace keepalive: ACP agents don't call workspace during thinking → runtime idle kill. Fixed with workspace_keepalive context manager that pings every 60s
Neutral GAIA prompt: ACP agents refused prompts with coercive language ("will not be tolerated"). Use neutral variant for ACP
Increased timeouts:
- ACP prompt timeout: 3600s (1 hour) for extended thinking models
- Commit0 conversation timeout: 7200s (2 hours) for complex tasks
Console logging: Display ACP tool call count separately (e.g., tool_calls=1 (acp:47))
ACP metadata capture: acp_agent_name and acp_agent_version included in eval output
LiteLLM proxy support: Forward *_BASE_URL env vars alongside API keys
Laminar tracing: Forward LMNR_* env vars to ACP subprocess
Model passthrough: Extract model hint and pass to ACPAgent via acp_model field
Shell injection fix: Use base64 encoding for Claude settings file write
swebenchmultimodal report.json bug: Fixed pre-existing bug where report_path was overwritten

Testing

21 tests added for ACP utility functions in tests/test_acp.py
Vendor SDK updated to 524e918e (adds acp_model field, agent_name/agent_version properties)

Test Plan

Run SWE-bench eval with --agent-type acp-claude on 1 instance
Run SWE-bench eval with --agent-type acp-codex on 1 instance
Run GAIA eval with --agent-type acp-claude on 1 instance
Run commit0 eval with --agent-type acp-claude on 1 instance
Verify default --agent-type default still works (no regression)
Verify ACP agent metadata (acp_agent_name, acp_agent_version) is captured in eval output

🤖 Generated with Claude Code

Add --agent-type CLI argument to swebench-infer. When agent_type='acp', uses ACPAgent (Claude Code via ACP protocol) instead of the default LLM-based Agent. Forwards ANTHROPIC_API_KEY to the runtime container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ACPAgent (Claude Code via ACP) completes the full task in a single conversation turn and doesn't emit FinishAction events. This caused two issues: 1. fake_user_response loop sent 10 "please continue" messages 2. AgentFinishedCritic marked the result as failed (no FinishAction) Fix: set max_fake_responses=0 for ACP mode and use empty_patch_critic which only requires a non-empty git patch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ACPAgent" This reverts commit 193dd94.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Claude Code ACP requires settings to allow tool permissions in headless mode. Without this, the Write tool hangs waiting for interactive permission approval. Create ~/.claude/settings.json on the runtime before starting the conversation to allow Edit, Read, and Bash tools. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add 'acp-codex' to agent_type choices in models and args parser - Use codex-acp command for acp-codex agent type - Forward OPENAI_API_KEY for codex containers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Aligns naming with 'acp-codex': both now follow the acp-{provider} pattern, making the agent type self-documenting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add acp_agent_name and acp_agent_version to test_result when using ACPAgent, enabling tracking of which ACP server and version produced each evaluation result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update acp_command in run_infer.py and vendor SDK to latest commit which fixes the bypass mode resolution and renames all references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

all-hands-bot

🟡 Acceptable - Feature works but has several design issues and missing tests that should be addressed.

Linus-Style Verdict: The core idea is sound, but the implementation has too many magic strings and fragile shell operations. A benchmarking framework needs rock-solid reproducibility, and this PR introduces several potential failure points.

benchmarks/swebench/run_infer.py

benchmarks/utils/models.py

benchmarks/swebench/run_infer.py

Move the ACP agent API key forwarding logic from swebench/run_infer.py into benchmarks/utils/acp.py so it can be reused by other benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ests - Expand benchmarks/utils/acp.py with is_acp_agent(), get_acp_command(), setup_acp_workspace(), and env-var existence validation - Fix shell injection in Claude settings.json write (use file_upload) - Make ACP command mapping explicit (raises ValueError for unknown types) - Replace all inline ACP checks in swebench/run_infer.py with shared utils - Add 21 tests covering all ACP utility functions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

all-hands-bot

🟢 Good taste - Clean abstraction, data-driven design, previous issues properly addressed.

Linus-Style Verdict: This is how you refactor. The centralized acp.py module with data-driven dicts (_ACP_COMMANDS, _ACP_ENV_VARS) eliminates magic strings and makes adding new providers trivial. Shell injection fixed properly with tempfile. The 21 tests cover edge cases including tuple inputs, missing env vars, and immutability guarantees. Backward compatible with agent_type="default".

The code does exactly what it needs to do without over-engineering. Approved.

The TODO comment about enabling condenser and security analyzer was accidentally removed during the ACP agent refactoring. Restoring it with a link to #463 for tracking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Include LMNR_ENV_VARS (LMNR_PROJECT_API_KEY, LMNR_SPAN_CONTEXT) in the forward_env for ACP agents so Laminar tracing context propagates to the ACP subprocess. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace file_upload with execute_command using base64 encoding to work around 500 errors when uploading to paths with tilde expansion. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…benchmarks For each benchmark: - Import ACP utilities and ACPAgent - Forward ACP API key env vars in prepare_workspace - Conditionally create ACPAgent in evaluate_instance - Setup ACP workspace (Claude settings.json) - Capture ACP agent metadata in test results - Pass agent_type to EvalMetadata Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nation ACP agents (Claude Code, Codex) use their own built-in tools and don't make calls to the workspace while processing. The runtime management system considers the workspace idle after ~20 min and terminates it. This causes evaluation failures for long-running instances (gaia, commit0). Fix: Add workspace_keepalive context manager that periodically runs a no-op command on the workspace to prevent idle termination. Applied to all 5 benchmark evaluate_instance methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The eval_infer.py was overwriting the actual report_path (e.g. OpenHands.{run_id}.json) with a hardcoded "report.json" that doesn't exist, causing the evaluation to fail with FileNotFoundError. Fix by using the actual report path from run_swebench_multimodal_evaluation() and moving the copy inside the `if report_path:` guard. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update ACP utilities to forward both *_API_KEY and *_BASE_URL env vars. This enables routing Claude Code and Codex through the LiteLLM proxy when ANTHROPIC_BASE_URL/OPENAI_BASE_URL are set. Changes: - acp.py: _ACP_ENV_VARS now maps to a list of env vars (key + base URL) - acp.py: get_acp_forward_env forwards both API key and base URL - tests: Updated to check for both env vars Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Resolve merge conflicts: - benchmarks/swebench/run_infer.py: Combined ACP imports with main's new imports (IMAGE_TAG_PREFIX, add_prompt_path_argument, ensure_local_image, etc.) - benchmarks/swtbench/run_infer.py: Same import combination strategy - benchmarks/swebenchmultimodal/run_infer.py: Same import combination strategy - benchmarks/swebenchmultimodal/eval_infer.py: Dropped broken shutil.copy code (had missing import), used main's cleaner implementation - benchmarks/commit0/run_infer.py: Combined ACP imports with main's refactored code - vendor/software-agent-sdk: Used main's commit (bde715c)

Previously, workspace_keepalive waited for the full interval (300s) before sending the first ping. This caused ACP agents to hit runtime idle timeouts when: 1. The workspace was already partially idle when keepalive started 2. Network issues caused a single ping to fail Changes: - Send immediate ping on context entry to reset idle timer - Reduce default interval from 300s to 60s for more robust keepalive - Restructure loop to ping first, then wait (instead of wait first) - Add debug logging for ping success/failure to aid troubleshooting Fixes: #488 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Commit0 tasks are significantly more complex than other benchmarks (full repo implementations) and ACP agents regularly exceed the 30-min / 1-hour timeout. Increase the default from 3600s to 7200s (2 hours). Fixes #496 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…fusals ACP agents (Claude Code, Codex) refuse prompts with coercive/threatening language like "will not be tolerated, success will be rewarded". Use a neutral variant for ACP agents that conveys the same intent without triggering safety filters. Regular agent prompts are unchanged. Fixes #495 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

For ACP agents, the regular tool_calls count only shows OpenHands ActionEvents (typically just 1 "finish" action), which is misleading. This adds a separate count for ACPToolCallEvent entries that capture the actual Claude Code tool calls. Output now shows: tool_calls=1 (acp:47) for ACP agents Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add extract_acp_model_hint() to strip litellm_proxy and provider prefixes from model strings. All 5 benchmark runners now pass acp_model to ACPAgent so the selected model flows through to the ACP server session. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The function strips prefixes for any provider, not just Anthropic. Update docstring to match actual behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Update vendor SDK to 524e918e (adds acp_model field, agent_name/version properties) - Rename test_result to test_cmd_result in commit0 to fix CommandResult/dict type collision - Fix test_setup_acp_workspace to match current base64 execute_command implementation - Add type narrowing asserts in test_acp.py for Pyright strict mode - Apply ruff format/lint fixes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Claude Opus 4.6 with extended thinking can take >30 min for complex tasks, causing premature timeouts with the previous 1800s (30 min) default. Increase to 3600s (1 hour) across all ACP-compatible benchmarks. Changes: - Add ACP_PROMPT_TIMEOUT constant in benchmarks/utils/acp.py - Pass acp_prompt_timeout to ACPAgent in all 5 ACP-compatible benchmarks: gaia, commit0, swebench, swtbench, swebenchmultimodal Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Resolve ACP benchmark runner conflicts by keeping ACP agent support on the ACP path, preserving condenser support on standard agents, and retaining the PR branch SDK pin required for ACPAgent features. Co-authored-by: openhands <openhands@all-hands.dev>

Add compatibility for older SDK image-build APIs while keeping benchmarks tests aligned with the current ACP SDK pin, and make timeout tests deterministic on CI runners. Co-authored-by: openhands <openhands@all-hands.dev>

simonrosenberg · 2026-03-16T11:11:36Z

@OpenHands /codereview-roasted
Does this PR introduce any functional/behavioral changes to the non-ACP agent code path? It should not. Flag any changes in configuration / artifacts structure / any behavior that would be mutated

openhands-ai · 2026-03-16T11:12:02Z

I'm on it! simonrosenberg can track my progress at all-hands.dev

simonrosenberg · 2026-03-16T11:14:14Z

🔴 Needs improvement - Non-ACP Code Path Review

I analyzed this PR for changes affecting the non-ACP (default agent) code path, as requested. Here are my findings:

[CRITICAL ISSUES] (Must fix)

[benchmarks/gaia/run_infer.py, Lines 320-345] 🐛 Condenser Support Removed

The condenser feature is completely removed from the GAIA non-ACP agent path. The original code created a condenser when enable_condenser was set, but the new else block for default agents does not include condenser support.

Original (main):

condenser = None
if self.metadata.enable_condenser:
    condenser = LLMSummarizingCondenser(...)

agent = Agent(
    llm=self.metadata.llm,
    tools=tools,
    condenser=condenser,  # ← Present
    ...
)

New code (feat/acp-agent):

else:
    agent = Agent(
        llm=self.metadata.llm,
        tools=tools,
        # ← NO condenser parameter!
        mcp_config={...},
    )

This is a functional regression for non-ACP agents. The --enable-condenser flag will be silently ignored for GAIA evaluations.

[benchmarks/commit0/run_infer.py, Line 365] ⚠️ Default Timeout Changed

The default CONVERSATION_TIMEOUT was changed from 3600 (1 hour) to 7200 (2 hours) for ALL agent types, not just ACP:

# OLD: run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "3600"))
run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "7200"))

PR description says this is for ACP agents with extended thinking, but this change affects the default agent code path too. If this is intentional, it should be documented. If it's meant only for ACP, the timeout increase should be conditional on is_acp_agent().

[IMPROVEMENT OPPORTUNITIES] (Should consider)

[benchmarks/utils/models.py, Line 105-112] 📦 Artifact Structure Change

A new agent_type field is added to EvalMetadata:

agent_type: Literal["default", "acp-claude", "acp-codex"] = Field(default="default", ...)

This will add "agent_type": "default" to all output JSONs. While backward compatible, this changes the artifact structure and should be documented as a known change.

[VERIFIED: No Impact]

The following changes were verified to NOT affect non-ACP behavior:

✅ console_logging.py: total_tool_calls = n_tool_calls + n_acp_tool_calls is equivalent when n_acp_tool_calls=0
✅ workspace_keepalive(): Pass-through no-op for non-ACP agents
✅ setup_acp_workspace(): No-op for agent_type="default"
✅ get_acp_forward_env(): Returns input unchanged for non-ACP agents
✅ swebench, swtbench, swebenchmultimodal, commit0: Condenser preserved in non-ACP path
✅ swefficiency: Refactor of sched_getaffinity is functionally equivalent

VERDICT:

❌ Needs rework - The GAIA condenser regression must be fixed before merge.

KEY INSIGHT:
When refactoring code into if/else branches for feature flags, all existing functionality in the original code must be preserved in the "default" branch.

openhands-ai · 2026-03-16T11:14:35Z

Summary

I performed a code review on PR #440 ("Add ACP agent support") to analyze whether it introduces functional/behavioral changes to the non-ACP agent code path.

Findings:

🔴 Critical Issues Found:

GAIA Condenser Regression (benchmarks/gaia/run_infer.py, Lines 320-345):
- The condenser feature was completely removed from the non-ACP agent path
- The original code created a condenser when enable_condenser was set, but the refactored else block for default agents does not include condenser support
- The --enable-condenser flag will be silently ignored for GAIA evaluations
Commit0 Timeout Change (benchmarks/commit0/run_infer.py, Line 365):
- Default CONVERSATION_TIMEOUT changed from 3600s to 7200s for ALL agent types, not just ACP
- This affects the default agent code path behavior

⚠️ Artifact Structure Change:

New agent_type field added to EvalMetadata - adds "agent_type": "default" to all output JSONs (backward compatible)

✅ Verified No Impact:

Console logging, workspace_keepalive, setup_acp_workspace, get_acp_forward_env, swebench/swtbench/swebenchmultimodal/commit0 condenser handling, swefficiency refactor - all preserve non-ACP behavior

No code changes were made - this was a review-only task. My review comment was posted to the PR at:
#440 (comment)

This reverts commit efcbfee.

simonrosenberg and others added 10 commits February 23, 2026 18:16

Revert "fix: skip fake user responses and use empty_patch_critic for …

107a5e8

…ACPAgent" This reverts commit 193dd94.

chore: update vendor SDK to d72fd08a (ACPAgent FinishAction)

98bedff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: update vendor SDK to fb04adc (managed settings fix)

2ed8761

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: update vendor SDK to 9910c88 (bypassPermissions fix)

f8158e5

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add acp-codex agent type for Codex ACP evaluation

eed513b

- Add 'acp-codex' to agent_type choices in models and args parser - Use codex-acp command for acp-codex agent type - Forward OPENAI_API_KEY for codex containers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rename agent_type 'acp' to 'acp-claude' for consistency

88e95fe

Aligns naming with 'acp-codex': both now follow the acp-{provider} pattern, making the agent type self-documenting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(swebench): include ACP agent metadata in eval output

fbfbe16

Add acp_agent_name and acp_agent_version to test_result when using ACPAgent, enabling tracking of which ACP server and version produced each evaluation result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

simonrosenberg marked this pull request as draft February 24, 2026 12:36

chore: rename claude-code-acp to claude-agent-acp

1a2d669

Update acp_command in run_infer.py and vendor SDK to latest commit which fixes the bypass mode resolution and renames all references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

simonrosenberg marked this pull request as ready for review February 27, 2026 12:13

all-hands-bot reviewed Feb 27, 2026

View reviewed changes

simonrosenberg and others added 2 commits February 27, 2026 09:17

refactor: extract ACP env forwarding into shared utility

05a70e2

Move the ACP agent API key forwarding logic from swebench/run_infer.py into benchmarks/utils/acp.py so it can be reused by other benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

simonrosenberg requested a review from all-hands-bot February 27, 2026 12:33

all-hands-bot approved these changes Feb 27, 2026

View reviewed changes

simonrosenberg mentioned this pull request Feb 27, 2026

Enable condenser and security analyzer as configurable options in EvalMetadata #463

Open

simonrosenberg changed the title ~~Add ACPAgent support for SWE-bench evaluation~~ Add ACP agent support (Claude Code, Codex) for SWE-bench evaluation Feb 27, 2026

simonrosenberg and others added 7 commits March 3, 2026 15:31

Forward Laminar env vars to ACP agents

c86b3a2

Include LMNR_ENV_VARS (LMNR_PROJECT_API_KEY, LMNR_SPAN_CONTEXT) in the forward_env for ACP agents so Laminar tracing context propagates to the ACP subprocess. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix Claude settings file writing using base64 encoding

72f2e4f

Replace file_upload with execute_command using base64 encoding to work around 500 errors when uploading to paths with tilde expansion. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

simonrosenberg mentioned this pull request Mar 6, 2026

ACP Agent: workspace_keepalive not preventing all 'Remote conversation ended with error' failures #488

Open

simonrosenberg and others added 7 commits March 9, 2026 17:09

fix: clarify extract_acp_model_hint docstring

2af291d

The function strips prefixes for any provider, not just Anthropic. Update docstring to match actual behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

simonrosenberg changed the title ~~Add ACP agent support (Claude Code, Codex) for SWE-bench evaluation~~ Add ACP agent support (Claude Code, Codex) for all benchmarks Mar 10, 2026

simonrosenberg and others added 4 commits March 13, 2026 18:45

Fix CI after merging main into feat/acp-agent

1e3a3b2

Add compatibility for older SDK image-build APIs while keeping benchmarks tests aligned with the current ACP SDK pin, and make timeout tests deterministic on CI runners. Co-authored-by: openhands <openhands@all-hands.dev>

Restore non-ACP benchmark behavior

720c5ae

Revert image build drift from ACP branch

440434d

simonrosenberg added 5 commits March 16, 2026 08:20

Revert non-functional commit0 rename

ca1d0f5

Preserve default metadata shape for ACP support

efcbfee

Revert "Preserve default metadata shape for ACP support"

79d37be

This reverts commit efcbfee.

Merge remote-tracking branch 'origin/main' into pr-440

96d0294

Fix ACP typing for pre-commit

6f506d6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ACP agent support (Claude Code, Codex) for all benchmarks#440

Add ACP agent support (Claude Code, Codex) for all benchmarks#440
simonrosenberg wants to merge 38 commits intomainfrom
feat/acp-agent

simonrosenberg commented Feb 23, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot left a comment

Uh oh!

simonrosenberg commented Mar 16, 2026

Uh oh!

openhands-ai bot commented Mar 16, 2026

Uh oh!

simonrosenberg commented Mar 16, 2026

Uh oh!

openhands-ai bot commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

simonrosenberg commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Core Changes

Fixes & Improvements

Testing

Test Plan

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

simonrosenberg commented Mar 16, 2026

Uh oh!

openhands-ai bot commented Mar 16, 2026

Uh oh!

simonrosenberg commented Mar 16, 2026

🔴 Needs improvement - Non-ACP Code Path Review

[CRITICAL ISSUES] (Must fix)

[IMPROVEMENT OPPORTUNITIES] (Should consider)

[VERIFIED: No Impact]

VERDICT:

Uh oh!

openhands-ai bot commented Mar 16, 2026

Summary

Findings:

🔴 Critical Issues Found:

⚠️ Artifact Structure Change:

✅ Verified No Impact:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

simonrosenberg commented Feb 23, 2026 •

edited

Loading