Skip to content

Add ACP agent support (Claude Code, Codex) for all benchmarks#440

Open
simonrosenberg wants to merge 38 commits intomainfrom
feat/acp-agent
Open

Add ACP agent support (Claude Code, Codex) for all benchmarks#440
simonrosenberg wants to merge 38 commits intomainfrom
feat/acp-agent

Conversation

@simonrosenberg
Copy link
Collaborator

@simonrosenberg simonrosenberg commented Feb 23, 2026

Summary

Add ACP (Agent Communication Protocol) agent support to all benchmarks, enabling evaluation with Claude Code and Codex CLI agents instead of the default LLM-based agent.

Core Changes

  • New --agent-type CLI argument: Added to all benchmark runners with options default, acp-claude, acp-codex
  • New ACP utility module (benchmarks/utils/acp.py) with shared helpers:
    • is_acp_agent() - check if agent type is ACP-based
    • get_acp_command() - return the ACP command for agent type
    • get_acp_forward_env() - forward required API keys (+ BASE_URLs for LiteLLM proxy routing)
    • extract_acp_model_hint() - strip LiteLLM proxy prefixes to get bare model IDs
    • setup_acp_workspace() - configure Claude Code settings.json for headless mode
    • workspace_keepalive() - context manager to prevent runtime idle termination
  • ACP support in all 5 benchmarks: swebench, swtbench, gaia, commit0, swebenchmultimodal
  • agent_type field added to EvalMetadata

Fixes & Improvements

  • Workspace keepalive: ACP agents don't call workspace during thinking → runtime idle kill. Fixed with workspace_keepalive context manager that pings every 60s
  • Neutral GAIA prompt: ACP agents refused prompts with coercive language ("will not be tolerated"). Use neutral variant for ACP
  • Increased timeouts:
    • ACP prompt timeout: 3600s (1 hour) for extended thinking models
    • Commit0 conversation timeout: 7200s (2 hours) for complex tasks
  • Console logging: Display ACP tool call count separately (e.g., tool_calls=1 (acp:47))
  • ACP metadata capture: acp_agent_name and acp_agent_version included in eval output
  • LiteLLM proxy support: Forward *_BASE_URL env vars alongside API keys
  • Laminar tracing: Forward LMNR_* env vars to ACP subprocess
  • Model passthrough: Extract model hint and pass to ACPAgent via acp_model field
  • Shell injection fix: Use base64 encoding for Claude settings file write
  • swebenchmultimodal report.json bug: Fixed pre-existing bug where report_path was overwritten

Testing

  • 21 tests added for ACP utility functions in tests/test_acp.py
  • Vendor SDK updated to 524e918e (adds acp_model field, agent_name/agent_version properties)

Test Plan

  • Run SWE-bench eval with --agent-type acp-claude on 1 instance
  • Run SWE-bench eval with --agent-type acp-codex on 1 instance
  • Run GAIA eval with --agent-type acp-claude on 1 instance
  • Run commit0 eval with --agent-type acp-claude on 1 instance
  • Verify default --agent-type default still works (no regression)
  • Verify ACP agent metadata (acp_agent_name, acp_agent_version) is captured in eval output

🤖 Generated with Claude Code

simonrosenberg and others added 10 commits February 23, 2026 18:16
Add --agent-type CLI argument to swebench-infer. When agent_type='acp',
uses ACPAgent (Claude Code via ACP protocol) instead of the default
LLM-based Agent. Forwards ANTHROPIC_API_KEY to the runtime container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ACPAgent (Claude Code via ACP) completes the full task in a single
conversation turn and doesn't emit FinishAction events. This caused
two issues:
1. fake_user_response loop sent 10 "please continue" messages
2. AgentFinishedCritic marked the result as failed (no FinishAction)

Fix: set max_fake_responses=0 for ACP mode and use empty_patch_critic
which only requires a non-empty git patch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code ACP requires settings to allow tool permissions in
headless mode. Without this, the Write tool hangs waiting for
interactive permission approval.

Create ~/.claude/settings.json on the runtime before starting
the conversation to allow Edit, Read, and Bash tools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 'acp-codex' to agent_type choices in models and args parser
- Use codex-acp command for acp-codex agent type
- Forward OPENAI_API_KEY for codex containers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns naming with 'acp-codex': both now follow the
acp-{provider} pattern, making the agent type self-documenting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add acp_agent_name and acp_agent_version to test_result when using
ACPAgent, enabling tracking of which ACP server and version produced
each evaluation result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg marked this pull request as draft February 24, 2026 12:36
Update acp_command in run_infer.py and vendor SDK to latest commit
which fixes the bypass mode resolution and renames all references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg marked this pull request as ready for review February 27, 2026 12:13
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Feature works but has several design issues and missing tests that should be addressed.

Linus-Style Verdict: The core idea is sound, but the implementation has too many magic strings and fragile shell operations. A benchmarking framework needs rock-solid reproducibility, and this PR introduces several potential failure points.

simonrosenberg and others added 2 commits February 27, 2026 09:17
Move the ACP agent API key forwarding logic from swebench/run_infer.py
into benchmarks/utils/acp.py so it can be reused by other benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests

- Expand benchmarks/utils/acp.py with is_acp_agent(), get_acp_command(),
  setup_acp_workspace(), and env-var existence validation
- Fix shell injection in Claude settings.json write (use file_upload)
- Make ACP command mapping explicit (raises ValueError for unknown types)
- Replace all inline ACP checks in swebench/run_infer.py with shared utils
- Add 21 tests covering all ACP utility functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean abstraction, data-driven design, previous issues properly addressed.

Linus-Style Verdict: This is how you refactor. The centralized acp.py module with data-driven dicts (_ACP_COMMANDS, _ACP_ENV_VARS) eliminates magic strings and makes adding new providers trivial. Shell injection fixed properly with tempfile. The 21 tests cover edge cases including tuple inputs, missing env vars, and immutability guarantees. Backward compatible with agent_type="default".

The code does exactly what it needs to do without over-engineering. Approved.

The TODO comment about enabling condenser and security analyzer was
accidentally removed during the ACP agent refactoring. Restoring it
with a link to #463 for tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg changed the title Add ACPAgent support for SWE-bench evaluation Add ACP agent support (Claude Code, Codex) for SWE-bench evaluation Feb 27, 2026
simonrosenberg and others added 7 commits March 3, 2026 15:31
Include LMNR_ENV_VARS (LMNR_PROJECT_API_KEY, LMNR_SPAN_CONTEXT) in the
forward_env for ACP agents so Laminar tracing context propagates to
the ACP subprocess.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace file_upload with execute_command using base64 encoding to work
around 500 errors when uploading to paths with tilde expansion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…benchmarks

For each benchmark:
- Import ACP utilities and ACPAgent
- Forward ACP API key env vars in prepare_workspace
- Conditionally create ACPAgent in evaluate_instance
- Setup ACP workspace (Claude settings.json)
- Capture ACP agent metadata in test results
- Pass agent_type to EvalMetadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nation

ACP agents (Claude Code, Codex) use their own built-in tools and don't make
calls to the workspace while processing. The runtime management system
considers the workspace idle after ~20 min and terminates it. This causes
evaluation failures for long-running instances (gaia, commit0).

Fix: Add workspace_keepalive context manager that periodically runs a no-op
command on the workspace to prevent idle termination. Applied to all 5
benchmark evaluate_instance methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval_infer.py was overwriting the actual report_path
(e.g. OpenHands.{run_id}.json) with a hardcoded "report.json"
that doesn't exist, causing the evaluation to fail with FileNotFoundError.

Fix by using the actual report path from run_swebench_multimodal_evaluation()
and moving the copy inside the `if report_path:` guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update ACP utilities to forward both *_API_KEY and *_BASE_URL env vars.
This enables routing Claude Code and Codex through the LiteLLM proxy
when ANTHROPIC_BASE_URL/OPENAI_BASE_URL are set.

Changes:
- acp.py: _ACP_ENV_VARS now maps to a list of env vars (key + base URL)
- acp.py: get_acp_forward_env forwards both API key and base URL
- tests: Updated to check for both env vars

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve merge conflicts:
- benchmarks/swebench/run_infer.py: Combined ACP imports with main's new imports
  (IMAGE_TAG_PREFIX, add_prompt_path_argument, ensure_local_image, etc.)
- benchmarks/swtbench/run_infer.py: Same import combination strategy
- benchmarks/swebenchmultimodal/run_infer.py: Same import combination strategy
- benchmarks/swebenchmultimodal/eval_infer.py: Dropped broken shutil.copy code
  (had missing import), used main's cleaner implementation
- benchmarks/commit0/run_infer.py: Combined ACP imports with main's refactored code
- vendor/software-agent-sdk: Used main's commit (bde715c)
Previously, workspace_keepalive waited for the full interval (300s) before
sending the first ping. This caused ACP agents to hit runtime idle timeouts
when:
1. The workspace was already partially idle when keepalive started
2. Network issues caused a single ping to fail

Changes:
- Send immediate ping on context entry to reset idle timer
- Reduce default interval from 300s to 60s for more robust keepalive
- Restructure loop to ping first, then wait (instead of wait first)
- Add debug logging for ping success/failure to aid troubleshooting

Fixes: #488

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
simonrosenberg and others added 7 commits March 9, 2026 17:09
Commit0 tasks are significantly more complex than other benchmarks
(full repo implementations) and ACP agents regularly exceed the 30-min
/ 1-hour timeout. Increase the default from 3600s to 7200s (2 hours).

Fixes #496

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fusals

ACP agents (Claude Code, Codex) refuse prompts with coercive/threatening
language like "will not be tolerated, success will be rewarded". Use a
neutral variant for ACP agents that conveys the same intent without
triggering safety filters. Regular agent prompts are unchanged.

Fixes #495

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For ACP agents, the regular tool_calls count only shows OpenHands
ActionEvents (typically just 1 "finish" action), which is misleading.
This adds a separate count for ACPToolCallEvent entries that capture
the actual Claude Code tool calls.

Output now shows: tool_calls=1 (acp:47) for ACP agents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add extract_acp_model_hint() to strip litellm_proxy and provider
prefixes from model strings. All 5 benchmark runners now pass
acp_model to ACPAgent so the selected model flows through to the
ACP server session.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The function strips prefixes for any provider, not just Anthropic.
Update docstring to match actual behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update vendor SDK to 524e918e (adds acp_model field, agent_name/version properties)
- Rename test_result to test_cmd_result in commit0 to fix CommandResult/dict type collision
- Fix test_setup_acp_workspace to match current base64 execute_command implementation
- Add type narrowing asserts in test_acp.py for Pyright strict mode
- Apply ruff format/lint fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Opus 4.6 with extended thinking can take >30 min for complex
tasks, causing premature timeouts with the previous 1800s (30 min)
default. Increase to 3600s (1 hour) across all ACP-compatible benchmarks.

Changes:
- Add ACP_PROMPT_TIMEOUT constant in benchmarks/utils/acp.py
- Pass acp_prompt_timeout to ACPAgent in all 5 ACP-compatible benchmarks:
  gaia, commit0, swebench, swtbench, swebenchmultimodal

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@simonrosenberg simonrosenberg changed the title Add ACP agent support (Claude Code, Codex) for SWE-bench evaluation Add ACP agent support (Claude Code, Codex) for all benchmarks Mar 10, 2026
simonrosenberg and others added 4 commits March 13, 2026 18:45
Resolve ACP benchmark runner conflicts by keeping ACP agent support on the ACP path, preserving condenser support on standard agents, and retaining the PR branch SDK pin required for ACPAgent features.

Co-authored-by: openhands <openhands@all-hands.dev>
Add compatibility for older SDK image-build APIs while keeping benchmarks tests aligned with the current ACP SDK pin, and make timeout tests deterministic on CI runners.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg
Copy link
Collaborator Author

@OpenHands /codereview-roasted
Does this PR introduce any functional/behavioral changes to the non-ACP agent code path? It should not. Flag any changes in configuration / artifacts structure / any behavior that would be mutated

@openhands-ai
Copy link

openhands-ai bot commented Mar 16, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Copy link
Collaborator Author

🔴 Needs improvement - Non-ACP Code Path Review

I analyzed this PR for changes affecting the non-ACP (default agent) code path, as requested. Here are my findings:


[CRITICAL ISSUES] (Must fix)

[benchmarks/gaia/run_infer.py, Lines 320-345] 🐛 Condenser Support Removed

The condenser feature is completely removed from the GAIA non-ACP agent path. The original code created a condenser when enable_condenser was set, but the new else block for default agents does not include condenser support.

Original (main):

condenser = None
if self.metadata.enable_condenser:
    condenser = LLMSummarizingCondenser(...)

agent = Agent(
    llm=self.metadata.llm,
    tools=tools,
    condenser=condenser,  # ← Present
    ...
)

New code (feat/acp-agent):

else:
    agent = Agent(
        llm=self.metadata.llm,
        tools=tools,
        # ← NO condenser parameter!
        mcp_config={...},
    )

This is a functional regression for non-ACP agents. The --enable-condenser flag will be silently ignored for GAIA evaluations.


[benchmarks/commit0/run_infer.py, Line 365] ⚠️ Default Timeout Changed

The default CONVERSATION_TIMEOUT was changed from 3600 (1 hour) to 7200 (2 hours) for ALL agent types, not just ACP:

# OLD: run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "3600"))
run_timeout = int(os.getenv("CONVERSATION_TIMEOUT", "7200"))

PR description says this is for ACP agents with extended thinking, but this change affects the default agent code path too. If this is intentional, it should be documented. If it's meant only for ACP, the timeout increase should be conditional on is_acp_agent().


[IMPROVEMENT OPPORTUNITIES] (Should consider)

[benchmarks/utils/models.py, Line 105-112] 📦 Artifact Structure Change

A new agent_type field is added to EvalMetadata:

agent_type: Literal["default", "acp-claude", "acp-codex"] = Field(default="default", ...)

This will add "agent_type": "default" to all output JSONs. While backward compatible, this changes the artifact structure and should be documented as a known change.


[VERIFIED: No Impact]

The following changes were verified to NOT affect non-ACP behavior:

  • console_logging.py: total_tool_calls = n_tool_calls + n_acp_tool_calls is equivalent when n_acp_tool_calls=0
  • workspace_keepalive(): Pass-through no-op for non-ACP agents
  • setup_acp_workspace(): No-op for agent_type="default"
  • get_acp_forward_env(): Returns input unchanged for non-ACP agents
  • swebench, swtbench, swebenchmultimodal, commit0: Condenser preserved in non-ACP path
  • swefficiency: Refactor of sched_getaffinity is functionally equivalent

VERDICT:

Needs rework - The GAIA condenser regression must be fixed before merge.

KEY INSIGHT:
When refactoring code into if/else branches for feature flags, all existing functionality in the original code must be preserved in the "default" branch.

@openhands-ai
Copy link

openhands-ai bot commented Mar 16, 2026

Summary

I performed a code review on PR #440 ("Add ACP agent support") to analyze whether it introduces functional/behavioral changes to the non-ACP agent code path.

Findings:

🔴 Critical Issues Found:

  1. GAIA Condenser Regression (benchmarks/gaia/run_infer.py, Lines 320-345):

    • The condenser feature was completely removed from the non-ACP agent path
    • The original code created a condenser when enable_condenser was set, but the refactored else block for default agents does not include condenser support
    • The --enable-condenser flag will be silently ignored for GAIA evaluations
  2. Commit0 Timeout Change (benchmarks/commit0/run_infer.py, Line 365):

    • Default CONVERSATION_TIMEOUT changed from 3600s to 7200s for ALL agent types, not just ACP
    • This affects the default agent code path behavior

⚠️ Artifact Structure Change:

  • New agent_type field added to EvalMetadata - adds "agent_type": "default" to all output JSONs (backward compatible)

✅ Verified No Impact:

  • Console logging, workspace_keepalive, setup_acp_workspace, get_acp_forward_env, swebench/swtbench/swebenchmultimodal/commit0 condenser handling, swefficiency refactor - all preserve non-ACP behavior

No code changes were made - this was a review-only task. My review comment was posted to the PR at:
#440 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants