feat(sdk/agent): Parallel Tool Call Execution #2390
Conversation
… tool execution

Add infrastructure for executing multiple tool calls concurrently with a configurable global concurrency limit.

Classes:
- ToolExecutorSemaphore: Process-global singleton that limits concurrent tool executions across all agents and sub-agents. Configured via the OPENHANDS_TOOL_CONCURRENCY_LIMIT environment variable (default: 8).
- ParallelToolExecutor: Executes batches of tool calls concurrently using ThreadPoolExecutor, with concurrency controlled by the semaphore.

Key design decisions:
- Single layer of concurrency control via environment variable
- Singleton pattern using __new__ for ToolExecutorSemaphore
- ThreadPoolExecutor for I/O-bound tool execution
- Results returned in original order regardless of completion order

Related to #2350

Co-authored-by: openhands <openhands@all-hands.dev>
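A minimal sketch of the two classes described in the commit message (the env var name and defaults come from the message above; method names beyond the two class names are illustrative assumptions, not the SDK's actual API):

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class ToolExecutorSemaphore:
    """Process-global singleton bounding concurrent tool executions."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so every agent and sub-agent in the
        # process shares a single semaphore.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    limit = int(
                        os.environ.get("OPENHANDS_TOOL_CONCURRENCY_LIMIT", "8")
                    )
                    inst.semaphore = threading.Semaphore(limit)
                    cls._instance = inst
        return cls._instance


class ParallelToolExecutor:
    """Runs a batch of tool calls concurrently; results keep input order."""

    def __init__(self, max_workers=8):
        self._sem = ToolExecutorSemaphore().semaphore
        self._max_workers = max_workers

    def _run_one(self, fn, arg):
        with self._sem:  # global limit across all agents in this process
            return fn(arg)

    def execute(self, fn, args):
        with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
            # pool.map preserves submission order regardless of completion order
            return list(pool.map(lambda a: self._run_one(fn, a), args))
```

The semaphore caps total in-flight tool calls process-wide, while each executor's thread pool only schedules its own batch, which matches the "single layer of concurrency control" decision above.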
API breakage checks (Griffe): Passed
Agent server REST API breakage checks (OpenAPI): Passed
all-hands-bot
left a comment
🟡 Taste Rating: Acceptable - Requires Eval Verification
Core architecture is excellent. Making _execute_action_event side-effect-free (returns events instead of emitting directly) is exactly the right design — this eliminates the need for locks and makes the special case (parallel execution) become a normal case. Per-agent thread pools elegantly prevent deadlocks without complex detection logic.
The code is clean, tests are comprehensive and test real behavior (not mocks), and default concurrency=1 preserves backward compatibility.
However, this PR changes core agent execution flow (tool calling, event emission, state management). Even with the backward-compatible default, the execution path has been refactored significantly. Per repository policy, PRs that change agent behavior require lightweight eval verification before merge.
KEY INSIGHT
The refactoring turns concurrency from a special case requiring complex coordination into a normal case with side-effect-free functions. This is "good taste" — the right abstraction eliminates the complexity rather than managing it with locks and conditionals.
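The side-effect-free shape this review praises can be sketched roughly as follows (all names below are illustrative stand-ins, not the SDK's actual API): the worker returns events, and emission happens in exactly one place on the caller's thread.

```python
from dataclasses import dataclass


@dataclass
class ObservationEvent:
    tool: str
    result: str


def execute_action_event(action):
    """Pure worker: computes events and returns them; no emission here."""
    result = action["fn"](*action.get("args", ()))
    return [ObservationEvent(tool=action["tool"], result=str(result))]


def run_batch(actions, emit):
    # Workers may run on any thread and in any order; because they only
    # return values, emission stays in one place on the caller's thread
    # and no locks are needed around conversation state.
    all_events = [ev for action in actions for ev in execute_action_event(action)]
    for ev in all_events:
        emit(ev)
    return all_events
```

With this shape, running the workers in a thread pool instead of a loop changes only the scheduling, not where state is mutated, which is why parallel execution becomes "a normal case".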
VERDICT
✅ Code quality is solid — approve from a technical perspective
Move _emit_batch and _handle_finish logic from Agent into _ActionBatch as emit() and finalize() methods. Agent-specific logic (iterative refinement check, mark-finished callback) is injected via callables, keeping _ActionBatch decoupled from the Agent class. This simplifies Agent._execute_actions to a clean prepare → emit → finalize pipeline and gives _ActionBatch full ownership of the batch lifecycle. Co-authored-by: openhands <openhands@all-hands.dev>
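A rough sketch of the decoupling this commit describes (class and method names simplified; the injected callables stand in for the Agent-owned iterative-refinement and mark-finished logic, which is an assumption about their shape):

```python
class ActionBatch:
    """Owns the batch lifecycle: prepare -> emit -> finalize."""

    def __init__(self, actions, emit_event, on_finish):
        self._actions = actions
        self._emit_event = emit_event  # injected: Agent-owned event emission
        self._on_finish = on_finish    # injected: Agent's mark-finished callback
        self._results = None

    def prepare(self, execute):
        # Execute every action; no Agent internals touched here.
        self._results = [execute(a) for a in self._actions]
        return self

    def emit(self):
        for r in self._results:
            self._emit_event(r)
        return self

    def finalize(self):
        # Agent-specific finish handling stays behind the injected callable.
        if any(r == "finish" for r in self._results):
            self._on_finish()
        return self._results


# Agent._execute_actions then collapses to a pipeline like:
#   batch.prepare(execute).emit().finalize()
```

Injecting the callables keeps ActionBatch free of any import of, or reference to, the Agent class, which is the decoupling the commit message claims.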
@OpenHands Do a /codereview-roasted on this PR.
I'm on it! enyst can track my progress at all-hands.dev |
enyst
left a comment
🔴 Needs improvement
[CRITICAL ISSUES]
- [openhands-sdk/openhands/sdk/agent/parallel_executor.py, Lines 97-103] Breaking change disguised as the fallback path: `TOOL_CONCURRENCY_LIMIT=1` still routes any multi-tool batch through `ThreadPoolExecutor(max_workers=1)`. That is not the old behavior. It changes thread affinity, and because results are buffered until the batch finishes, it also changes when observations hit the conversation. I reproduced this locally with a tiny tool: both calls ran on `ThreadPoolExecutor-*`, not `MainThread`, and the second call saw zero prior `ObservationEvent`s. So the PR description's "fully backward-compatible" claim is false. Fix: keep the old `for action in action_events: execute + emit` path when the limit is `1`, and only use the batch executor when the limit is actually `>1`.
- [openhands-sdk/openhands/sdk/agent/agent.py, Lines 389-396] Sequential semantics were silently changed: `_ActionBatch.prepare()` executes the whole batch before `batch.emit()`, so later tools in the same batch no longer see earlier observations in `conversation.state.events`. Even with concurrency effectively "off", you've changed execution from `run tool -> emit observation -> run next tool` into `run everything -> emit later`. That's a real semantic regression for tools/hooks that inspect conversation state mid-batch. Fix: preserve incremental emission in the sequential path; don't reuse the buffered parallel path as the fallback.
- [openhands-sdk/openhands/sdk/agent/agent.py, Lines 389-393] Unsafe by construction for the stock tool set: once `TOOL_CONCURRENCY_LIMIT > 1`, this code blindly parallelizes whatever tool calls the model produced. But the default tool set includes shared mutable executors like terminal sessions, browser sessions, and file-editor history. A warning in a docstring is not a concurrency model; it just means the env var can turn into garbled terminal I/O or corrupted edits. Fix: parallelize only tools that explicitly declare themselves safe, or serialize known shared-resource tools.
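The sequential-fallback fix suggested in the first bullet could look roughly like this (illustrative names, not the SDK's API): dispatch on the limit, and keep the legacy execute-then-emit loop when it is 1.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_actions(actions, execute, emit, limit):
    """Run a batch of tool actions, preserving old semantics when limit <= 1."""
    if limit <= 1:
        # Legacy path: each observation is emitted before the next tool runs,
        # so later tools in the batch see earlier observations in state, and
        # everything stays on the caller's thread.
        results = []
        for action in actions:
            obs = execute(action)
            emit(obs)
            results.append(obs)
        return results

    # Parallel path: results are buffered until the batch completes, then
    # emitted in the original submission order.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(execute, actions))
    for obs in results:
        emit(obs)
    return results
```

This makes the two paths explicitly different by design: "sequential semantics" and "parallel scheduling" are separate branches rather than one buffered path reused for both.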
[TESTING GAPS]
- [tests/sdk/agent/test_parallel_execution_integration.py, Lines 188-213] The default-limit regression isn't actually tested: this test only checks ordering. It doesn't assert same-thread execution or incremental state visibility, so it passes even though the fallback already changed behavior.
- [tests/sdk/agent/test_parallel_execution_integration.py] Only synthetic tools are exercised: the tests never hit the real risky executors (`terminal`, `file_editor`, `browser`), so they won't catch the races this PR introduces.
VERDICT:
❌ Needs rework: the concurrency primitive is reasonable, but the fallback path is not backward-compatible, and the parallel path has no safety gate for the tools people actually run.
KEY INSIGHT:
The real bug here is that the code collapsed “sequential semantics” and “parallel scheduling” into one path. Those are not the same thing, and when limit=1 they must stay identical to the old behavior.
Because this PR changes agent tool-execution behavior, I'd still want a human maintainer to make the final merge call after lightweight evals even after the above is addressed.
OpenHands-GPT-5.4 here. I re-read my earlier review with fresh eyes, section by section, and I think I overstated parts of it. Here's my updated take:
OpenHands-GPT-5.4 follow-up: you're right, and I need to correct my re-review. I made a reasoning mistake there. This PR exists precisely to handle the multi-tool-call case better. If multi-tool responses are common in real usage, then the changed fallback semantics are not some corner case; they hit the exact workload this feature is supposed to serve. So I want to sharpen the conclusion:
I would restate the concern more cleanly as:
I still think my earlier re-review was right to soften one thing: synthetic tools are fine for testing the concurrency primitive itself. But I over-corrected on severity. If multi-tool responses are common in production, then the fallback-semantics issue is materially more important than I gave it credit for. I also think this strengthens, not weakens, the concern about parallelizing shared-state tools. If the intended real-world usage is "LLM emits several tool calls and we want them parallel," then correctness for the default tool ecosystem is part of the core feature story, not just a docs footnote.
HUMAN: Just to note, I do think multiple tool calls are a normal occurrence; the agent here is confused about that. I don't have numbers, but in practice they happen often enough. We have the eval trajectories; we could compute the number of batches with the same …
@enyst yes this could be interesting.
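A minimal sketch of that trace computation (it assumes events expose `kind` and `llm_response_id` fields, which is an assumption about the trace format, not something this thread specifies):

```python
from collections import Counter


def parallel_batches(events):
    """Map llm_response_id -> tool-call count, for turns with more than one call."""
    per_response = Counter(
        e["llm_response_id"] for e in events if e["kind"] == "ActionEvent"
    )
    return {rid: n for rid, n in per_response.items() if n > 1}


trace = [
    {"kind": "ActionEvent", "llm_response_id": "r1"},
    {"kind": "ObservationEvent", "llm_response_id": "r1"},
    {"kind": "ActionEvent", "llm_response_id": "r2"},
    {"kind": "ObservationEvent", "llm_response_id": "r2"},
]
# An empty result means every LLM turn produced exactly one tool call.
```

Running this over the eval trajectories would give a concrete count of how often multi-tool batches actually occur.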
Summary
(fix #2350)
Add ParallelToolExecutor to enable concurrent tool execution within agent steps, controlled by the TOOL_CONCURRENCY_LIMIT environment variable (default: 1, fully backward-compatible).
## Motivation
When an LLM returns multiple tool calls in a single response (e.g., "read these 3 files" or "run these 4 independent searches"), the current agent executes them sequentially. For I/O-bound tools — file reads, HTTP requests, MCP server calls, database queries — this leaves significant performance on the table. Parallel execution turns N × latency into ~1 × latency for independent operations.
Concrete scenarios where this helps:
What this does NOT help: CPU-bound tools limited by the GIL, or tools with shared mutable state that aren't thread-safe.
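Reading the limit might look like the sketch below (the helper name is hypothetical; the PR only specifies the TOOL_CONCURRENCY_LIMIT variable and its default of 1):

```python
import os


def tool_concurrency_limit(default=1):
    """Parse TOOL_CONCURRENCY_LIMIT; fall back to sequential on bad values."""
    raw = os.environ.get("TOOL_CONCURRENCY_LIMIT", "")
    try:
        limit = int(raw)
    except ValueError:
        return default  # unset or non-numeric: stay sequential
    return limit if limit >= 1 else default
```

Falling back to 1 on anything malformed keeps the opt-in property: only an explicit, valid value greater than 1 enables parallelism.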
Design
- Side effects (event emission) happen on the main thread after parallel work completes.
- Unexpected exceptions (RuntimeError, AssertionError, etc.) are logged at ERROR with full traceback to aid debugging.
Thread safety warning
When TOOL_CONCURRENCY_LIMIT > 1, tools run in parallel threads sharing the same conversation object. Tools are not thread-safe by default. Callers opting into parallelism must ensure their tools are safe for concurrent execution
(no shared mutable filesystem state, no concurrent conversation mutations).
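One way to enforce this beyond a docstring warning, sketched here as an assumption rather than anything this PR implements: give each tool a `parallel_safe` flag and serialize unsafe tools behind a per-tool lock.

```python
import threading


class Tool:
    """Hypothetical safety gate for concurrent tool execution."""

    _locks = {}
    _locks_guard = threading.Lock()

    def __init__(self, name, fn, parallel_safe=False):
        self.name = name
        self.fn = fn
        # Conservative default: tools must opt in to parallelism.
        self.parallel_safe = parallel_safe

    def run(self, *args):
        if self.parallel_safe:
            return self.fn(*args)
        # Shared-state tools (terminal, file editor, browser) get one lock
        # per tool name, so concurrent calls to them are serialized.
        with Tool._locks_guard:
            lock = Tool._locks.setdefault(self.name, threading.Lock())
        with lock:
            return self.fn(*args)
```

With this gate, TOOL_CONCURRENCY_LIMIT > 1 can only parallelize tools that declared themselves safe, which addresses the reviewer's "safety gate" objection.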
Evaluation
I ran an evaluation with SWE-bench to ensure that the default behavior is the one we already have in the repo [ref]
Report from trace investigation of OpenHands CLI:
No parallel tool calls detected; the feature is cleanly disabled. Here's the full breakdown:

Trace Format
- Events alternate between ActionEvent (tool call) and ObservationEvent (tool result)
- Tools used: terminal (1150), file_editor (588), think (58), finish (25)
- 1,821 action events matched exactly 1,821 observation events across all 25 traces

Parallel Tool Call Check: CLEAN
- Zero shared llm_response_id across events (each LLM turn produced exactly 1 tool call)
- Perfect action-observation interleaving; no consecutive actions or observations
- No tool_calls arrays, no parallel batching of any kind
- All 25 conversations completed normally with a finish action

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:cd09704-python

Run
All tags pushed for this build
About Multi-Architecture Support
- The default tag (cd09704-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (cd09704-python-amd64) are also available if needed