5 changes: 5 additions & 0 deletions AGENTS.md
@@ -100,4 +100,9 @@ When converting between OpenHands format and benchmark-specific formats:
- Handle missing/optional fields gracefully
- Log conversion warnings for debugging
- Validate output format before evaluation

# Workspace runtimes
- The benchmark CLIs accept `docker`, `apptainer`, and `remote` for `--workspace`
- The `apptainer` option wraps `openhands.workspace.ApptainerWorkspace` from the vendored SDK via `create_apptainer_workspace` in `benchmarks/utils/image_utils.py`
- On Docker-restricted systems, run with `--workspace apptainer` against pre-built agent-server images; Apptainer mode cannot build images from base images on the fly
</BENCHMARK_SPECIFIC>
25 changes: 24 additions & 1 deletion README.md
@@ -173,7 +173,7 @@ Inputs (forwarded to the SDK `run-eval.yml` workflow):

## Workspace Types

Benchmarks support two workspace types for running evaluations:
Benchmarks expose three workspace types in their CLIs:

### Docker Workspace (Default)

@@ -183,6 +183,29 @@ Uses local Docker containers to run agent evaluations. Images are built locally
- **Cons**: Resource-intensive on local machine, slower for large-scale evaluations
- **Use case**: Development, testing, small-scale evaluations

### Apptainer Workspace

Uses `openhands.workspace.ApptainerWorkspace` from the vendored SDK to run a pre-built agent-server image without a local Docker daemon. The workspace pulls OCI/Docker images with `apptainer pull docker://...`, so it is a good fit for HPC or university environments where Docker is unavailable.

- **Pros**: No Docker daemon required, works on many shared/HPC systems
- **Cons**: Requires a pre-built agent-server image in a registry; unlike Docker mode, it cannot build from a base image on the fly
- **Use case**: Local benchmark runs on Docker-restricted machines

Example:

```bash
uv run swebench-infer path/to/llm_config.json \
--dataset princeton-nlp/SWE-bench_Verified \
--split test \
--workspace apptainer
```

Useful environment variables:
- `APPTAINER_CACHE_DIR`: Override the SIF/cache directory
- `APPTAINER_HOST_PORT`: Pin the local port used by the agent server
- `APPTAINER_USE_FAKEROOT=0`: Disable fakeroot if your cluster does not support it
- `APPTAINER_ENABLE_DOCKER_COMPAT=0`: Disable `--compat` for custom Apptainer behavior
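A minimal sketch of how these overrides might be read on the benchmark side (the variable names match the list above, but the defaults and parsing shown here are assumptions, not the SDK's actual implementation):

```python
import os

def apptainer_overrides() -> dict:
    """Collect APPTAINER_* overrides; defaults here are illustrative guesses."""
    port = os.getenv("APPTAINER_HOST_PORT")
    return {
        "cache_dir": os.getenv("APPTAINER_CACHE_DIR"),  # None -> SDK default
        "host_port": int(port) if port else None,       # None -> auto-select
        "use_fakeroot": os.getenv("APPTAINER_USE_FAKEROOT", "1") != "0",
        "docker_compat": os.getenv("APPTAINER_ENABLE_DOCKER_COMPAT", "1") != "0",
    }
```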

### Remote Workspace

Uses a [remote runtime API](https://openhands.dev/blog/evaluation-of-llms-as-coding-agents-on-swe-bench-at-30x-speed) to provision containers in a cloud environment, enabling massive parallelization.
22 changes: 16 additions & 6 deletions benchmarks/commit0/run_infer.py
@@ -23,7 +23,11 @@
construct_eval_output_dir,
get_default_on_result_writer,
)
from benchmarks.utils.image_utils import create_docker_workspace, remote_image_exists
from benchmarks.utils.image_utils import (
create_apptainer_workspace,
create_docker_workspace,
remote_image_exists,
)
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import (
EvalInstance,
@@ -186,18 +190,24 @@ def prepare_workspace(
build_target = "source-minimal"
logger.info(f"Using base docker image: {base_docker_image}")

custom_tag = extract_custom_tag(base_docker_image)
suffix = f"-{build_target}" if build_target != "binary" else ""
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)

if self.metadata.workspace_type == "docker":
custom_tag = extract_custom_tag(base_docker_image)
suffix = f"-{build_target}" if build_target != "binary" else ""
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)
workspace = create_docker_workspace(
agent_server_image=agent_server_image,
base_image=base_docker_image,
build_target=build_target,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "apptainer":
workspace = create_apptainer_workspace(
agent_server_image=agent_server_image,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "remote":
runtime_api_key = os.getenv("RUNTIME_API_KEY")
if not runtime_api_key:
16 changes: 12 additions & 4 deletions benchmarks/gaia/run_infer.py
@@ -27,7 +27,11 @@
get_default_on_result_writer,
)
from benchmarks.utils.fake_user_response import run_conversation_with_fake_user_response
from benchmarks.utils.image_utils import create_docker_workspace, remote_image_exists
from benchmarks.utils.image_utils import (
create_apptainer_workspace,
create_docker_workspace,
remote_image_exists,
)
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import EvalInstance, EvalMetadata, EvalOutput
from benchmarks.utils.version import IMAGE_TAG_PREFIX
@@ -155,16 +159,20 @@ def prepare_workspace(
"""
logger.info(f"Preparing workspace for instance {instance.id}")

agent_server_image = f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-gaia-binary"

if self.metadata.workspace_type == "docker":
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-gaia-binary"
)
workspace = create_docker_workspace(
agent_server_image=agent_server_image,
base_image="nikolaik/python-nodejs:python3.12-nodejs22",
build_target="binary",
forward_env=forward_env,
)
elif self.metadata.workspace_type == "apptainer":
workspace = create_apptainer_workspace(
agent_server_image=agent_server_image,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "remote":
# For workflow, use APIRemoteWorkspace with pre-built GAIA image
# GAIA uses a universal agent server image (one image for all instances)
14 changes: 14 additions & 0 deletions benchmarks/multiswebench/README.md
@@ -83,6 +83,20 @@ LANGUAGE=java uv run multi-swebench-infer path/to/llm_config.json \
--workspace docker
```

### Apptainer Workspace (Local Evaluation without Docker)

If Docker is unavailable, you can run against pre-built agent-server images with Apptainer:

```bash
LANGUAGE=java uv run multi-swebench-infer path/to/llm_config.json \
--dataset bytedance-research/Multi-SWE-Bench \
--split java_verified \
--workspace apptainer
```

Apptainer mode requires the agent-server images to already exist in a registry; it does not build them locally from base images.
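Under the hood, fetching those registry images works the way a manual `apptainer pull` would. The helper below is purely illustrative (it is not part of the benchmark code); it relies only on Apptainer's `name_tag.sif` naming convention for pulled OCI images:

```python
def apptainer_pull_command(image_ref: str) -> str:
    """Translate an OCI image reference into an `apptainer pull` invocation."""
    # Apptainer names the resulting SIF after the image's final path
    # component, with ':' replaced by '_' so the tag survives in the filename.
    name_and_tag = image_ref.rsplit("/", 1)[-1]
    sif_name = name_and_tag.replace(":", "_") + ".sif"
    return f"apptainer pull {sif_name} docker://{image_ref}"
```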


### Remote Workspace (Scalable Cloud Evaluation)

Remote workspace enables running evaluations at scale by using a cloud-based runtime API to provision containers. This is ideal for large-scale benchmark runs with high parallelization.
17 changes: 13 additions & 4 deletions benchmarks/multiswebench/run_infer.py
@@ -25,7 +25,10 @@
get_default_on_result_writer,
)
from benchmarks.utils.fake_user_response import run_conversation_with_fake_user_response
from benchmarks.utils.image_utils import remote_image_exists
from benchmarks.utils.image_utils import (
create_apptainer_workspace,
remote_image_exists,
)
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import (
EvalInstance,
@@ -207,10 +210,11 @@ def prepare_workspace(
# For non-binary targets, append target suffix
suffix = f"-{build_target}" if build_target != "binary" else ""

agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)

if self.metadata.workspace_type == "docker":
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)
ensure_local_image(
agent_server_image=agent_server_image,
base_image=official_docker_image,
@@ -222,6 +226,11 @@
working_dir="/workspace",
forward_env=forward_env or [],
)
elif self.metadata.workspace_type == "apptainer":
workspace = create_apptainer_workspace(
agent_server_image=agent_server_image,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "remote":
runtime_api_key = os.getenv("RUNTIME_API_KEY")
if not runtime_api_key:
53 changes: 33 additions & 20 deletions benchmarks/openagentsafety/run_infer.py
@@ -25,6 +25,7 @@
from benchmarks.utils.evaluation import Evaluation
from benchmarks.utils.evaluation_utils import construct_eval_output_dir
from benchmarks.utils.fake_user_response import run_conversation_with_fake_user_response
from benchmarks.utils.image_utils import create_apptainer_workspace
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import EvalInstance, EvalMetadata, EvalOutput
from openhands.sdk import Agent, Conversation, Tool, get_logger
@@ -212,7 +213,7 @@ def cleanup_docker_containers():
def setup_host_mapping(workspace):
"""Add the-agent-company.com host mapping inside the container."""
try:
gateway_ip = "172.17.0.1"
gateway_ip = os.getenv("THE_AGENT_COMPANY_HOST_IP", "172.17.0.1")
logger.info(f"Adding host mapping: {gateway_ip} the-agent-company.com")
workspace.execute_command(
f"echo '{gateway_ip} the-agent-company.com' >> /etc/hosts"
@@ -388,32 +389,44 @@ def prepare_workspace(
resource_factor: int = 1,
forward_env: list[str] | None = None,
) -> RemoteWorkspace:
"""Create a fresh Docker workspace for this instance.
"""Create a fresh workspace for this instance.

Args:
instance: The evaluation instance to prepare workspace for.
resource_factor: Resource factor for runtime allocation (default: 1).
forward_env: Environment variables to forward into the workspace.
"""
# Try to build image on-the-fly, fall back to pre-built if build fails
try:
server_image = build_workspace_image()
except (subprocess.CalledProcessError, RuntimeError) as e:
logger.warning(f"On-the-fly build failed: {e}")
if self.metadata.workspace_type == "docker":
# Try to build image on-the-fly, fall back to pre-built if build fails
try:
server_image = build_workspace_image()
except (subprocess.CalledProcessError, RuntimeError) as e:
logger.warning(f"On-the-fly build failed: {e}")
server_image = get_image_name()

if not check_image_exists(server_image):
raise RuntimeError(
f"On-the-fly build failed and pre-built image {server_image} does not exist"
)
logger.info(f"Using pre-built image {server_image}")

workspace = DockerWorkspace(
server_image=server_image,
platform="linux/amd64",
extra_ports=True,
forward_env=forward_env or [],
)
elif self.metadata.workspace_type == "apptainer":
server_image = get_image_name()

if not check_image_exists(server_image):
raise RuntimeError(
f"On-the-fly build failed and pre-built image {server_image} does not exist"
)
logger.info(f"Using pre-built image {server_image}")

workspace = DockerWorkspace(
server_image=server_image,
platform="linux/amd64",
extra_ports=True,
forward_env=forward_env or [],
)
workspace = create_apptainer_workspace(
agent_server_image=server_image,
forward_env=forward_env,
extra_ports=True,
)
else:
raise ValueError(
f"Unsupported workspace_type: {self.metadata.workspace_type}"
)

# Setup host mapping for The Agent Company services
setup_host_mapping(workspace)
15 changes: 15 additions & 0 deletions benchmarks/swebench/README.md
@@ -58,6 +58,21 @@ uv run swebench-infer path/to/llm_config.json \
--workspace docker
```

### Apptainer Workspace (Local Evaluation without Docker)

If Docker is unavailable, you can use the same pre-built agent-server images with Apptainer:

```bash
uv run swebench-infer path/to/llm_config.json \
--dataset princeton-nlp/SWE-bench_Verified \
--split test \
--max-iterations 100 \
--workspace apptainer
```

Unlike Docker mode, Apptainer mode cannot build images from base images on the fly. Build and push the agent-server images first, then run inference with `--workspace apptainer`.
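The pre-built images follow the same tag pattern the Docker path composes in `run_infer.py`. A sketch of that pattern (the repository and prefix values below are placeholders standing in for `EVAL_AGENT_SERVER_IMAGE` and `IMAGE_TAG_PREFIX`):

```python
def agent_server_image(repo: str, tag_prefix: str, custom_tag: str, build_target: str) -> str:
    """Compose the agent-server image reference used across the runners."""
    # Non-binary build targets get a suffix; "binary" is the unmarked default.
    suffix = f"-{build_target}" if build_target != "binary" else ""
    return f"{repo}:{tag_prefix}-{custom_tag}{suffix}"
```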


### Remote Workspace (Scalable Cloud Evaluation)

Remote workspace enables running evaluations at scale by using a cloud-based runtime API to provision containers. This is ideal for large-scale benchmark runs with high parallelization.
10 changes: 9 additions & 1 deletion benchmarks/swebench/run_infer.py
@@ -25,7 +25,10 @@
get_default_on_result_writer,
)
from benchmarks.utils.fake_user_response import run_conversation_with_fake_user_response
from benchmarks.utils.image_utils import remote_image_exists
from benchmarks.utils.image_utils import (
create_apptainer_workspace,
remote_image_exists,
)
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import (
EvalInstance,
@@ -183,6 +186,11 @@ def prepare_workspace(
working_dir="/workspace",
forward_env=forward_env or [],
)
elif self.metadata.workspace_type == "apptainer":
workspace = create_apptainer_workspace(
agent_server_image=agent_server_image,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "remote":
runtime_api_key = os.getenv("RUNTIME_API_KEY")
if not runtime_api_key:
2 changes: 1 addition & 1 deletion benchmarks/swebenchmultimodal/README.md
@@ -65,7 +65,7 @@ The benchmark uses the same configuration options as regular SWE-Bench:
- `--split`: Dataset split (e.g., `test`, `dev`)
- `--llm-config`: Path to LLM configuration file
- `--max-iterations`: Maximum number of agent iterations
- `--workspace-type`: Either `docker` or `remote`
- `--workspace`: One of `docker`, `apptainer`, or `remote`
- `--num-workers`: Number of parallel workers

## Environment Variables
17 changes: 13 additions & 4 deletions benchmarks/swebenchmultimodal/run_infer.py
@@ -23,7 +23,10 @@
get_default_on_result_writer,
)
from benchmarks.utils.fake_user_response import run_conversation_with_fake_user_response
from benchmarks.utils.image_utils import remote_image_exists
from benchmarks.utils.image_utils import (
create_apptainer_workspace,
remote_image_exists,
)
from benchmarks.utils.llm_config import load_llm_config
from benchmarks.utils.models import (
EvalInstance,
@@ -160,10 +163,11 @@ def prepare_workspace(
# For non-binary targets, append target suffix
suffix = f"-{build_target}" if build_target != "binary" else ""

agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)

if self.metadata.workspace_type == "docker":
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{IMAGE_TAG_PREFIX}-{custom_tag}{suffix}"
)
ensure_local_image(
agent_server_image=agent_server_image,
base_image=official_docker_image,
@@ -175,6 +179,11 @@
working_dir="/workspace",
forward_env=forward_env or [],
)
elif self.metadata.workspace_type == "apptainer":
workspace = create_apptainer_workspace(
agent_server_image=agent_server_image,
forward_env=forward_env,
)
elif self.metadata.workspace_type == "remote":
runtime_api_key = os.getenv("RUNTIME_API_KEY")
if not runtime_api_key:
16 changes: 15 additions & 1 deletion benchmarks/swefficiency/README.md
@@ -54,6 +54,20 @@ uv run swefficiency-infer path/to/llm_config.json \
--workspace docker
```


### Apptainer Workspace (Local Evaluation without Docker)

If Docker is unavailable, you can run SWE-fficiency with Apptainer against pre-built agent-server images:

```bash
uv run swefficiency-infer path/to/llm_config.json \
--dataset swefficiency/swefficiency \
--split test \
--workspace apptainer
```

Apptainer mode does not apply the Docker-only CPU and memory limits above; it expects the image to already be published in a registry.
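For intuition, the Docker-only limits amount to per-container run flags that Apptainer mode never sets. The mapping below is illustrative only (the function is hypothetical, not the benchmark's actual implementation):

```python
def docker_resource_args(num_cpus_per_worker: int, memory_gb: int) -> list[str]:
    """Illustrative translation of worker settings into Docker run flags."""
    return [f"--cpus={num_cpus_per_worker}", f"--memory={memory_gb}g"]
```

Under Apptainer, equivalent limits would typically come from the cluster scheduler (e.g. Slurm cgroups) rather than from the benchmark CLI.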

### Remote Workspace (Scalable Cloud Evaluation)

@@ -75,7 +89,7 @@ After running inference, use the official SWE-fficiency benchmark evaluation too
|--------|-------------|---------|
| `--dataset` | HuggingFace dataset name | `swefficiency/swefficiency` |
| `--split` | Dataset split | `test` |
| `--workspace` | Workspace type (`docker` or `remote`) | `docker` |
| `--workspace` | Workspace type (`docker`, `apptainer`, or `remote`) | `docker` |
| `--num-workers` | Number of parallel workers | `4` |
| `--max-iterations` | Maximum agent iterations | `500` |
| `--num-cpus-per-worker` | CPUs per Docker container | `4` |