fix: cache SDK sdist across image builds by simonrosenberg · Pull Request #507 · OpenHands/benchmarks

simonrosenberg · 2026-03-12T12:11:51Z

Summary

- pre-build the SDK sdist once per image-build batch and reuse it across all workers
- intercept only the SDK's internal uv build --sdist subprocess call so the SDK still owns tag, cache, and Docker build behavior
- switch build-related modules to stdlib logging to avoid inheriting SDK Rich console state across worker forks
- add focused tests for the cached-sdist interception seam
- bump the SDK submodule to 8e8223b24dfd041875fe845b13326b0a1238b793
- set OPENHANDS_BUILDKIT_CACHE_MODE=off for the SWT image-build workflow
- remove the timeout-minutes: 180 limit from both SWE-Bench and SWT-Bench image-build workflows

Why this is the real fix

The first regression in this area was repeated SDK sdist construction across hundreds of image builds. This PR removes that repeated work in benchmarks without forking SDK behavior.

While validating that change on full SWT-Bench runs, the dominant remaining slowdown turned out to be registry cache export during cold fan-out image builds. The SDK PR adds cache export controls; this PR wires those controls into the SWT workflow and removes the workflow timeout that was cancelling otherwise healthy long-running image builds.

Because benchmarks still calls the SDK's native build() path, the repo keeps using SDK-owned tagging, Docker args, and cache semantics instead of duplicating them here.
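A minimal sketch of the build-once, reuse-everywhere idea described above. The helper names (`prebuild_sdist`, `reuse_sdist`) and the injectable `run` parameter are hypothetical and not taken from this PR; the only external command assumed is `uv build --sdist --out-dir <dir>`.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical runner type: takes a command and a working directory.
Runner = Callable[[Sequence[str], Path], None]


def _run_uv(cmd: Sequence[str], cwd: Path) -> None:
    """Default runner: execute the command and raise on a non-zero exit."""
    subprocess.run(list(cmd), cwd=cwd, check=True)


def prebuild_sdist(sdk_root: Path, out_dir: Path, run: Runner = _run_uv) -> Path:
    """Build the SDK sdist once and return the path of the resulting tarball."""
    out_dir.mkdir(parents=True, exist_ok=True)
    run(["uv", "build", "--sdist", "--out-dir", str(out_dir)], sdk_root)
    tarballs = sorted(out_dir.glob("*.tar.gz"))
    if len(tarballs) != 1:
        raise RuntimeError(
            f"expected exactly one sdist in {out_dir}, found {len(tarballs)}"
        )
    return tarballs[0]


def reuse_sdist(cached_sdist: Path, dest_dir: Path) -> Path:
    """Copy the cached sdist to where a per-image build expects to find it."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(cached_sdist, dest_dir / cached_sdist.name))
```

With this shape, a batch of N image builds pays for one `uv build --sdist` plus N file copies instead of N sdist builds.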

Validation

make build
uv run pytest tests/test_image_utils.py tests/test_llm_config.py
uv run pre-commit run --files benchmarks/utils/build_utils.py benchmarks/utils/buildx_utils.py benchmarks/utils/image_utils.py benchmarks/swebench/build_images.py benchmarks/swtbench/build_eval_env_images.py benchmarks/swtbench/image_utils.py tests/test_image_utils.py
full SWT-Bench 433-instance image build using this branch + SDK 8e8223b24dfd041875fe845b13326b0a1238b793:
- run: 23043936501
- result from logs: 433/433 images built, ❌ 0, total image-build time 4:10:33

Fixes #504

Build the SDK sdist once per batch and reuse it through the SDK's native build() path by intercepting only the internal uv sdist subprocess call. This removes the per-image sdist overhead while preserving tag, cache, and build-arg behavior for current and future SDK versions. Also switch build-related modules to stdlib logging so worker forks do not inherit SDK Rich console state. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

Taste Rating: 🟡 Acceptable - Solves a real performance problem, but the solution is more complex than it needs to be.

[CRITICAL ISSUES]

The core optimization is sound (build sdist once instead of N times), but the implementation adds unnecessary indirection that will make this harder to maintain.

[IMPROVEMENT OPPORTUNITIES]

See inline comments for specific architectural improvements that would simplify the data flow and reduce coupling to SDK internals.

VERDICT: ✅ Worth merging - Performance improvement is real and valuable. Fallback behavior prevents breaking changes. The complexity is manageable but could be simplified in follow-up work.

KEY INSIGHT: Monkey-patching SDK internals works but creates maintenance coupling. Consider upstreaming sdist caching to the SDK itself, or pass the pre-built sdist path explicitly through the call chain instead of via environment variable + runtime patching.

benchmarks/utils/build_utils.py

tests/test_image_utils.py

benchmarks/swebench/build_images.py

simonrosenberg · 2026-03-12T14:04:58Z

Addressed the review feedback in 65331d6.

Changes made:

- removed the env-var handoff and now pass cached_sdist explicitly through build_all_images() -> _iter_batch_results() -> _build_with_logging() -> build_image()
- tightened the interception predicate so cached-sdist reuse only applies to uv build --sdist calls whose cwd is the SDK root
- reduced nesting in build_all_images() by flattening the cached-sdist setup into the main with block and extracting batch execution into _iter_batch_results()
- added a stronger integration-style test against the real imported SDK build module in addition to the unit tests

I kept the patch-at-the-SDK-build-seam design because it still avoids duplicating the SDK build path in benchmarks, while making the data flow more explicit and defensive.
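For readers who want to see what the tightened predicate could look like, here is a sketch. The function names echo the test names quoted later in this thread, but the bodies are a guess rather than the code in this PR: a command counts only if it is shaped like `uv build --sdist ...` and its `cwd` resolves to the SDK root.

```python
from __future__ import annotations

from pathlib import Path
from typing import Sequence, Union

PathLike = Union[str, Path]


def is_sdist_build_command(cmd: Sequence[str]) -> bool:
    """True for commands shaped like `uv build --sdist ...`."""
    argv = [str(part) for part in cmd]
    return argv[:2] == ["uv", "build"] and "--sdist" in argv[2:]


def is_sdk_sdist_build_command(
    cmd: Sequence[str], cwd: PathLike | None, sdk_root: PathLike
) -> bool:
    """True only when the sdist build is being run from the SDK root itself."""
    if cwd is None or not is_sdist_build_command(cmd):
        return False
    # Resolve both sides so trailing slashes and symlinks do not matter.
    return Path(cwd).resolve() == Path(sdk_root).resolve()
```

Requiring the `cwd` match keeps an unrelated `uv build --sdist` call, for example one issued for a different package, from being served the SDK tarball.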

all-hands-bot

Taste Rating: 🟡 Acceptable - Solves a real performance problem with pragmatic engineering. The patching approach is more complex than ideal, but given the constraint of preserving SDK build logic, it's a reasonable trade-off.

benchmarks/utils/build_utils.py

simonrosenberg · 2026-03-12T14:20:06Z

@OpenHands /codereview-roasted
Review this PR as well as PR #505, which fixes the same issue slightly differently.
Are there any strong features in PR #505 that should be ported to this PR?

openhands-ai · 2026-03-12T14:20:36Z

I'm on it! simonrosenberg can track my progress at all-hands.dev

simonrosenberg · 2026-03-12T14:22:16Z

🔥 Linus-Style Code Review: PR #507 vs PR #505

Taste Rating: 🟢 Good taste (PR #507)

PR #507 demonstrates "good taste" - it makes the minimal change to solve the problem while keeping the SDK in control of its own build behavior. Let me explain why this matters.

The Core Difference

PR #507 (this PR): Patches the SDK's internal _run() helper to short-circuit only the uv build --sdist subprocess call. The SDK's build() is still called - it just gets a cached sdist copied instead of building a new one.

PR #505: Completely replaces the SDK's build() with ~100 lines of manually-constructed docker buildx build commands, tarball extraction, and Dockerfile copying.

Why PR #507 is the Right Design

1. "Don't Duplicate; Delegate"

PR #507:

with _patch_sdk_sdist_build(sdk_build_module, cached_sdist):
    tags = build(opts)  # SDK still owns all build logic

PR #505:

# Replicates 100 lines of SDK internals:
args = ["docker", "buildx", "build", "--file", ...]
args += ["--build-arg", f"BASE_IMAGE={opts.base_image}"]
args += ["--build-arg", f"OPENHANDS_BUILD_GIT_SHA={opts.git_sha}"]
# ... and so on

When the SDK adds a new build arg (say, --build-arg OPENHANDS_BUILD_FEATURES=...), PR #507 will automatically pick it up. PR #505 will silently diverge and produce subtly different images.

2. Future-Proofing via Graceful Fallback

PR #507 fails gracefully:

original_run = getattr(sdk_build_module, "_run", None)
if original_run is None:
    logger.warning("SDK build module has no _run helper; falling back to native build()")
    yield
    return

If the SDK refactors away _run, benchmarks falls back to the SDK's native behavior instead of exploding.
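To make the fallback concrete, here is a sketch of how the whole context manager might fit together. It is not the PR's `_patch_sdk_sdist_build`; the name `patch_sdist_build` is hypothetical, and it assumes the SDK helper has the shape `_run(cmd, cwd=None, **kwargs)`, that the SDK passes `--out-dir` on the sdist command, and that callers ignore the helper's return value. None of those details are confirmed in this thread.

```python
from __future__ import annotations

import logging
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Any, Iterator

logger = logging.getLogger(__name__)


@contextmanager
def patch_sdist_build(
    sdk_build_module: Any, cached_sdist: Path, sdk_root: Path
) -> Iterator[None]:
    """Temporarily serve `uv build --sdist` calls from a cached tarball."""
    original_run = getattr(sdk_build_module, "_run", None)
    if original_run is None:
        # Nothing to patch: let the SDK build its own sdist as usual.
        logger.warning("SDK build module has no _run helper; falling back to native build()")
        yield
        return

    def patched_run(cmd, cwd=None, **kwargs):
        argv = [str(part) for part in cmd]
        is_sdist = argv[:2] == ["uv", "build"] and "--sdist" in argv
        in_sdk = cwd is not None and Path(cwd).resolve() == sdk_root.resolve()
        if not (is_sdist and in_sdk and "--out-dir" in argv):
            # Every other command goes to the SDK's real helper untouched.
            return original_run(cmd, cwd=cwd, **kwargs)
        out_dir = Path(argv[argv.index("--out-dir") + 1])
        if not out_dir.is_absolute():
            out_dir = Path(cwd) / out_dir
        out_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached_sdist, out_dir / cached_sdist.name)
        return None  # Assumption: the return value is not used by the caller.

    sdk_build_module._run = patched_run
    try:
        yield
    finally:
        # Always restore the original helper, even if the build raised.
        sdk_build_module._run = original_run
```

The `try`/`finally` matters as much as the early return: the patch is scoped to a single `build()` call and cannot leak into later code in the same process.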

3. Explicit Parameter Passing > Hidden Environment Variables

PR #507 passes cached_sdist explicitly through the call chain:

def build_image(..., cached_sdist: Path | None = None):

PR #505 uses a hidden env var:

os.environ["OPENHANDS_CACHED_SDIST"] = str(cached_sdist_path)
# ... and later reads it implicitly in build_image()
cached_sdist = os.environ.get("OPENHANDS_CACHED_SDIST")

Explicit is better than implicit. The function signature documents the interface; environment variables create hidden coupling.

4. Testability

PR #507 has focused tests:

test_is_sdist_build_command_matches_expected_shape
test_is_sdk_sdist_build_command_requires_sdk_cwd
test_patch_sdk_sdist_build_reuses_cached_sdist_only_for_sdk_sdist_commands
test_build_image_reuses_cached_sdist_with_real_sdk_build_module

PR #505 has no tests for _build_with_cached_sdist() at all. That 100-line function with tarfile.open(), Dockerfile copying, and subprocess construction? Untested.

Strong Features from PR #505 Worth Porting

✅ Explanatory Comments for the Logging Switch

PR #505 has inline comments explaining why the stdlib logging switch matters:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).
import logging

PR #507 just switches the import without explaining the deadlock context. This comment should be added for future maintainers.

Recommendation: Add a brief comment to build_utils.py explaining the Rich/fork deadlock issue.

❌ Environment Variable Approach (Don't Port)

The OPENHANDS_CACHED_SDIST env var in PR #505 is actually an anti-pattern. PR #507's explicit parameter passing is cleaner.

❌ Manual buildx Command Construction (Definitely Don't Port)

This is exactly what creates the maintenance burden. Keep the SDK in control.

Minor Nits (Skip These - Linters Exist)

I don't care about import ordering or spacing. The code passes pre-commit.

VERDICT: ✅ Worth merging (PR #507)

The core logic is sound. PR #507 makes the minimal, surgical change to eliminate the repeated uv build --sdist overhead while keeping the SDK in control of everything else.

KEY INSIGHT: The difference between "good taste" and "bad taste" here is whether you trust the SDK to own its build behavior. PR #507 says "yes, just let me cache the expensive bit." PR #505 says "no, I'll reimplement your build logic myself." The first approach scales; the second creates a fork.

Suggested Follow-up (Non-Blocking)

Add this comment at the top of benchmarks/utils/build_utils.py:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).

This documents the "why" for future maintainers.

Ported from PR #505 - explains why openhands.sdk.get_logger is not used in build-related modules to avoid Rich console state deadlocks when ProcessPoolExecutor forks worker processes. Co-authored-by: openhands <openhands@all-hands.dev>

simonrosenberg · 2026-03-12T14:24:23Z

Ported the explanatory comments from PR #505 in 68bf988.

All six affected files now document why stdlib logging is used instead of openhands.sdk.get_logger (to avoid Rich console state deadlocks during ProcessPoolExecutor forks).

openhands-ai · 2026-03-12T14:25:21Z

Summary

I completed the code review comparing PR #507 (this PR) with PR #505, and ported the one strong feature from #505.

Code Review Posted

Full review: #507 (comment)
Verdict: 🟢 PR #507 has "good taste" - it makes a minimal, surgical change by patching only the SDK's _run() helper, while still letting the SDK own its build logic. PR #505's approach of duplicating ~100 lines of SDK build logic would create maintenance burden when the SDK changes.

Changes Made (Already Pushed)

Commit 68bf988: Added explanatory comments to 6 files explaining why stdlib logging is used instead of openhands.sdk.get_logger:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).

Files updated:

benchmarks/utils/build_utils.py
benchmarks/utils/buildx_utils.py
benchmarks/utils/image_utils.py
benchmarks/swebench/build_images.py
benchmarks/swtbench/build_eval_env_images.py
benchmarks/swtbench/image_utils.py

Checklist

✅ Reviewed PR #507 diff
✅ Reviewed PR #505 diff for comparison
✅ Posted Linus-style roasted code review
✅ Identified strong features from PR #505 to port (explanatory comments only)
✅ Declined to port PR #505's anti-patterns (env var handoff, SDK logic duplication)
✅ Pre-commit checks pass
✅ Changes pushed to fix/issue-504-cached-sdist branch

simonrosenberg · 2026-03-12T14:48:17Z

Triggered fresh full image-build validation runs from the current feature branch tip 68bf988 against current software-agent-sdk main (aa9df699cdf22dab5b46ecf23a48d3342413444e).

Monitor here:

SWE-Bench full build: https://github.com/OpenHands/benchmarks/actions/runs/23007890612
SWT-Bench full build: https://github.com/OpenHands/benchmarks/actions/runs/23007890709

Run details:

branch: fix/issue-504-cached-sdist
swebench: n-limit=500
swtbench: n-limit=500, build-eval-env=false

These were dispatched after the latest review fixes. They are currently pending behind the older branch-scoped runs while GitHub finishes cancelling those earlier jobs.

simonrosenberg · 2026-03-12T15:08:06Z

Current validation runs for the SDK-native prebuilt-sdist API change:

SWE-Bench full build: https://github.com/OpenHands/benchmarks/actions/runs/23008709390
SWT-Bench full build: https://github.com/OpenHands/benchmarks/actions/runs/23008709458

Current refs:

benchmarks branch: fix/issue-504-cached-sdist @ 355829e
sdk branch: fix/prebuilt-sdist-api @ d0c1a392

Older branch-scoped workflow runs were cancelled so these are the active runs to track.

all-hands-bot

Taste Rating: 🟡 Acceptable - Pragmatic fix for a real performance problem. The solution is clean, preserves SDK build logic, and handles failures gracefully.

[KEY INSIGHT]
This solves the actual problem (SDK rebuilding the same sdist N times) with straightforward engineering: build once, pass the path through, clean up after. Previous review concerns about env var indirection have been addressed by making the data flow explicit.

[MINOR OBSERVATIONS]
See inline comments. The workflow hardening changes feel like scope creep but are acceptable since they all relate to build robustness. The 10-parameter function is extracting complexity to reduce nesting, which is the right trade-off.

VERDICT: ✅ Worth merging - Core logic is sound, no blocking issues.

all-hands-bot · 2026-03-12T15:25:12Z

.github/workflows/build-swebench-images.yml


    runs-on:
      labels: blacksmith-32vcpu-ubuntu-2204
+    timeout-minutes: 180


🟡 Suggestion - Scope: Workflow improvements (timeout, build timing, disk reporting) are good defensive engineering, but feel like scope creep for a PR focused on caching the SDK sdist.

Acceptable because they all relate to making builds more reliable. Just be aware that bundling unrelated improvements makes bisecting regressions harder.

If splitting: the workflow hardening could be a separate PR titled "ci: add build timeouts and disk monitoring".

all-hands-bot · 2026-03-12T15:25:27Z

benchmarks/utils/build_utils.py

    raise RuntimeError("Unreachable code reached in _build_with_logging")


+def _iter_batch_results(


🟢 Acceptable - Pragmatism: Ten parameters is a lot, but this function is extracting the ProcessPoolExecutor loop from build_many() to reduce nesting depth.

The alternative would be bundling these into a config object, but that adds indirection without reducing actual complexity. Since all parameters are needed for the build process, this is a reasonable trade-off.

The nesting reduction in build_many() is more valuable than the parameter count smell.

all-hands-bot · 2026-03-12T15:25:38Z

benchmarks/utils/build_utils.py

+        except Exception as e:
+            logger.warning(
+                "Failed to pre-build SDK sdist; each image will build its own: %s", e
+            )


🟢 Good taste: Clean fallback behavior. If prebuild fails, log a warning and continue without caching—each image builds its own sdist.

This is robust: don't fail the entire batch because of an optimization failure. The warning ensures visibility without breaking the build.
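The same fallback can be sketched as a small wrapper. `try_prebuild_sdist` and its `prebuild` callable are hypothetical names used only for illustration; the point is that a failed optimization yields `None`, which downstream code can treat as "no cache, build normally".

```python
from __future__ import annotations

import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def try_prebuild_sdist(prebuild: Callable[[], Path]) -> Optional[Path]:
    """Return the cached sdist path, or None if pre-building failed."""
    try:
        return prebuild()
    except Exception as e:
        # The cache is an optimization; a failure here must not abort the batch.
        logger.warning(
            "Failed to pre-build SDK sdist; each image will build its own: %s", e
        )
        return None
```

A caller would then pass the result straight through as `cached_sdist`, so the `None` case needs no special handling beyond skipping the interception.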

all-hands-bot · 2026-03-12T15:25:52Z

benchmarks/swebench/build_images.py

+# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
+# Rich console state before ProcessPoolExecutor forks (causes deadlocks).
+import logging


🟢 Acceptable - Pragmatism: Switching to stdlib logging to avoid Rich console state issues in multiprocessing is the right fix.

Rich's console state doesn't fork well across ProcessPoolExecutor workers—this was causing deadlocks. The comment clearly explains the constraint for future maintainers.

simonrosenberg · 2026-03-12T18:21:43Z

Triggered fresh full validation runs on fix/issue-504-cached-sdist after the SDK shared-cache change and workflow batch/worker configurability updates.

SWE-Bench: https://github.com/OpenHands/benchmarks/actions/runs/23017371628
SWT-Bench: https://github.com/OpenHands/benchmarks/actions/runs/23017371689

Both runs are on benchmarks 3d93862f394ba608f2547062f16287a96efcd741 and SDK fc962c4c8d7fd6eb0e91db06a5ec9911b29589ac.

Update the software-agent-sdk submodule from fc962c4 to 447aa91 so the issue-504 SWT branch uses the SDK build path that keeps shared cache reads but disables shared cache writes by default. This preserves the prebuilt-sdist work while removing the parallel registry export contention identified in issue #510. Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini · 2026-03-13T14:23:10Z

Reposting this here so it doesn't get lost in Slack.

https://github.com/OpenHands/benchmarks/actions/runs/23043936501/job/66928367859 succeeded because it relied on the build from
https://github.com/OpenHands/benchmarks/actions/runs/23033839708. Both built the same SDK, 8e8223b24dfd041875fe845b13326b0a1238b793,
so the one that succeeded started with a 203-image head start:
✅ Done: 47%|████▋ | 203/433 [00:20<00:09, 24.25it/s, ✅ 203 ❌ 0 🏃 47
...
✅ Done: 100%|██████████| 433/433 [4:10:33<00:00, 17.59s/it, ✅ 433 ❌ 0 🏃 0]
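As a rough way to read those two progress lines, the arithmetic below separates the head start from the fresh builds. It assumes the first 203 images were already available and that the remaining time went to the other images; it measures wall-clock throughput across parallel workers, not the build time of any single image.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


total_seconds = hms_to_seconds("4:10:33")  # total from the final progress line
total_images = 433
head_start = 203                           # images done within the first 20 s
fresh_images = total_images - head_start

print(total_seconds)                           # 15033
print(fresh_images)                            # 230
print(round(total_seconds / fresh_images, 1))  # about 65.4 s of wall clock per fresh image
```

Under that reading, the 4:10:33 figure reflects roughly 230 cold builds rather than 433.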

all-hands-bot reviewed Mar 12, 2026


simonrosenberg added 2 commits March 12, 2026 10:45

ci: harden SDK image build workflows

c614f83

refactor: simplify cached sdist plumbing

65331d6

simonrosenberg requested a review from all-hands-bot March 12, 2026 14:12

all-hands-bot reviewed Mar 12, 2026


simonrosenberg mentioned this pull request Mar 12, 2026

fix: pre-build SDK sdist once to eliminate 500-image build regression (#504) #505

Closed

4 tasks

fix: use SDK prebuilt sdist API

355829e

simonrosenberg requested a review from all-hands-bot March 12, 2026 15:20

all-hands-bot reviewed Mar 12, 2026


build: expose parallelism knobs and shared SDK cache

3d93862

simonrosenberg and others added 3 commits March 12, 2026 20:13

build: disable cache exports for SWT image workflow

c1423f7

build: point SWT workflow at cache-export fix

277cb51

simonrosenberg mentioned this pull request Mar 13, 2026

SWT-Bench cold image builds regress badly under shared registry cache export #510

Open

build: remove image workflow timeouts

039aebd

This was referenced Mar 13, 2026

fix: cache SDK sdist across image builds #515

Merged

build: expose image build parallelism knobs to workflows #516

Merged

