fix: cache SDK sdist across image builds #507

Open
simonrosenberg wants to merge 10 commits into main from fix/issue-504-cached-sdist

Conversation

@simonrosenberg
Collaborator

@simonrosenberg simonrosenberg commented Mar 12, 2026

Summary

  • pre-build the SDK sdist once per image-build batch and reuse it across all workers
  • intercept only the SDK's internal uv build --sdist subprocess call so the SDK still owns tag, cache, and Docker build behavior
  • switch build-related modules to stdlib logging to avoid inheriting SDK Rich console state across worker forks
  • add focused tests for the cached-sdist interception seam
  • bump the SDK submodule to 8e8223b24dfd041875fe845b13326b0a1238b793
  • set OPENHANDS_BUILDKIT_CACHE_MODE=off for the SWT image-build workflow
  • remove the timeout-minutes: 180 limit from both SWE-Bench and SWT-Bench image-build workflows

Why this is the real fix

The first regression in this area was repeated SDK sdist construction across hundreds of image builds. This PR removes that repeated work in benchmarks without forking SDK behavior.

While validating that change on full SWT-Bench runs, the dominant remaining slowdown turned out to be registry cache export during cold fan-out image builds. The SDK PR adds cache export controls; this PR wires those controls into the SWT workflow and removes the workflow timeout that was cancelling otherwise healthy long-running image builds.

Because benchmarks still calls the SDK's native build() path, the repo keeps using SDK-owned tagging, Docker args, and cache semantics instead of duplicating them here.

Validation

  • make build
  • uv run pytest tests/test_image_utils.py tests/test_llm_config.py
  • uv run pre-commit run --files benchmarks/utils/build_utils.py benchmarks/utils/buildx_utils.py benchmarks/utils/image_utils.py benchmarks/swebench/build_images.py benchmarks/swtbench/build_eval_env_images.py benchmarks/swtbench/image_utils.py tests/test_image_utils.py
  • full SWT-Bench 433-instance image build using this branch + SDK 8e8223b24dfd041875fe845b13326b0a1238b793:
    • run: 23043936501
    • result from logs: 433/433 images built, ❌ 0, total image-build time 4:10:33

Fixes #504

Build the SDK sdist once per batch and reuse it through the SDK's native build() path by intercepting only the internal uv sdist subprocess call. This removes the per-image sdist overhead while preserving tag, cache, and build-arg behavior for current and future SDK versions.

Also switch build-related modules to stdlib logging so worker forks do not inherit SDK Rich console state.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

@all-hands-bot all-hands-bot left a comment

Taste Rating: 🟡 Acceptable - Solves a real performance problem, but the solution is more complex than it needs to be.

[CRITICAL ISSUES]

The core optimization is sound (build sdist once instead of N times), but the implementation adds unnecessary indirection that will make this harder to maintain.

[IMPROVEMENT OPPORTUNITIES]

See inline comments for specific architectural improvements that would simplify the data flow and reduce coupling to SDK internals.

VERDICT: ✅ Worth merging - Performance improvement is real and valuable. Fallback behavior prevents breaking changes. The complexity is manageable but could be simplified in follow-up work.

KEY INSIGHT: Monkey-patching SDK internals works but creates maintenance coupling. Consider upstreaming sdist caching to the SDK itself, or pass the pre-built sdist path explicitly through the call chain instead of via environment variable + runtime patching.

@simonrosenberg
Collaborator Author

Addressed the review feedback in 65331d6.

Changes made:

  • removed the env-var handoff and now pass cached_sdist explicitly through build_all_images() -> _iter_batch_results() -> _build_with_logging() -> build_image()
  • tightened the interception predicate so cached-sdist reuse only applies to uv build --sdist calls whose cwd is the SDK root
  • reduced nesting in build_all_images() by flattening the cached-sdist setup into the main with block and extracting batch execution into _iter_batch_results()
  • added a stronger integration-style test against the real imported SDK build module in addition to the unit tests

I kept the patch-at-the-SDK-build-seam design because it still avoids duplicating the SDK build path in benchmarks, while making the data flow more explicit and defensive.
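
For readers following along, the tightened predicate described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function names are borrowed from the test names mentioned in this thread, and the exact signatures and matching rules are assumptions.

```python
from pathlib import Path
from typing import Optional, Sequence, Union

PathLike = Union[str, Path]


def is_sdist_build_command(cmd: Sequence[str]) -> bool:
    """Return True for commands shaped like `uv build --sdist ...` (assumed shape)."""
    args = [str(a) for a in cmd]
    return args[:2] == ["uv", "build"] and "--sdist" in args[2:]


def is_sdk_sdist_build_command(
    cmd: Sequence[str], cwd: Optional[PathLike], sdk_root: PathLike
) -> bool:
    """Only match sdist builds whose working directory is the SDK root."""
    if cwd is None or not is_sdist_build_command(cmd):
        return False
    # resolve() normalizes symlinks and relative segments before comparing.
    return Path(cwd).resolve() == Path(sdk_root).resolve()
```

Requiring the cwd check means an unrelated `uv build --sdist` issued from some other project directory would still run normally instead of receiving the cached SDK sdist.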

Collaborator

@all-hands-bot all-hands-bot left a comment

Taste Rating: 🟡 Acceptable - Solves a real performance problem with pragmatic engineering. The patching approach is more complex than ideal, but given the constraint of preserving SDK build logic, it's a reasonable trade-off.

@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted
review this PR as well as this other PR that fixes the same issue but slightly differently #505
Are there any strong features in PR #505 that should be ported to this PR?

@openhands-ai

openhands-ai bot commented Mar 12, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Collaborator Author

🔥 Linus-Style Code Review: PR #507 vs PR #505

Taste Rating: 🟢 Good taste (PR #507)

PR #507 demonstrates "good taste" - it makes the minimal change to solve the problem while keeping the SDK in control of its own build behavior. Let me explain why this matters.


The Core Difference

PR #507 (this PR): Patches the SDK's internal _run() helper to short-circuit only the uv build --sdist subprocess call. The SDK's build() is still called - it just gets a cached sdist copied instead of building a new one.

PR #505: Completely replaces the SDK's build() with ~100 lines of manually-constructed docker buildx build commands, tarball extraction, and Dockerfile copying.


Why PR #507 is the Right Design

1. "Don't Duplicate; Delegate"

PR #507:

with _patch_sdk_sdist_build(sdk_build_module, cached_sdist):
    tags = build(opts)  # SDK still owns all build logic

PR #505:

# Replicates 100 lines of SDK internals:
args = ["docker", "buildx", "build", "--file", ...]
args += ["--build-arg", f"BASE_IMAGE={opts.base_image}"]
args += ["--build-arg", f"OPENHANDS_BUILD_GIT_SHA={opts.git_sha}"]
# ... and so on

When the SDK adds a new build arg (say, --build-arg OPENHANDS_BUILD_FEATURES=...), PR #507 will automatically pick it up. PR #505 will silently diverge and produce subtly different images.

2. Future-Proofing via Graceful Fallback

PR #507 fails gracefully:

original_run = getattr(sdk_build_module, "_run", None)
if original_run is None:
    logger.warning("SDK build module has no _run helper; falling back to native build()")
    yield
    return

If the SDK refactors away _run, benchmarks falls back to the SDK's native behavior instead of exploding.
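
To make the patch-and-fall-back idea concrete, here is a self-contained sketch of what such a context manager could look like. It is an illustration under assumptions, not the PR's implementation: the `_run(cmd, cwd=...)` signature, the use of `--out-dir` to locate the destination, the `None` return value, and the name `patch_sdist_build` are all hypothetical.

```python
import contextlib
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def patch_sdist_build(build_module, cached_sdist: Path, sdk_root: Path):
    """Temporarily replace build_module._run so SDK sdist builds reuse a cached file."""
    original_run = getattr(build_module, "_run", None)
    if original_run is None:
        # Nothing to patch: let the native build path run unchanged.
        logger.warning("build module has no _run helper; using native build()")
        yield
        return

    def patched_run(cmd, cwd=None, **kwargs):
        args = [str(a) for a in cmd]
        is_sdist = args[:2] == ["uv", "build"] and "--sdist" in args
        in_sdk = cwd is not None and Path(cwd).resolve() == Path(sdk_root).resolve()
        if is_sdist and in_sdk and "--out-dir" in args[:-1]:
            # Assumption: the destination is passed via `--out-dir <dir>`.
            out_dir = Path(cwd) / args[args.index("--out-dir") + 1]
            out_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(cached_sdist, out_dir / cached_sdist.name)
            return None  # Assumption: callers ignore the return value.
        # Every other command goes to the original helper untouched.
        return original_run(cmd, cwd=cwd, **kwargs)

    build_module._run = patched_run
    try:
        yield
    finally:
        # Always restore the original helper, even if the build raises.
        build_module._run = original_run
```

The `finally` block matters in a long-lived worker process: without it, a failed build would leave the patched helper installed for every later call.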

3. Explicit Parameter Passing > Hidden Environment Variables

PR #507 passes cached_sdist explicitly through the call chain:

def build_image(..., cached_sdist: Path | None = None):

PR #505 uses a hidden env var:

os.environ["OPENHANDS_CACHED_SDIST"] = str(cached_sdist_path)
# ... and later reads it implicitly in build_image()
cached_sdist = os.environ.get("OPENHANDS_CACHED_SDIST")

Explicit is better than implicit. The function signature documents the interface; environment variables create hidden coupling.

4. Testability

PR #507 has focused tests:

  • test_is_sdist_build_command_matches_expected_shape
  • test_is_sdk_sdist_build_command_requires_sdk_cwd
  • test_patch_sdk_sdist_build_reuses_cached_sdist_only_for_sdk_sdist_commands
  • test_build_image_reuses_cached_sdist_with_real_sdk_build_module

PR #505 has no tests for _build_with_cached_sdist() at all. That 100-line function with tarfile.open(), Dockerfile copying, and subprocess construction? Untested.


Strong Features from PR #505 Worth Porting

✅ Explanatory Comments for the Logging Switch

PR #505 has inline comments explaining why the stdlib logging switch matters:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).
import logging

PR #507 just switches the import without explaining the deadlock context. This comment should be added for future maintainers.

Recommendation: Add a brief comment to build_utils.py explaining the Rich/fork deadlock issue.

❌ Environment Variable Approach (Don't Port)

The OPENHANDS_CACHED_SDIST env var in PR #505 is actually an anti-pattern. PR #507's explicit parameter passing is cleaner.

❌ Manual buildx Command Construction (Definitely Don't Port)

This is exactly what creates the maintenance burden. Keep the SDK in control.


Minor Nits (Skip These - Linters Exist)

I don't care about import ordering or spacing. The code passes pre-commit.


VERDICT: Worth merging (PR #507)

The core logic is sound. PR #507 makes the minimal, surgical change to eliminate the repeated uv build --sdist overhead while keeping the SDK in control of everything else.

KEY INSIGHT: The difference between "good taste" and "bad taste" here is whether you trust the SDK to own its build behavior. PR #507 says "yes, just let me cache the expensive bit." PR #505 says "no, I'll reimplement your build logic myself." The first approach scales; the second creates a fork.


Suggested Follow-up (Non-Blocking)

Add this comment at the top of benchmarks/utils/build_utils.py:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).

This documents the "why" for future maintainers.

Ported from PR #505 - explains why openhands.sdk.get_logger is not used
in build-related modules to avoid Rich console state deadlocks when
ProcessPoolExecutor forks worker processes.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator Author

Ported the explanatory comments from PR #505 in 68bf988.

All six affected files now document why stdlib logging is used instead of openhands.sdk.get_logger (to avoid Rich console state deadlocks during ProcessPoolExecutor forks).

@openhands-ai

openhands-ai bot commented Mar 12, 2026

Summary

I completed the code review comparing PR #507 (this PR) with PR #505, and ported the one strong feature from #505.

Code Review Posted

Changes Made (Already Pushed)

Commit 68bf988: Added explanatory comments to 6 files explaining why stdlib logging is used instead of openhands.sdk.get_logger:

# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).

Files updated:

  • benchmarks/utils/build_utils.py
  • benchmarks/utils/buildx_utils.py
  • benchmarks/utils/image_utils.py
  • benchmarks/swebench/build_images.py
  • benchmarks/swtbench/build_eval_env_images.py
  • benchmarks/swtbench/image_utils.py

Checklist

@simonrosenberg
Collaborator Author

Triggered fresh full image-build validation runs from the current feature branch tip 68bf988 against current software-agent-sdk main (aa9df699cdf22dab5b46ecf23a48d3342413444e).

Monitor here:

Run details:

  • branch: fix/issue-504-cached-sdist
  • swebench: n-limit=500
  • swtbench: n-limit=500, build-eval-env=false

These were dispatched after the latest review fixes. They are currently pending behind the older branch-scoped runs while GitHub finishes cancelling those earlier jobs.

@simonrosenberg
Collaborator Author

Current validation runs for the SDK-native prebuilt-sdist API change:

Current refs:

  • benchmarks branch: fix/issue-504-cached-sdist @ 355829e
  • sdk branch: fix/prebuilt-sdist-api @ d0c1a392

Older branch-scoped workflow runs were cancelled so these are the active runs to track.

Collaborator

@all-hands-bot all-hands-bot left a comment

Taste Rating: 🟡 Acceptable - Pragmatic fix for a real performance problem. The solution is clean, preserves SDK build logic, and handles failures gracefully.

[KEY INSIGHT]
This solves the actual problem (SDK rebuilding the same sdist N times) with straightforward engineering: build once, pass the path through, clean up after. Previous review concerns about env var indirection have been addressed by making the data flow explicit.

[MINOR OBSERVATIONS]
See inline comments. The workflow hardening changes feel like scope creep but are acceptable since they all relate to build robustness. The 10-parameter function is extracting complexity to reduce nesting, which is the right trade-off.

VERDICT: ✅ Worth merging - Core logic is sound, no blocking issues.


runs-on:
  labels: blacksmith-32vcpu-ubuntu-2204
timeout-minutes: 180
Collaborator

🟡 Suggestion - Scope: Workflow improvements (timeout, build timing, disk reporting) are good defensive engineering, but feel like scope creep for a PR focused on caching the SDK sdist.

Acceptable because they all relate to making builds more reliable. Just be aware that bundling unrelated improvements makes bisecting regressions harder.

If splitting: the workflow hardening could be a separate PR titled "ci: add build timeouts and disk monitoring".

raise RuntimeError("Unreachable code reached in _build_with_logging")


def _iter_batch_results(
Collaborator

🟢 Acceptable - Pragmatism: Ten parameters is a lot, but this function is extracting the ProcessPoolExecutor loop from build_many() to reduce nesting depth.

The alternative would be bundling these into a config object, but that adds indirection without reducing actual complexity. Since all parameters are needed for the build process, this is a reasonable trade-off.

The nesting reduction in build_many() is more valuable than the parameter count smell.

Comment on lines +246 to +249
except Exception as e:
    logger.warning(
        "Failed to pre-build SDK sdist; each image will build its own: %s", e
    )
Collaborator

🟢 Good taste: Clean fallback behavior. If prebuild fails, log a warning and continue without caching—each image builds its own sdist.

This is robust: don't fail the entire batch because of an optimization failure. The warning ensures visibility without breaking the build.
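
The fallback pattern praised here can be shown in isolation. The sketch below is an assumption-laden illustration rather than the PR's code: `prebuild_sdist_or_none` and its `prebuild` callable are hypothetical names, and only the warning message is taken from the diff excerpt above.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def prebuild_sdist_or_none(prebuild: Callable[[], Path]) -> Optional[Path]:
    """Run the optional pre-build step; return None instead of failing the batch."""
    try:
        return prebuild()
    except Exception as e:
        # The optimization failed, so callers fall back to per-image sdist builds.
        logger.warning(
            "Failed to pre-build SDK sdist; each image will build its own: %s", e
        )
        return None
```

A caller would then pass the result straight through as `cached_sdist`, with `None` meaning "no cache, use the native path".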

Comment on lines +12 to +14
# Use stdlib logging instead of openhands.sdk.get_logger to avoid initializing
# Rich console state before ProcessPoolExecutor forks (causes deadlocks).
import logging
Collaborator

🟢 Acceptable - Pragmatism: Switching to stdlib logging to avoid Rich console state issues in multiprocessing is the right fix.

Rich's console state doesn't fork well across ProcessPoolExecutor workers—this was causing deadlocks. The comment clearly explains the constraint for future maintainers.
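
As a rough illustration of the constraint described in this comment, a build module can keep its module-level logger on the standard library so that nothing Rich-related is initialized before the pool forks. The function names below are hypothetical and the build step is a stand-in; this is a sketch of the layout, not the repository's code.

```python
import logging
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List

# Stdlib logger only: importing this module does not set up any console state.
logger = logging.getLogger(__name__)


def build_one(image: str) -> str:
    """Stand-in for a single image build executed inside a worker process."""
    logger.info("building %s", image)
    return f"built:{image}"


def build_batch(images: Iterable[str], max_workers: int = 2) -> List[str]:
    """Fan the builds out across worker processes and collect results in order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_one, images))
```

Handlers and formatting can still be configured once in the parent process with `logging.basicConfig(...)` before `build_batch` is called.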

@simonrosenberg
Collaborator Author

Triggered fresh full validation runs on fix/issue-504-cached-sdist after the SDK shared-cache change and workflow batch/worker configurability updates.

Both runs are on benchmarks 3d93862f394ba608f2547062f16287a96efcd741 and SDK fc962c4c8d7fd6eb0e91db06a5ec9911b29589ac.

simonrosenberg and others added 3 commits March 12, 2026 20:13
Update the software-agent-sdk submodule from fc962c4 to 447aa91 so the issue-504 SWT branch uses the SDK build path that keeps shared cache reads but disables shared cache writes by default. This preserves the prebuilt-sdist work while removing the parallel registry export contention identified in issue #510.

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini
Collaborator

juanmichelini commented Mar 13, 2026

Reposting this here so it doesn't get lost in Slack.

https://github.com/OpenHands/benchmarks/actions/runs/23043936501/job/66928367859 succeeded because it relied on the build from https://github.com/OpenHands/benchmarks/actions/runs/23033839708.
Both built the same SDK, 8e8223b24dfd041875fe845b13326b0a1238b793, so the run that succeeded started with a 203-image head start:
✅ Done:  47%|████▋     | 203/433 [00:20<00:09, 24.25it/s, ✅ 203  ❌ 0  🏃 47
...
✅ Done: 100%|██████████| 433/433 [4:10:33<00:00, 17.59s/it, ✅ 433  ❌ 0  🏃 0]
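
The two progress-bar lines above can be turned into rough arithmetic. The sketch below assumes that the 203 images finished in the first 20 seconds were reused rather than built, which is an interpretation of the log and not something the log states. The rates are wall-clock throughput across parallel workers, not per-image build durations.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert "H:MM:SS" or "MM:SS" into seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


total_images = 433
head_start_images = 203                       # done at the 00:20 mark
head_start_seconds = hms_to_seconds("00:20")
total_seconds = hms_to_seconds("4:10:33")

built_images = total_images - head_start_images
naive_rate = total_seconds / total_images     # averages over all 433 images
effective_rate = (total_seconds - head_start_seconds) / built_images

print(f"images actually built: {built_images}")
print(f"naive wall-clock seconds per image: {naive_rate:.1f}")
print(f"wall-clock seconds per built image: {effective_rate:.1f}")
```

Under that assumption, the 4:10:33 total covers 230 freshly built images rather than 433, so the run says less about cold-build time for the full set than the headline number suggests.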

Development

Successfully merging this pull request may close these issues.

Track root cause isolation and fix plan for the 500-image build regression after #456

4 participants