fix: pre-build SDK sdist once to eliminate 500-image build regression (#504) by simonrosenberg · Pull Request #505 · OpenHands/benchmarks

simonrosenberg · 2026-03-12T03:24:40Z

Summary

Pre-build SDK sdist once and share across all 500 image builds, eliminating ~2h of redundant uv build --sdist overhead
Replace openhands.sdk.get_logger with stdlib logging in benchmarks utilities to prevent Rich/fork deadlocks in ProcessPoolExecutor workers
Add CI workflow improvements: timeout-minutes, build timing instrumentation, preflight BuildKit prune

Root Cause

The SDK's _make_build_context() runs uv build --sdist for every image build (~15s each). With 500 images, this adds ~2 hours of pure overhead — the sdist is identical for all images (same SDK code, different base images).
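The ~2 hour figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, assuming the per-image cost is a flat 15 s and counting cumulative time rather than wall-clock time (the function name is illustrative, not from the repo):

```python
# Back-of-the-envelope estimate of the redundant sdist cost.
SECONDS_PER_SDIST = 15  # approximate per-image `uv build --sdist` time quoted in the PR
IMAGE_COUNT = 500       # SWE-bench image count quoted in the PR


def redundant_overhead_hours(per_build_s: float, images: int) -> float:
    """Cumulative time spent rebuilding an identical sdist, in hours."""
    return per_build_s * images / 3600


overhead_h = redundant_overhead_hours(SECONDS_PER_SDIST, IMAGE_COUNT)
print(f"{overhead_h:.2f} h")  # 7500 s / 3600 = 2.08 h
```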

Fix

build_all_images() now calls _pre_build_sdist() once before the batch loop. The cached sdist path is passed to workers via OPENHANDS_CACHED_SDIST env var. _build_with_cached_sdist() extracts the pre-built tarball and runs docker buildx build directly, bypassing the per-image uv build --sdist. Falls back to the SDK's native build() if pre-build fails.

This works for any future SDK version since it builds from whatever submodule SHA is checked out.
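The build-once, share-via-environment pattern described above can be sketched roughly as follows. This is not the PR's actual code: the function names, the injectable `builder` parameter, and the temp-dir prefix are assumptions for illustration. Only the `OPENHANDS_CACHED_SDIST` variable name and the `uv build --sdist` command come from the description.

```python
import os
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional

SDIST_ENV_VAR = "OPENHANDS_CACHED_SDIST"  # env var name taken from the PR description


def run_uv_sdist(sdk_root: Path, out_dir: Path) -> None:
    """Build an sdist with uv. Assumes `uv` is available on PATH."""
    subprocess.run(
        ["uv", "build", "--sdist", "--out-dir", str(out_dir)],
        cwd=sdk_root,
        check=True,
    )


def pre_build_sdist(
    sdk_root: Path,
    builder: Callable[[Path, Path], None] = run_uv_sdist,
) -> Optional[Path]:
    """Build the sdist once; return its path, or None so callers can fall back."""
    # Hypothetical prefix, chosen to resemble the path in the CI log line.
    out_dir = Path(tempfile.mkdtemp(prefix="shared-sdist-"))
    try:
        builder(sdk_root, out_dir)
    except (OSError, subprocess.CalledProcessError):
        return None
    tarballs = sorted(out_dir.glob("*.tar.gz"))
    return tarballs[0] if tarballs else None


def publish_sdist(sdist: Optional[Path]) -> None:
    """Expose the cached path to worker processes through the environment."""
    if sdist is not None:
        os.environ[SDIST_ENV_VAR] = str(sdist)


def choose_build_path() -> str:
    """Worker-side decision: use the cached sdist only if it exists on disk."""
    cached = os.environ.get(SDIST_ENV_VAR)
    if cached and Path(cached).is_file():
        return "cached"
    return "fallback"
```

Because worker processes inherit the parent's environment, setting the variable before the pool starts is enough for every worker to see the same tarball. Returning `None` on failure keeps the per-image path available as a fallback.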

Results

Benchmark	Images	Build Step	Total Run	Failures
SWE-bench	500	28 min	41 min	0
SWT-bench	433	18 min	32 min	0

vs. the regression: 5h49m+ (cancelled at batch 19/34, 599 GiB BuildKit usage)

Evidence from CI logs:

Pre-built SDK sdist in 3.5s: /tmp/shared-sdist-.../unknown-0.0.0.tar.gz

Test plan

SWE-bench 500-image build: 0 failures, 28 min
SWT-bench 433-image build: 0 failures, 18 min
Sdist pre-build logged at 3.5s (vs ~2h cumulative before)
Graceful fallback: if _pre_build_sdist() fails, per-image build() still works

Closes #504

🤖 Generated with Claude Code

all-hands-bot

Taste Rating

🟡 Acceptable - Pragmatic fix for real production problem, but needs evidence the SDK revert is safe.

Linus-Style Analysis

Core Philosophy Check:
✅ Solving a REAL problem (5hr builds → 60min), not theoretical
✅ Logging fix is the RIGHT solution - Rich console state DOES deadlock in forked workers
✅ Simple, focused changes - no over-engineering
⚠️ Missing proof that SDK revert won't break things

[CRITICAL ISSUES]

SDK Revert Safety: You claim b498a699 is "known-good baseline" but provide ZERO evidence. What if code written after that commit depends on features from bde715c1? Show me test results or build logs proving this revert doesn't break functionality. This is "we don't break userspace" territory.

Logging Configuration: You replaced Rich logging with stdlib in 6 files, but I see no logging.basicConfig() call. Where is logging configured? Stdlib logging is silent by default - users will lose all log output unless something configures handlers and formatters.

[IMPROVEMENT OPPORTUNITIES]

CI Bash Complexity: The preflight disk check has nested if/else with multiple fallbacks. Acceptable for infrastructure tooling, but could be simpler.

[ACCEPTABLE PRAGMATISM]

🟢 No automated tests for logging changes - for identical import swaps across files, manual CI verification is reasonable given the build infrastructure focus.

🟢 CI instrumentation (timeouts, disk monitoring) - necessary observability for debugging build issues.

VERDICT:
⚠️ Conditionally approve: Core logging fix is sound, but you MUST provide evidence the SDK revert is safe before merging. Run the full test suite and show it passes.

KEY INSIGHT:
The logging fix demonstrates good engineering taste - you identified the root cause (fork-unsafe Rich state) and applied the minimal fix (stdlib logging). But the SDK revert is flying blind without test coverage.

all-hands-bot · 2026-03-12T03:27:02Z

benchmarks/swebench/build_images.py

+import logging

-logger = get_logger(__name__)
+logger = logging.getLogger(__name__)


🟠 Important: You replaced openhands logger with stdlib logging, but where is logging.basicConfig() configured? Stdlib logging is silent by default - it needs handlers and formatters to actually output anything. Without configuration, all your log statements will go nowhere.

Check if there's a central initialization point, or add basic config in the main entry points.
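One way to address this is a single configuration call at each entry point. Strictly speaking, an unconfigured stdlib logger still prints WARNING and above through its last-resort handler; it is the INFO and DEBUG records that are dropped. The sketch below assumes a hypothetical `configure_logging()` helper that is not part of the PR, and the format string is only an example.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> None:
    """Attach a stderr handler to the root logger; call once from the entry point."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        stream=sys.stderr,
        force=True,  # replace handlers installed earlier (Python 3.8+)
    )


# Module-level logger, matching the pattern in the diff above.
logger = logging.getLogger(__name__)
```

Calling `configure_logging()` at the top of `main()` and before any worker pool is created would make `logger.info(...)` calls in the build modules visible again.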

all-hands-bot · 2026-03-12T03:27:03Z

vendor/software-agent-sdk

@@ -1 +1 @@
-Subproject commit bde715c12bce8fb112980529d5ad162f6b81a7f1
+Subproject commit b498a69908f7d06feb3921ffe05ff7e781a6f108


🔴 Critical: This SDK revert claims b498a699 is a "known-good baseline" but you provide ZERO evidence it won't break functionality. What if code written after this commit depends on features introduced in bde715c1?

Required before merge: Run the full test suite against this reverted SDK and show passing results. Or at minimum, explain what changed between these commits and why reverting is safe.

all-hands-bot · 2026-03-12T03:27:03Z

.github/workflows/build-swtbench-images.yml

+      - name: "Preflight: prune cache and verify BuildKit disk"
+        run: |
+          set -euo pipefail
+          KEEP_GB=60
+          echo "Pruning BuildKit cache (target max-storage ${KEEP_GB} GiB, no filters)..."
+          if ! docker buildx prune --all --force --max-storage ${KEEP_GB}g; then
+            docker buildx prune --all --force --keep-storage ${KEEP_GB}g || true
+          fi
+
+          if df -B1 /var/lib/buildkit > /tmp/buildkit_df 2>/dev/null; then
+            LINE=$(tail -n1 /tmp/buildkit_df)
+            TOTAL=$(echo "$LINE" | awk '{print $2}')
+            USED=$(echo "$LINE" | awk '{print $3}')
+            FREE=$(echo "$LINE" | awk '{print $4}')
+            if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then
+              PCT=$(( 100 * USED / TOTAL ))
+              echo "BuildKit disk: used ${USED} / ${TOTAL} bytes (${PCT}%); free ${FREE} bytes"
+              MIN=$((75 * 1024 * 1024 * 1024))
+              if [ "$FREE" -lt "$MIN" ]; then
+                echo "::error::Not enough free space on /var/lib/buildkit (${FREE} bytes free, need >= ${MIN})"
+                exit 1
+              fi
+            else
+              echo "Warning: unable to parse df output for /var/lib/buildkit"
+            fi
+          else
+            echo "Warning: /var/lib/buildkit not found; skipping disk check"


🟡 Suggestion: This disk check has nested conditionals and multiple fallback strategies. It works, but consider extracting to a script file for clarity.

That said, for one-off infrastructure tooling, this level of inline complexity is acceptable - not worth blocking the PR over.
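If the check were extracted as suggested, a Python version might look like the sketch below. It is an illustration, not code from the PR: `check_free_space` is a hypothetical name, and the 75 GiB threshold and the skip-on-missing-path behaviour are copied from the workflow excerpt above.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def check_free_space(path: str, min_free_gib: int = 75) -> bool:
    """Return True if `path` has at least `min_free_gib` GiB free.

    A missing path only produces a warning and the check is skipped
    (treated as passing), as in the workflow step.
    """
    if not Path(path).exists():
        print(f"Warning: {path} not found; skipping disk check")
        return True
    usage = shutil.disk_usage(path)
    pct = 100 * usage.used // usage.total if usage.total else 0
    print(
        f"BuildKit disk: used {usage.used} / {usage.total} bytes "
        f"({pct}%); free {usage.free} bytes"
    )
    min_free = min_free_gib * GIB
    if usage.free < min_free:
        # GitHub Actions renders lines starting with ::error:: as annotations.
        print(
            f"::error::Not enough free space on {path} "
            f"({usage.free} bytes free, need >= {min_free})"
        )
        return False
    return True
```

A workflow step could then call `check_free_space("/var/lib/buildkit")` and exit non-zero when it returns `False`, which replaces the `df` and `awk` parsing with a single standard-library call.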

Three-part fix for the 500-image build regression introduced in #456:

1. Pin SDK submodule to 30819566 (proven in 34-min 500-image build)
   - The regression between bde715c1 and this SHA has been fixed upstream
   - Images with this SHA already exist in the GHCR registry
   - Restores fast warm-cache builds (~34 minutes vs 5+ hours)

2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across ProcessPoolExecutor forks (deadlock fix)

3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step

Fixes #504
Refs #502, #503

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The root cause of the 500-image build regression (#504) is that `uv build --sdist` runs once PER IMAGE (~15s × 500 = ~2 hours). The sdist is identical for all images (same SDK code, different base images), so building it 500 times is pure waste.

This fix:
- Adds `_pre_build_sdist()` that builds the sdist ONCE before the batch loop in `build_all_images()`
- Sets `OPENHANDS_CACHED_SDIST` env var so forked worker processes inherit the cached path
- Adds `_build_with_cached_sdist()` that extracts the pre-built sdist and runs `docker buildx build` directly, bypassing the SDK's per-image `uv build --sdist`
- Falls back to the SDK's `build()` function if pre-build fails

This approach works for ANY future SDK version since it operates entirely in the benchmarks repo. The SDK submodule SHA no longer affects build duration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove the temporary SDK pin to 30819566. With the sdist caching fix, the build works with any SDK version. Updating to the latest upstream (aa9df699) proves this and picks up recent SDK improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ported from PR #505: explains why openhands.sdk.get_logger is not used in build-related modules, to avoid Rich console state deadlocks when ProcessPoolExecutor forks worker processes.

Co-authored-by: openhands <openhands@all-hands.dev>

simonrosenberg · 2026-03-12T14:31:30Z

Closing this PR since it's a duplicate of #507

simonrosenberg added the build-swebench (Build 500 SWE-Bench Verified Image based on SDK version on this PR) and build-swt-bench (Build SWT-Bench images based on SDK version on this PR) labels Mar 12, 2026

all-hands-bot reviewed Mar 12, 2026


simonrosenberg force-pushed the fix/504-build-regression-sdk-revert-and-logging branch 2 times, most recently from 7807fef to c00ae42 Compare March 12, 2026 06:34

simonrosenberg force-pushed the fix/504-build-regression-sdk-revert-and-logging branch from c00ae42 to 7780c58 Compare March 12, 2026 06:37

simonrosenberg mentioned this pull request Mar 12, 2026

Track root cause isolation and fix plan for the 500-image build regression after #456 #504

Open

1 task

simonrosenberg changed the title ~~fix: revert SDK to known-good baseline and fix 500-image build regression~~ Ffix 500-image build regression Mar 12, 2026

simonrosenberg changed the title ~~Ffix 500-image build regression~~ fix: pre-build SDK sdist once to eliminate 500-image build regression (#504) Mar 12, 2026

simonrosenberg mentioned this pull request Mar 12, 2026

fix: cache SDK sdist across image builds #507

Open

simonrosenberg closed this Mar 12, 2026

simonrosenberg deleted the fix/504-build-regression-sdk-revert-and-logging branch March 12, 2026 14:34

simonrosenberg wants to merge 3 commits into main from fix/504-build-regression-sdk-revert-and-logging


2 participants
