Skip to content

fix: pre-build SDK sdist once to eliminate 500-image build regression (#504)#505

Closed
simonrosenberg wants to merge 3 commits intomainfrom
fix/504-build-regression-sdk-revert-and-logging
Closed

fix: pre-build SDK sdist once to eliminate 500-image build regression (#504)#505
simonrosenberg wants to merge 3 commits intomainfrom
fix/504-build-regression-sdk-revert-and-logging

Conversation

@simonrosenberg
Copy link
Collaborator

@simonrosenberg simonrosenberg commented Mar 12, 2026

Summary

  • Pre-build SDK sdist once and share across all 500 image builds, eliminating ~2h of redundant uv build --sdist overhead
  • Replace openhands.sdk.get_logger with stdlib logging in benchmarks utilities to prevent Rich/fork deadlocks in ProcessPoolExecutor workers
  • Add CI workflow improvements: timeout-minutes, build timing instrumentation, preflight BuildKit prune

Root Cause

The SDK's _make_build_context() runs uv build --sdist for every image build (~15s each). With 500 images, this adds ~2 hours of pure overhead — the sdist is identical for all images (same SDK code, different base images).

Fix

build_all_images() now calls _pre_build_sdist() once before the batch loop. The cached sdist path is passed to workers via OPENHANDS_CACHED_SDIST env var. _build_with_cached_sdist() extracts the pre-built tarball and runs docker buildx build directly, bypassing the per-image uv build --sdist. Falls back to the SDK's native build() if pre-build fails.

This works for any future SDK version since it builds from whatever submodule SHA is checked out.

Results

Benchmark Images Build Step Total Run Failures
SWE-bench 500 28 min 41 min 0
SWT-bench 433 18 min 32 min 0

vs. the regression: 5h49m+ (cancelled at batch 19/34, 599 GiB BuildKit usage)

Evidence from CI logs:

Pre-built SDK sdist in 3.5s: /tmp/shared-sdist-.../unknown-0.0.0.tar.gz

Test plan

  • SWE-bench 500-image build: 0 failures, 28 min
  • SWT-bench 433-image build: 0 failures, 18 min
  • Sdist pre-build logged at 3.5s (vs ~2h cumulative before)
  • Graceful fallback: if _pre_build_sdist() fails, per-image build() still works

Closes #504

🤖 Generated with Claude Code

@simonrosenberg simonrosenberg added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. build-swt-bench Build SWT-Bench images based on SDK version on this PR. labels Mar 12, 2026
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating

🟡 Acceptable - Pragmatic fix for real production problem, but needs evidence the SDK revert is safe.

Linus-Style Analysis

Core Philosophy Check:
✅ Solving a REAL problem (5hr builds → 60min), not theoretical
✅ Logging fix is the RIGHT solution - Rich console state DOES deadlock in forked workers
✅ Simple, focused changes - no over-engineering
⚠️ Missing proof that SDK revert won't break things

[CRITICAL ISSUES]

SDK Revert Safety: You claim b498a699 is "known-good baseline" but provide ZERO evidence. What if code written after that commit depends on features from bde715c1? Show me test results or build logs proving this revert doesn't break functionality. This is "we don't break userspace" territory.

Logging Configuration: You replaced Rich logging with stdlib in 6 files, but I see no logging.basicConfig() call. Where is logging configured? Stdlib logging is silent by default - users will lose all log output unless something configures handlers and formatters.

[IMPROVEMENT OPPORTUNITIES]

CI Bash Complexity: The preflight disk check has nested if/else with multiple fallbacks. Acceptable for infrastructure tooling, but could be simpler.

[ACCEPTABLE PRAGMATISM]

🟢 No automated tests for logging changes - for identical import swaps across files, manual CI verification is reasonable given the build infrastructure focus.

🟢 CI instrumentation (timeouts, disk monitoring) - necessary observability for debugging build issues.

VERDICT:
⚠️ Conditionally approve: Core logging fix is sound, but you MUST provide evidence the SDK revert is safe before merging. Run the full test suite and show it passes.

KEY INSIGHT:
The logging fix demonstrates good engineering taste - you identified the root cause (fork-unsafe Rich state) and applied the minimal fix (stdlib logging). But the SDK revert is flying blind without test coverage.

import logging

logger = get_logger(__name__)
logger = logging.getLogger(__name__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: You replaced openhands logger with stdlib logging, but where is logging.basicConfig() configured? Stdlib logging is silent by default - it needs handlers and formatters to actually output anything. Without configuration, all your log statements will go nowhere.

Check if there's a central initialization point, or add basic config in the main entry points.

@@ -1 +1 @@
Subproject commit bde715c12bce8fb112980529d5ad162f6b81a7f1
Subproject commit b498a69908f7d06feb3921ffe05ff7e781a6f108
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical: This SDK revert claims b498a699 is a "known-good baseline" but you provide ZERO evidence it won't break functionality. What if code written after this commit depends on features introduced in bde715c1?

Required before merge: Run the full test suite against this reverted SDK and show passing results. Or at minimum, explain what changed between these commits and why reverting is safe.

Comment on lines +143 to +169
- name: "Preflight: prune cache and verify BuildKit disk"
run: |
set -euo pipefail
KEEP_GB=60
echo "Pruning BuildKit cache (target max-storage ${KEEP_GB} GiB, no filters)..."
if ! docker buildx prune --all --force --max-storage ${KEEP_GB}g; then
docker buildx prune --all --force --keep-storage ${KEEP_GB}g || true
fi

if df -B1 /var/lib/buildkit > /tmp/buildkit_df 2>/dev/null; then
LINE=$(tail -n1 /tmp/buildkit_df)
TOTAL=$(echo "$LINE" | awk '{print $2}')
USED=$(echo "$LINE" | awk '{print $3}')
FREE=$(echo "$LINE" | awk '{print $4}')
if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then
PCT=$(( 100 * USED / TOTAL ))
echo "BuildKit disk: used ${USED} / ${TOTAL} bytes (${PCT}%); free ${FREE} bytes"
MIN=$((75 * 1024 * 1024 * 1024))
if [ "$FREE" -lt "$MIN" ]; then
echo "::error::Not enough free space on /var/lib/buildkit (${FREE} bytes free, need >= ${MIN})"
exit 1
fi
else
echo "Warning: unable to parse df output for /var/lib/buildkit"
fi
else
echo "Warning: /var/lib/buildkit not found; skipping disk check"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: This disk check has nested conditionals and multiple fallback strategies. It works, but consider extracting to a script file for clarity.

That said, for one-off infrastructure tooling, this level of inline complexity is acceptable - not worth blocking the PR over.

@simonrosenberg simonrosenberg force-pushed the fix/504-build-regression-sdk-revert-and-logging branch 2 times, most recently from 7807fef to c00ae42 Compare March 12, 2026 06:34
Three-part fix for the 500-image build regression introduced in #456:

1. Pin SDK submodule to 30819566 (proven in 34-min 500-image build)
   - The regression between bde715c1 and this SHA has been fixed upstream
   - Images with this SHA already exist in the GHCR registry
   - Restores fast warm-cache builds (~34 minutes vs 5+ hours)

2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across
     ProcessPoolExecutor forks (deadlock fix)

3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step

Fixes #504
Refs #502, #503

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg force-pushed the fix/504-build-regression-sdk-revert-and-logging branch from c00ae42 to 7780c58 Compare March 12, 2026 06:37
The root cause of the 500-image build regression (#504) is that
`uv build --sdist` runs once PER IMAGE (~15s × 500 = ~2 hours).
The sdist is identical for all images (same SDK code, different
base images), so building it 500 times is pure waste.

This fix:
- Adds `_pre_build_sdist()` that builds the sdist ONCE before
  the batch loop in `build_all_images()`
- Sets `OPENHANDS_CACHED_SDIST` env var so forked worker processes
  inherit the cached path
- Adds `_build_with_cached_sdist()` that extracts the pre-built
  sdist and runs `docker buildx build` directly, bypassing the
  SDK's per-image `uv build --sdist`
- Falls back to the SDK's `build()` function if pre-build fails

This approach works for ANY future SDK version since it operates
entirely in the benchmarks repo. The SDK submodule SHA no longer
affects build duration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg changed the title fix: revert SDK to known-good baseline and fix 500-image build regression Ffix 500-image build regression Mar 12, 2026
@simonrosenberg simonrosenberg changed the title Ffix 500-image build regression fix: pre-build SDK sdist once to eliminate 500-image build regression (#504) Mar 12, 2026
Remove the temporary SDK pin to 30819566. With the sdist caching
fix, the build works with any SDK version. Updating to the latest
upstream (aa9df699) proves this and picks up recent SDK improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg pushed a commit that referenced this pull request Mar 12, 2026
Ported from PR #505 - explains why openhands.sdk.get_logger is not used
in build-related modules to avoid Rich console state deadlocks when
ProcessPoolExecutor forks worker processes.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg
Copy link
Collaborator Author

Closing this PR since it's a duplicated of #507

@simonrosenberg simonrosenberg deleted the fix/504-build-regression-sdk-revert-and-logging branch March 12, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. build-swt-bench Build SWT-Bench images based on SDK version on this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track root cause isolation and fix plan for the 500-image build regression after #456

2 participants