fix: pre-build SDK sdist once to eliminate 500-image build regression (#504)#505
fix: pre-build SDK sdist once to eliminate 500-image build regression (#504)#505simonrosenberg wants to merge 3 commits intomainfrom
Conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating
🟡 Acceptable - Pragmatic fix for real production problem, but needs evidence the SDK revert is safe.
Linus-Style Analysis
Core Philosophy Check:
✅ Solving a REAL problem (5hr builds → 60min), not theoretical
✅ Logging fix is the RIGHT solution - Rich console state DOES deadlock in forked workers
✅ Simple, focused changes - no over-engineering
[CRITICAL ISSUES]
SDK Revert Safety: You claim b498a699 is "known-good baseline" but provide ZERO evidence. What if code written after that commit depends on features from bde715c1? Show me test results or build logs proving this revert doesn't break functionality. This is "we don't break userspace" territory.
Logging Configuration: You replaced Rich logging with stdlib in 6 files, but I see no logging.basicConfig() call. Where is logging configured? Stdlib logging is silent by default - users will lose all log output unless something configures handlers and formatters.
[IMPROVEMENT OPPORTUNITIES]
CI Bash Complexity: The preflight disk check has nested if/else with multiple fallbacks. Acceptable for infrastructure tooling, but could be simpler.
[ACCEPTABLE PRAGMATISM]
🟢 No automated tests for logging changes - for identical import swaps across files, manual CI verification is reasonable given the build infrastructure focus.
🟢 CI instrumentation (timeouts, disk monitoring) - necessary observability for debugging build issues.
VERDICT:
KEY INSIGHT:
The logging fix demonstrates good engineering taste - you identified the root cause (fork-unsafe Rich state) and applied the minimal fix (stdlib logging). But the SDK revert is flying blind without test coverage.
| import logging | ||
|
|
||
| logger = get_logger(__name__) | ||
| logger = logging.getLogger(__name__) |
There was a problem hiding this comment.
🟠 Important: You replaced openhands logger with stdlib logging, but where is logging.basicConfig() configured? Stdlib logging is silent by default - it needs handlers and formatters to actually output anything. Without configuration, all your log statements will go nowhere.
Check if there's a central initialization point, or add basic config in the main entry points.
vendor/software-agent-sdk
Outdated
| @@ -1 +1 @@ | |||
| Subproject commit bde715c12bce8fb112980529d5ad162f6b81a7f1 | |||
| Subproject commit b498a69908f7d06feb3921ffe05ff7e781a6f108 | |||
There was a problem hiding this comment.
🔴 Critical: This SDK revert claims b498a699 is a "known-good baseline" but you provide ZERO evidence it won't break functionality. What if code written after this commit depends on features introduced in bde715c1?
Required before merge: Run the full test suite against this reverted SDK and show passing results. Or at minimum, explain what changed between these commits and why reverting is safe.
| - name: "Preflight: prune cache and verify BuildKit disk" | ||
| run: | | ||
| set -euo pipefail | ||
| KEEP_GB=60 | ||
| echo "Pruning BuildKit cache (target max-storage ${KEEP_GB} GiB, no filters)..." | ||
| if ! docker buildx prune --all --force --max-storage ${KEEP_GB}g; then | ||
| docker buildx prune --all --force --keep-storage ${KEEP_GB}g || true | ||
| fi | ||
|
|
||
| if df -B1 /var/lib/buildkit > /tmp/buildkit_df 2>/dev/null; then | ||
| LINE=$(tail -n1 /tmp/buildkit_df) | ||
| TOTAL=$(echo "$LINE" | awk '{print $2}') | ||
| USED=$(echo "$LINE" | awk '{print $3}') | ||
| FREE=$(echo "$LINE" | awk '{print $4}') | ||
| if [ -n "$TOTAL" ] && [ -n "$FREE" ]; then | ||
| PCT=$(( 100 * USED / TOTAL )) | ||
| echo "BuildKit disk: used ${USED} / ${TOTAL} bytes (${PCT}%); free ${FREE} bytes" | ||
| MIN=$((75 * 1024 * 1024 * 1024)) | ||
| if [ "$FREE" -lt "$MIN" ]; then | ||
| echo "::error::Not enough free space on /var/lib/buildkit (${FREE} bytes free, need >= ${MIN})" | ||
| exit 1 | ||
| fi | ||
| else | ||
| echo "Warning: unable to parse df output for /var/lib/buildkit" | ||
| fi | ||
| else | ||
| echo "Warning: /var/lib/buildkit not found; skipping disk check" |
There was a problem hiding this comment.
🟡 Suggestion: This disk check has nested conditionals and multiple fallback strategies. It works, but consider extracting to a script file for clarity.
That said, for one-off infrastructure tooling, this level of inline complexity is acceptable - not worth blocking the PR over.
7807fef to
c00ae42
Compare
Three-part fix for the 500-image build regression introduced in #456: 1. Pin SDK submodule to 30819566 (proven in 34-min 500-image build) - The regression between bde715c1 and this SHA has been fixed upstream - Images with this SHA already exist in the GHCR registry - Restores fast warm-cache builds (~34 minutes vs 5+ hours) 2. Replace openhands.sdk.get_logger with stdlib logging in build modules - build_utils.py, buildx_utils.py, image_utils.py, build_images.py - Prevents Rich console state from being inherited across ProcessPoolExecutor forks (deadlock fix) 3. Add cold-cache survivability improvements to CI workflows - timeout-minutes: 180 on both swebench and swtbench build jobs - Post-build disk/timing instrumentation for observability - Preflight BuildKit prune + disk check for swtbench (was missing) - BUILDKIT_RESET_ON_FAILURE for swtbench build step Fixes #504 Refs #502, #503 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c00ae42 to
7780c58
Compare
The root cause of the 500-image build regression (#504) is that `uv build --sdist` runs once PER IMAGE (~15s × 500 = ~2 hours). The sdist is identical for all images (same SDK code, different base images), so building it 500 times is pure waste. This fix: - Adds `_pre_build_sdist()` that builds the sdist ONCE before the batch loop in `build_all_images()` - Sets `OPENHANDS_CACHED_SDIST` env var so forked worker processes inherit the cached path - Adds `_build_with_cached_sdist()` that extracts the pre-built sdist and runs `docker buildx build` directly, bypassing the SDK's per-image `uv build --sdist` - Falls back to the SDK's `build()` function if pre-build fails This approach works for ANY future SDK version since it operates entirely in the benchmarks repo. The SDK submodule SHA no longer affects build duration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the temporary SDK pin to 30819566. With the sdist caching fix, the build works with any SDK version. Updating to the latest upstream (aa9df699) proves this and picks up recent SDK improvements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ported from PR #505 - explains why openhands.sdk.get_logger is not used in build-related modules to avoid Rich console state deadlocks when ProcessPoolExecutor forks worker processes. Co-authored-by: openhands <openhands@all-hands.dev>
|
Closing this PR since it's a duplicated of #507 |
Summary
uv build --sdistoverheadopenhands.sdk.get_loggerwith stdlibloggingin benchmarks utilities to prevent Rich/fork deadlocks inProcessPoolExecutorworkerstimeout-minutes, build timing instrumentation, preflight BuildKit pruneRoot Cause
The SDK's
_make_build_context()runsuv build --sdistfor every image build (~15s each). With 500 images, this adds ~2 hours of pure overhead — the sdist is identical for all images (same SDK code, different base images).Fix
build_all_images()now calls_pre_build_sdist()once before the batch loop. The cached sdist path is passed to workers viaOPENHANDS_CACHED_SDISTenv var._build_with_cached_sdist()extracts the pre-built tarball and runsdocker buildx builddirectly, bypassing the per-imageuv build --sdist. Falls back to the SDK's nativebuild()if pre-build fails.This works for any future SDK version since it builds from whatever submodule SHA is checked out.
Results
vs. the regression: 5h49m+ (cancelled at batch 19/34, 599 GiB BuildKit usage)
Evidence from CI logs:
Test plan
_pre_build_sdist()fails, per-imagebuild()still worksCloses #504
🤖 Generated with Claude Code