fix: Reuse pre-built SDK sdist to speed up image builds#499
fix: Reuse pre-built SDK sdist to speed up image builds#499juanmichelini wants to merge 4 commits intomainfrom
Conversation
Avoids running 'uv build' 500 times (once per image), saving ~15s per image and eliminating logging spam.
🚨 Evaluation OOM Failure ReportI tested this branch with two evaluation runs, and both failed with Out-Of-Memory (OOM) errors before any evaluations could start. Test Runs
OOM Details (Pod: eval-22921567775-qwen3-5-fl-vfrwq)Kubernetes Event: Memory Stats:
Timeline:
Last Captured Log Lines: The pod died very early - during initial package installation, before any evaluation logic ran. Code Changes AnalysisI compared this branch with main and found these key changes:
Theories on Root CauseMost Likely: The memory spike happens during evaluation initialization, possibly due to:
Evidence Supporting This:
Comparison with Working EvalsOther pods running on main branch:
These successfully run evaluations without OOM issues. Questions for Debugging
RecommendationThe build optimization looks good (benchmarks build succeeded), but something in this branch causes evaluation pods to OOM during initialization. I'd suggest:
Full investigation details available in these files:
|
|
@OpenHands please analyze why we got OOM twice and then update the PR to address the most likely cause. |
|
I'm on it! juanmichelini can track my progress at all-hands.dev |
Convert build_sdist_package() to a context manager that: 1. Creates temp directory for sdist build 2. Yields the path for use in builds 3. Ensures cleanup in finally block (shutil.rmtree) This addresses potential OOM issues by guaranteeing cleanup even if builds fail or are interrupted. Co-authored-by: openhands <openhands@all-hands.dev>
OOM Analysis & FixAnalysis SummaryI analyzed the OOM failures and the code changes in this PR. Here are my findings: Key Observations:
Code Path Analysis:
Identified Issue: cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
# ... no cleanupIf The FixI converted
@contextlib.contextmanager
def build_sdist_package():
cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
try:
# ... build sdist
yield sdist_path
finally:
# Clean up the temp directory
if cache_dir.exists():
logger.info(f"Cleaning up sdist cache: {cache_dir}")
shutil.rmtree(cache_dir, ignore_errors=True)Note on OOM Root CauseCaveat: While the cleanup fix is a definite improvement, the OOM during evaluation (which dies during
The cleanup fix ensures proper resource management during builds, but if OOMs persist after this fix, we should investigate other factors like:
Commit: |
Summary of Work CompletedAnalysisI analyzed the OOM failures reported in PR #499. Key findings:
Issue IdentifiedThe original Fix AppliedConverted
Verification
CaveatWhile this fix addresses a real resource leak, the OOM happening during Commit: |
Co-authored-by: openhands <openhands@all-hands.dev>
|
testing again here https://github.com/OpenHands/software-agent-sdk/actions/runs/22929111032 |
|
okay, test timedout when building the images (it got up to 411 images in 6 hours) |
Fix for Build Speed + OOM IssueI analyzed the failed build workflow and found:
The Real Issue: /tmp as tmpfsThe previous fix stored the sdist cache in The Fix (Commit
|
|
that version of tmp, still slows it down https://github.com/OpenHands/benchmarks/actions/runs/22958252581/job/66641112530 |
559a173 to
3f5078e
Compare
Changes MadeI replaced the previous OOM fix commits (2624c6a, 8d1ffc8, 559a173) with a single cleaner fix that: ProblemThe original sdist optimization in SolutionStore all temp files in
Why This Works Better Than Previous Fixes
TestingAll 14 tests in |
The sdist reuse optimization stores the pre-built sdist in /tmp by default. On Kubernetes pods, /tmp is often mounted as tmpfs (backed by RAM), which can cause OOM when building many images. This fix: - Adds cache_dir parameter to build_sdist_package() - Stores the sdist in build_dir/sdist-cache/ (disk-backed) instead of /tmp - Keeps the sdist around for the entire build process (no cleanup) The monkey patch (patch_uv_build) still intercepts 'uv build --sdist' calls and copies the pre-built sdist instead of rebuilding, saving ~15s per image. Co-authored-by: openhands <openhands@all-hands.dev>
3f5078e to
e46379b
Compare
Simplified Fix (e46379b)I simplified the fix based on your feedback. The previous version was too aggressive with TMPDIR changes. What ChangedThe ONLY change now is:
What's NOT Changed
Why This Should Work
Diff from ade0074- cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
+ sdist_cache_dir = build_dir / "sdist-cache"
+ with build_sdist_package(cache_dir=sdist_cache_dir) as sdist_path:This is a minimal change that should preserve the speedup while avoiding OOM. |
Root Cause Analysis:
The Fix:
This PR optimizes the build process by reusing the SDK :
This reduces the step from ~15s (noisy) to <1s (silent).
Why not change the SDK?
The SDK change that introduced is valid for standalone builds. However, for mass-build scenarios like benchmarks, we need this optimization without forcing changes upstream in the SDK repo. Monkeypatching is a pragmatic solution to inject this optimization locally within the benchmarks tooling.