Summary
Issue #502 established that full 500-image builds regressed badly after 2bfcc6c / PR #456. That change bundled two things:
- a useful `benchmarks` refactor for auto-detecting local Docker images
- an SDK submodule bump from `b498a699` to `bde715c1`
The observed regression is severe:
- historical fast run: `20806796288` built 500 images in about 52 minutes and ended around 55 GiB of BuildKit usage
- historical slow run: `22694639473` reached about 599 GiB of BuildKit usage by batch 19 and ran for ~5h49m before cancellation
- PR fix "Reuse pre-built SDK sdist to speed up image builds" (#499, run `22929182421`) reused a prebuilt sdist but was still slow and still grew past 550 GiB, so per-image sdist creation is a real cost but not the dominant issue
Current working hypothesis:
- The primary regression is in the SDK build path introduced by the submodule bump, not in the `benchmarks` helper refactor itself.
- The slowdown is amplified by BuildKit prune/rebuild cycles once cold-cache runs blow past the runner disk budget.
- The Rich/fork logger issue is a real correctness bug, but it does not match the observed symptom of continuous, very slow progress.
This issue is to track a clean isolation plan and the fix strategy.
Related:
- Revert commit 2bfcc6c to speed up image builds #502
- Revert "feat: auto-detect pre-built Docker images across all benchmarks (#456)" #503
Why the previous fixes were only partial
- #486 addressed fork/spawn behavior, but not the cache-footprint regression.
- #487 addressed SDK logger initialization behavior, but not the cache-footprint regression.
- #499 addressed repeated `uv build --sdist`, but evidence suggests that only removes a secondary cost.
- #503 is the correct short-term operational rollback, but it does not isolate the root cause.
Plan
Phase 0: unblock builds
- Merge #503 to restore known-good behavior while the regression is investigated.
Phase 1: add lightweight instrumentation
Add temporary env-gated timing and disk-usage logging around the hot path so we stop inferring from noisy historical runs alone:
- SDK `_make_build_context()` duration
- SDK `docker buildx build` duration
- prune duration and pre/post `docker buildx du` in the `benchmarks` batch loop
This should go into the exact historical branches / SHAs under test so the numbers are comparable.
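The gating can be as small as a context manager plus a disk snapshot helper. This is a minimal sketch, not the repo's actual code: the `BENCH_BUILD_TIMING` env var name and the helper names are hypothetical placeholders for whatever gate the instrumentation PR uses.

```python
import os
import subprocess
import time
from contextlib import contextmanager

# Env-gated so instrumentation is a no-op unless explicitly enabled.
# BENCH_BUILD_TIMING is a hypothetical variable name for this sketch.
TIMING_ENABLED = os.environ.get("BENCH_BUILD_TIMING") == "1"


@contextmanager
def timed(label: str):
    """Log wall-clock duration of a block when instrumentation is enabled."""
    if not TIMING_ENABLED:
        yield
        return
    start = time.monotonic()
    try:
        yield
    finally:
        print(f"[timing] {label}: {time.monotonic() - start:.1f}s", flush=True)


def log_buildkit_usage(tag: str) -> None:
    """Snapshot BuildKit disk usage (best effort) around the prune step."""
    if not TIMING_ENABLED:
        return
    out = subprocess.run(
        ["docker", "buildx", "du"], capture_output=True, text=True
    ).stdout
    print(f"[disk] {tag}:\n{out}", flush=True)
```

Usage in the batch loop would wrap the existing calls, e.g. `log_buildkit_usage("pre-prune")`, then `with timed("prune"): ...`, then `log_buildkit_usage("post-prune")`, so the numbers line up directly with the historical runs.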
Phase 2: run a reduced isolation matrix
Use isolated image namespaces for every run so cache-from / cache-to state does not bleed between experiments.
Run these cells:
- A. pre-`2bfcc6c` `benchmarks` + old SDK `b498a699` (cold only)
- B. `2bfcc6c` code path + old SDK `b498a699` (cold only)
- C. pre-`2bfcc6c` `benchmarks` + new SDK `bde715c1` (cold + warm)
- D. full `2bfcc6c` + new SDK `bde715c1` (cold + warm)
Why this trimmed matrix:
- historical evidence already shows the old SDK can warm successfully, so we should not spend two extra 500-image runs proving A/B warm again
- the key open question is whether the new SDK can warm successfully at all, or whether prune pressure keeps it effectively cold between batches
These first runs are independent and should be dispatched in parallel once isolated namespaces are in place.
Phase 3: bisect the SDK hot path with real historical checkpoints
If Phase 2 points to the SDK bump, test these exact SDK refs against the same benchmarks commit and workflow inputs:
- `b498a699` (baseline)
- `35d75e3a` (cache-tag truncation change in `_base_slug()`)
- `97731fe5` (Python 3.12 change)
- `5d65f389` (`--extra boto3`)
- `bde715c1` (final bundled SDK ref)
Why include 35d75e3a:
- `_base_slug()` changed from splitting on the first `_tag_` to the last `_tag_`
- for some long SWE-bench image names, that changes the actual `buildcache-*` tag values
- this can create partial cache misses independently of the Python 3.12 / `boto3` changes
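The split-point difference above is enough to shift cache tags on its own. The snippet below reconstructs only that difference, not the real `_base_slug()` (which has more logic), and the example image name is made up to show the two-occurrence case:

```python
def base_slug_old(name: str) -> str:
    # old behavior: truncate at the FIRST "_tag_"
    return name.split("_tag_", 1)[0]


def base_slug_new(name: str) -> str:
    # new behavior: truncate at the LAST "_tag_"
    return name.rsplit("_tag_", 1)[0]


# A long SWE-bench-style name containing "_tag_" more than once
# (hypothetical example):
name = "sweb.eval.x86_64.django_tag_extras_tag_v1"
# old slug: "sweb.eval.x86_64.django"
# new slug: "sweb.eval.x86_64.django_tag_extras"
```

Names with a single `_tag_` are unaffected, which is why the miss would hit only a subset of images: exactly the partial-cache-miss signature this checkpoint is meant to detect.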
This phase should tell us whether the main jump comes from:
- cache tag mismatch for a subset of images
- Python 3.12 base-image / layer changes
- `--extra boto3` layer growth
- or the combination
Phase 4: fix the real regression and the correctness bug
- Reintroduce the `benchmarks` helper refactor separately from any SDK bump.
- Replace early `from openhands.sdk import get_logger` imports in `benchmarks` utility modules with stdlib logging so workers do not inherit SDK/Rich logger state before `ProcessPoolExecutor` forks.
- If the SDK Dockerfile / dependency profile is confirmed as the main regression, fix that in the SDK build path rather than only tuning prune settings. Likely directions:
  - make benchmark-oriented `source-minimal` builds use a lighter dependency profile
  - avoid installing extras such as `boto3` unless they are required for that target
  - scope Python-version workarounds to the targets that actually need them
- Keep the useful sdist-reuse optimization from #499, but treat it as a follow-up after the primary regression is fixed.
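The logging swap in the second bullet is mechanical. A sketch of the before/after shape in a `benchmarks` utility module (the worker-setup helper is a hypothetical addition, not existing repo code):

```python
# Before (problematic): importing the SDK logger at module import time
# pulls in SDK/Rich state that forked ProcessPoolExecutor workers inherit.
#
#   from openhands.sdk import get_logger
#   logger = get_logger(__name__)
#
# After: plain stdlib logging, safe to import before fork.
import logging

logger = logging.getLogger(__name__)


def configure_worker_logging(level: int = logging.INFO) -> None:
    """Optional per-worker setup: call inside the worker, never at import."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(processName)s %(name)s %(levelname)s %(message)s",
    )
```

The key property is that module import now touches only stdlib state, so fork order no longer matters; any Rich-aware configuration stays in the parent process or runs explicitly inside each worker.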
Phase 5: make cold-cache runs survivable
Cold-cache events will happen again whenever the SDK Dockerfile changes materially. Even if the SDK build profile is improved, the infrastructure should tolerate those transitions.
Treat this as co-equal with reducing SDK layer footprint, not as a minor operational footnote.
Candidate changes:
- use a larger-disk runner or a higher `BUILDKIT_PRUNE_KEEP_GB` (for example 150-200) for the first seed run after major SDK Dockerfile changes
- reduce `max-workers` for intentionally cold-cache runs to limit concurrent disk pressure
- optionally run a smaller warm-up build before the full 500-image build after an SDK Dockerfile change
- add `timeout-minutes` and explicit prune timing logs so pathological runs fail faster and are easier to diagnose
Suggested execution order
- Confirm #503 restores build health.
- Dispatch the reduced Phase 2 matrix with isolated namespaces.
- If the SDK bump is implicated, run the five-checkpoint SDK bisection.
- Land the low-risk correctness fix for stdlib logging in `benchmarks` utilities.
- Land the helper refactor without the SDK bump.
- Land the SDK-side fix for the benchmark build profile and/or cold-cache seed strategy.
Exit criteria
We should consider this fixed when we can show, with pinned refs and isolated namespaces, that:
- a 500-image cold-cache run completes without pathological prune/rebuild behavior
- a corresponding warm-cache run stays fast and within a stable disk envelope
- the `benchmarks` refactor can be carried without reintroducing the regression
- SDK logger initialization no longer happens in utility modules before process fork