
Track root cause isolation and fix plan for the 500-image build regression after #456 (#504)

@simonrosenberg


Summary

Issue #502 established that full 500-image builds regressed badly after 2bfcc6c / PR #456. That change bundled two things:

  • a useful benchmarks refactor for auto-detecting local Docker images
  • an SDK submodule bump from b498a699 to bde715c1

The observed regression is severe:

  • historical fast run: 20806796288 built 500 images in about 52 minutes and ended around 55 GiB BuildKit usage
  • historical slow run: 22694639473 reached about 599 GiB BuildKit usage by batch 19 and ran for ~5h49m before cancellation
  • fix attempt: PR #499 ("Reuse pre-built SDK sdist to speed up image builds", run 22929182421) reused a prebuilt sdist but was still slow and still grew past 550 GiB, so per-image sdist creation is a real cost but not the dominant one

Current working hypothesis:

  1. The primary regression is in the SDK build path introduced by the submodule bump, not in the benchmarks helper refactor itself.
  2. The slowdown is amplified by BuildKit prune/rebuild cycles once cold-cache runs blow past the runner disk budget.
  3. The Rich/fork logger issue is a real correctness bug, but it does not match the observed symptom of continuous, very slow progress.

This issue is to track a clean isolation plan and the fix strategy.


Why the previous fixes were only partial

  • #486 addressed fork/spawn behavior, but not the cache-footprint regression.
  • #487 addressed SDK logger initialization behavior, but not the cache-footprint regression.
  • #499 addressed repeated uv build --sdist, but evidence suggests that only removes a secondary cost.
  • #503 is the correct short-term operational rollback, but it does not isolate the root cause.

Plan

Phase 0: unblock builds

  1. Merge #503 to restore known-good behavior while the regression is investigated.

Phase 1: add lightweight instrumentation

Add temporary env-gated timing and disk-usage logging around the hot path so we stop inferring from noisy historical runs alone:

  • SDK _make_build_context() duration
  • SDK docker buildx build duration
  • prune duration and pre/post docker buildx du in the benchmarks batch loop

This should go into the exact historical branches / SHAs under test so the numbers are comparable.
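The env-gated timing hook could look roughly like the sketch below. This is illustrative only: the env var name (`BENCH_TIMING`), the log format, and the decorator name are assumptions, not the repo's actual instrumentation.

```python
import functools
import os
import time


def timed(label):
    """Log wall-clock duration for a hot-path call when BENCH_TIMING=1.

    Hypothetical helper; the env-var gate keeps the instrumentation a
    no-op on normal runs so it is safe to land on historical branches.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if os.environ.get("BENCH_TIMING") != "1":
                return fn(*args, **kwargs)
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                print(f"[bench-timing] {label}: {elapsed:.1f}s", flush=True)
        return wrapper
    return decorator


@timed("_make_build_context")
def _make_build_context():
    ...  # existing SDK logic would go here
```

The same wrapper would apply to the `docker buildx build` subprocess call and the prune step in the batch loop, with `docker buildx du` captured before and after each prune.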

Phase 2: run a reduced isolation matrix

Use isolated image namespaces for every run so cache-from / cache-to state does not bleed between experiments.

Run these cells:

  • A. pre-2bfcc6c benchmarks + old SDK b498a699 (cold only)
  • B. 2bfcc6c code path + old SDK b498a699 (cold only)
  • C. pre-2bfcc6c benchmarks + new SDK bde715c1 (cold + warm)
  • D. full 2bfcc6c + new SDK bde715c1 (cold + warm)

Why this trimmed matrix:

  • historical evidence already shows the old SDK can warm successfully, so we should not spend two extra 500-image runs proving A/B warm again
  • the key open question is whether the new SDK can warm successfully at all, or whether prune pressure keeps it effectively cold between batches

These first runs are independent and should be dispatched in parallel once isolated namespaces are in place.
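Namespace isolation could be as simple as a per-cell prefix baked into every image and cache tag. The function and prefix format below are illustrative, not the actual workflow's naming scheme:

```python
import re
import uuid


def isolated_namespace(cell: str) -> str:
    """Build a unique per-experiment image namespace so cache-from /
    cache-to state cannot bleed between matrix cells.

    Hypothetical helper: the 'bench-' prefix and 8-hex-digit run id
    are placeholders for whatever the dispatch workflow chooses.
    """
    run_id = uuid.uuid4().hex[:8]
    slug = re.sub(r"[^a-z0-9-]", "-", cell.lower())
    return f"bench-{slug}-{run_id}"


# Each matrix cell gets its own registry prefix, so A-D never share cache:
for cell in ["A", "B", "C", "D"]:
    print(isolated_namespace(cell))
```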

Phase 3: bisect the SDK hot path with real historical checkpoints

If Phase 2 points to the SDK bump, test these exact SDK refs against the same benchmarks commit and workflow inputs:

  1. b498a699 (baseline)
  2. 35d75e3a (cache-tag truncation change in _base_slug())
  3. 97731fe5 (Python 3.12 change)
  4. 5d65f389 (--extra boto3)
  5. bde715c1 (final bundled SDK ref)

Why include 35d75e3a:

  • _base_slug() changed from splitting on the first _tag_ to the last _tag_
  • for some long SWE-bench image names, that changes the actual buildcache-* tag values
  • this can create partial cache misses independently of the Python 3.12 / boto3 changes
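The first-vs-last split difference can be shown with a minimal sketch. The real `_base_slug()` implementation is not reproduced here; the two functions and the example image name below are illustrative of the described behavior change:

```python
def slug_first(name: str) -> str:
    # pre-35d75e3a behavior (illustrative): split on the FIRST "_tag_"
    return name.split("_tag_", 1)[0]


def slug_last(name: str) -> str:
    # post-35d75e3a behavior (illustrative): split on the LAST "_tag_"
    return name.rsplit("_tag_", 1)[0]


# A long SWE-bench-style name that happens to contain "_tag_" twice:
name = "sweb.eval.x86_64.django__django-12345_tag_v1_tag_latest"
print(slug_first(name))  # sweb.eval.x86_64.django__django-12345
print(slug_last(name))   # sweb.eval.x86_64.django__django-12345_tag_v1
```

For any name containing "_tag_" more than once, the two versions emit different buildcache-* tags, which is exactly the partial-miss pattern described above.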

This phase should tell us whether the main jump comes from:

  • cache tag mismatch for a subset of images
  • Python 3.12 base-image / layer changes
  • --extra boto3 layer growth
  • or the combination

Phase 4: fix the real regression and the correctness bug

  1. Reintroduce the benchmarks helper refactor separately from any SDK bump.
  2. Replace early from openhands.sdk import get_logger imports in benchmarks utility modules with stdlib logging so workers do not inherit SDK/Rich logger state before ProcessPoolExecutor forks.
  3. If the SDK Dockerfile / dependency profile is confirmed as the main regression, fix that in the SDK build path rather than only tuning prune settings. Likely directions:
     • make benchmark-oriented source-minimal builds use a lighter dependency profile
     • avoid installing extras such as boto3 unless they are required for that target
     • scope Python-version workarounds to the targets that actually need them
  4. Keep the useful sdist-reuse optimization from #499, but treat it as a follow-up optimization after the primary regression is fixed.
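The logging fix in step 2 amounts to the change sketched below: a module-level stdlib logger instead of an SDK import at import time, so forked workers never inherit Rich console state. The module shown is a generic stand-in for the benchmarks utility modules:

```python
# Before (illustrative): importing the SDK logger at module scope pulls
# Rich console state into the parent process, which every
# ProcessPoolExecutor fork then inherits.
#
#   from openhands.sdk import get_logger
#   logger = get_logger(__name__)

# After: plain stdlib logging, safe to import before fork.
import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger.info("benchmarks utility module using stdlib logging")
```

Worker processes that need richer logging can configure it after the fork/spawn boundary instead of at module import time.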

Phase 5: make cold-cache runs survivable

Cold-cache events will happen again whenever the SDK Dockerfile changes materially. Even if the SDK build profile is improved, the infrastructure should tolerate those transitions.

Treat this as co-equal with reducing SDK layer footprint, not as a minor operational footnote.

Candidate changes:

  • use a larger-disk runner or higher BUILDKIT_PRUNE_KEEP_GB (for example 150-200) for the first seed run after major SDK Dockerfile changes
  • reduce max-workers for intentionally cold-cache runs to limit concurrent disk pressure
  • optionally run a smaller warm-up build before the full 500-image build after an SDK Dockerfile change
  • add timeout-minutes and explicit prune timing logs so pathological runs fail faster and are easier to diagnose
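The first two candidate changes could be driven by a single "cold seed" flag in the workflow, roughly as sketched here. The default of 55 GB, the 175 GB seed budget, and the function name are illustrative placeholders, not agreed values:

```python
import os


def prune_keep_gb(cold_seed: bool) -> int:
    """Pick a BuildKit prune threshold in GB.

    Illustrative policy: raise the budget for the first cold seed run
    after a major SDK Dockerfile change, use the normal budget otherwise.
    The env var mirrors the BUILDKIT_PRUNE_KEEP_GB knob mentioned above.
    """
    default = int(os.environ.get("BUILDKIT_PRUNE_KEEP_GB", "55"))
    return max(default, 175) if cold_seed else default


print(prune_keep_gb(cold_seed=True))
print(prune_keep_gb(cold_seed=False))
```

A matching reduction of max-workers for cold-seed runs would live in the same dispatch path, so one input toggles both behaviors.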

Suggested execution order

  1. Confirm #503 restores build health.
  2. Dispatch the reduced Phase 2 matrix with isolated namespaces.
  3. If the SDK bump is implicated, run the five-checkpoint SDK bisection.
  4. Land the low-risk correctness fix for stdlib logging in benchmarks utilities.
  5. Land the helper refactor without the SDK bump.
  6. Land the SDK-side fix for the benchmark build profile and/or cold-cache seed strategy.

Exit criteria

We should consider this fixed when we can show, with pinned refs and isolated namespaces, that:

  • a 500-image cold-cache run completes without pathological prune/rebuild behavior
  • a corresponding warm-cache run stays fast and within a stable disk envelope
  • the benchmarks refactor can be carried without reintroducing the regression
  • SDK logger initialization is no longer happening in utility modules before process fork
