Revert "feat: auto-detect pre-built Docker images across all benchmarks (#456)" #503

Open
juanmichelini wants to merge 2 commits into main from revert-2bfcc6c-image-build-speedup

@juanmichelini
Collaborator

This PR reverts commit 2bfcc6c.

Context

The auto-detect pre-built Docker images feature introduced in #456 is causing slow image builds and timeouts, preventing full benchmark runs from completing.

Changes

This is a clean revert of commit 2bfcc6c. The revert:

  • Removes the create_docker_workspace() and ensure_local_image() helpers from benchmarks/utils/image_utils.py and benchmarks/utils/build_utils.py
  • Restores direct DockerDevWorkspace usage in commit0, gaia, and swtbench
  • Restores previous image building logic in swebench, swebenchmultimodal, multiswebench, and swefficiency
  • Removes the test file tests/test_image_utils.py
  • Reverts SDK submodule and test config changes
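
For readers who have not looked at #456, the general idea behind auto-detecting a pre-built image is to check for an existing image before falling back to a build. The sketch below illustrates only that general pattern. It is not the code being reverted: `resolve_image` and its four callables are hypothetical names, and the real `create_docker_workspace()` and `ensure_local_image()` helpers may work differently.

```python
from typing import Callable


def resolve_image(
    tag: str,
    local_exists: Callable[[str], bool],
    remote_exists: Callable[[str], bool],
    pull: Callable[[str], None],
    build: Callable[[str], None],
) -> str:
    """Return how the image was obtained: 'local', 'pulled', or 'built'.

    All callables are injected (hypothetical) so the decision logic can be
    exercised without a Docker daemon or registry.
    """
    if local_exists(tag):
        # Cheapest path: the image is already on this machine.
        return "local"
    if remote_exists(tag):
        # A pre-built image is in the registry, so pull instead of building.
        pull(tag)
        return "pulled"
    # Nothing pre-built was found, so fall back to a full build.
    build(tag)
    return "built"
```

Each extra existence check adds work per image, so across many images the detection step itself can become a cost worth measuring.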

Testing

No automated tests are included, per request. Manual testing will be performed to verify that image build performance is restored.

Fixes #502

References

…ks (#456)"

This reverts commit 2bfcc6c.

The auto-detect feature introduced in #456 is causing slow image builds
and timeouts. This revert restores the previous image building behavior
and, with it, benchmark build performance.

Fixes #502

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini
Collaborator Author

juanmichelini commented Mar 11, 2026

juanmichelini marked this pull request as draft March 11, 2026 22:54
juanmichelini marked this pull request as ready for review March 11, 2026 22:54
@juanmichelini
Collaborator Author

@OpenHands can you double check we reverted the commit correctly?
SWE-bench is failing https://github.com/OpenHands/benchmarks/actions/runs/22978397474/job/66712426521

@openhands-ai

openhands-ai bot commented Mar 11, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

The revert of 2bfcc6c correctly removed the functions added in that commit,
but didn't account for the subsequent rename of image_exists to remote_image_exists
in commit 92efb47 (#471).

Since we're reverting to the pre-2bfcc6c state, the function should be called
image_exists, not remote_image_exists.

Changes:
- benchmarks/swebench/build_images.py: import and use image_exists
- benchmarks/gaia/build_images.py: import and use image_exists
- benchmarks/swtbench/build_eval_env_images.py: use alias (as it was pre-2bfcc6c)

This fixes the ImportError that was causing SWE-bench builds to fail.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Mar 11, 2026

Final Summary

I've successfully identified and fixed the incomplete revert in PR #503.

Problem Found

The revert of commit 2bfcc6c was incomplete. While it correctly removed the functions added in that commit (local_image_exists, create_docker_workspace), it didn't account for commit 92efb47 (#471), which came after 2bfcc6c and renamed image_exists → remote_image_exists. This caused an ImportError in the SWE-bench build:

ImportError: cannot import name 'remote_image_exists' from 'benchmarks.utils.image_utils'

Changes Made

Updated 3 files to restore the pre-2bfcc6c function name:

  • benchmarks/swebench/build_images.py: Changed remote_image_exists → image_exists
  • benchmarks/gaia/build_images.py: Changed remote_image_exists → image_exists
  • benchmarks/swtbench/build_eval_env_images.py: Restored the alias pattern image_exists as remote_image_exists
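
The failure mode and the alias pattern can be reproduced in isolation. The sketch below uses a throwaway stub module in place of benchmarks/utils/image_utils.py. The module name `image_utils_stub` and the body of `image_exists` are placeholders, since the real function's signature and behavior are not shown in this PR.

```python
import sys
import types

# Stand-in for the utility module in its pre-2bfcc6c state: it exposes
# image_exists and has no remote_image_exists. The name is hypothetical.
stub = types.ModuleType("image_utils_stub")


def image_exists(image: str) -> bool:
    """Placeholder check; not the real implementation."""
    return image.startswith("ghcr.io/")


stub.image_exists = image_exists
sys.modules["image_utils_stub"] = stub

# Importing the post-#471 name fails, matching the error seen in CI.
try:
    from image_utils_stub import remote_image_exists  # noqa: F401
    import_failed = False
    error_message = ""
except ImportError as exc:
    import_failed = True
    error_message = str(exc)

# The alias pattern imports the old name but binds it to the new one,
# so call sites that use remote_image_exists(...) need no changes.
from image_utils_stub import image_exists as remote_image_exists
```

The alias keeps the diff small in a file whose call sites already use the newer name, while a direct rename at each call site (as done in the other two files) leaves no trace of the newer name.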

Request Checklist

✅ Verified the revert was incomplete and identified the root cause
✅ Fixed the ImportError causing SWE-bench builds to fail
✅ Changes are minimal and focused (only import/usage updates)
✅ Committed with detailed explanation and co-author attribution
✅ Pushed to PR #503 branch revert-2bfcc6c-image-build-speedup

The PR should now pass the SWE-bench build step. The revert is now complete and consistent with the pre-2bfcc6c codebase state.

simonrosenberg added a commit that referenced this pull request Mar 12, 2026
Three-part fix for the 500-image build regression introduced in #456:

1. Revert SDK submodule from bde715c1 to b498a699 (known-good)
   - Eliminates cache tag mismatch that forced full rebuilds of all 500 images
   - Removes boto3 extra and Python 3.12 changes that bloated layers

2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across
     ProcessPoolExecutor forks (deadlock fix)

3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step

Fixes #504
Refs #502, #503

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
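
Item 2 of the commit message above, swapping an SDK logger for stdlib logging in modules that run under ProcessPoolExecutor, can be sketched roughly as follows. This is an illustration under assumptions, not the code from that commit: `build_one` and `build_all` are hypothetical stand-ins for the real build functions.

```python
import logging
from concurrent.futures import ProcessPoolExecutor

# Module-level stdlib logger. Creating it sets up no console or other
# rich-rendering state that a forked worker process would inherit.
logger = logging.getLogger(__name__)


def build_one(image: str) -> str:
    """Hypothetical stand-in for building a single image."""
    logger.info("building %s", image)
    return f"{image}:built"


def build_all(images: list[str], max_workers: int = 2) -> list[str]:
    """Run build_one for each image in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(build_one, images))
```

When run as a script, `build_all(["img-a", "img-b"])` should be called under an `if __name__ == "__main__":` guard, with `logging.basicConfig(level=logging.INFO)` configured there if log output is wanted.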
simonrosenberg added a commit that referenced this pull request Mar 12, 2026
Three-part fix for the 500-image build regression introduced in #456:

1. Pin SDK submodule to 30819566 (proven in 34-min 500-image build)
   - The regression between bde715c1 and this SHA has been fixed upstream
   - Images with this SHA already exist in the GHCR registry
   - Restores fast warm-cache builds (~34 minutes vs 5+ hours)

2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across
     ProcessPoolExecutor forks (deadlock fix)

3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step

Fixes #504
Refs #502, #503

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Revert commit 2bfcc6c to speed up image builds
