Revert "feat: auto-detect pre-built Docker images across all benchmarks (#456)" #503
juanmichelini wants to merge 2 commits into main
testing here too https://github.com/OpenHands/benchmarks/actions/runs/22976430169
@OpenHands can you double check we reverted the commit correctly?
I'm on it! juanmichelini can track my progress at all-hands.dev
The revert of 2bfcc6c correctly removed the functions added in that commit, but didn't account for the subsequent rename of image_exists to remote_image_exists in commit 92efb47 (#471). Since we're reverting to the pre-2bfcc6c state, the function should be called image_exists, not remote_image_exists.
Changes:
- benchmarks/swebench/build_images.py: import and use image_exists
- benchmarks/gaia/build_images.py: import and use image_exists
- benchmarks/swtbench/build_eval_env_images.py: use alias (as it was pre-2bfcc6c)
This fixes the ImportError that was causing SWE-bench builds to fail.
Co-authored-by: openhands <openhands@all-hands.dev>
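The name mismatch behind that ImportError can be illustrated with a minimal sketch. The commit itself simply switched the imports back to image_exists; the fallback lookup below is a different technique, shown only to make the failure mode concrete. The resolve_image_check helper and both stub modules are hypothetical stand-ins, not the real benchmarks/utils code.

```python
import types
from typing import Callable


def resolve_image_check(module: types.ModuleType) -> Callable[[str], bool]:
    """Return the image-existence helper under whichever name the module exposes.

    Hypothetical helper: prefers the pre-rename name `image_exists` and falls
    back to the post-rename name `remote_image_exists`.
    """
    for name in ("image_exists", "remote_image_exists"):
        fn = getattr(module, name, None)
        if callable(fn):
            return fn
    # Mirrors the failure described above: the caller asks for a name
    # that the module no longer (or does not yet) define.
    raise ImportError(
        f"{module.__name__} exposes neither image_exists nor remote_image_exists"
    )


# Stand-in for a utils module in its pre-rename state (assumption).
old_utils = types.ModuleType("old_utils")
old_utils.image_exists = lambda tag: tag.endswith(":cached")

# Stand-in for the same module after the rename (assumption).
new_utils = types.ModuleType("new_utils")
new_utils.remote_image_exists = lambda tag: tag.endswith(":cached")

check_old = resolve_image_check(old_utils)
check_new = resolve_image_check(new_utils)
```

Either stub resolves to a working callable, while a module that defines neither name raises ImportError at lookup time, which is the same class of error the SWE-bench build hit.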
Final Summary
I've successfully identified and fixed the incomplete revert in PR #503.
Problem Found
The revert of commit 2bfcc6c was incomplete. While it correctly removed the functions added in that commit (
Changes Made
Updated 3 files to restore the pre-2bfcc6c function name:
Request Checklist
✅ Verified the revert was incomplete and identified the root cause
The PR should now pass the SWE-bench build step. The revert is now complete and consistent with the pre-2bfcc6c codebase state.
Three-part fix for the 500-image build regression introduced in #456:
1. Revert SDK submodule from bde715c1 to b498a699 (known-good)
   - Eliminates cache tag mismatch that forced full rebuilds of all 500 images
   - Removes boto3 extra and Python 3.12 changes that bloated layers
2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across ProcessPoolExecutor forks (deadlock fix)
3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step
Fixes #504
Refs #502, #503
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
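The preflight disk check mentioned in item 3 could look roughly like the sketch below. It uses only the standard library; the function names and the 50 GiB threshold are illustrative assumptions and are not taken from the actual workflow.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = ".") -> float:
    """Return free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def has_enough_disk(path: str = ".", min_free_gib: float = 50.0) -> bool:
    """Report whether at least `min_free_gib` GiB are free.

    The default threshold is a placeholder (assumption), not a value
    from the CI configuration.
    """
    return free_gib(path) >= min_free_gib
```

A build script might call has_enough_disk() before starting and abort or prune caches when it returns False.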
Three-part fix for the 500-image build regression introduced in #456:
1. Pin SDK submodule to 30819566 (proven in 34-min 500-image build)
   - The regression between bde715c1 and this SHA has been fixed upstream
   - Images with this SHA already exist in the GHCR registry
   - Restores fast warm-cache builds (~34 minutes vs 5+ hours)
2. Replace openhands.sdk.get_logger with stdlib logging in build modules
   - build_utils.py, buildx_utils.py, image_utils.py, build_images.py
   - Prevents Rich console state from being inherited across ProcessPoolExecutor forks (deadlock fix)
3. Add cold-cache survivability improvements to CI workflows
   - timeout-minutes: 180 on both swebench and swtbench build jobs
   - Post-build disk/timing instrumentation for observability
   - Preflight BuildKit prune + disk check for swtbench (was missing)
   - BUILDKIT_RESET_ON_FAILURE for swtbench build step
Fixes #504
Refs #502, #503
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
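Item 2 of the commit message (stdlib logging in modules that fan out through ProcessPoolExecutor) can be sketched as follows. This is a minimal illustration that takes the commit's explanation of the deadlock at face value; build_one, build_all, and init_worker_logging are hypothetical names, and no real image build is performed.

```python
import logging
from concurrent.futures import ProcessPoolExecutor

# Module-level stdlib logger: it carries no console or rendering state
# of its own, so there is nothing Rich-related for a worker to inherit.
logger = logging.getLogger(__name__)


def init_worker_logging(level: int = logging.INFO) -> None:
    """Configure plain stream logging; intended to run once per worker process."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(processName)s %(levelname)s %(message)s",
    )


def build_one(image: str) -> str:
    """Hypothetical stand-in for a single image build; returns a fake tag."""
    logger.info("building %s", image)
    return f"{image}:built"


def build_all(images: list[str], max_workers: int = 2) -> list[str]:
    """Run build_one across worker processes, configuring logging in each."""
    with ProcessPoolExecutor(
        max_workers=max_workers, initializer=init_worker_logging
    ) as pool:
        return list(pool.map(build_one, images))
```

When running this as a script, call build_all(...) under an `if __name__ == "__main__":` guard so that start methods which re-import the main module do not re-enter the pool.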
This PR reverts commit 2bfcc6c.
Context
The auto-detect pre-built Docker images feature introduced in #456 is causing slow image builds and timeouts, preventing full benchmark runs from completing.
Changes
This is a clean revert of commit 2bfcc6c, which:
- create_docker_workspace() and ensure_local_image() helpers from benchmarks/utils/image_utils.py and benchmarks/utils/build_utils.py
- DockerDevWorkspace usage in commit0, gaia, and swtbench
- tests/test_image_utils.py
Testing
No automated tests included per request. Manual testing will be performed to verify image build performance is restored.
Fixes #502
References