SWE-Bench image builds stuck for 5+ hours, blocking evaluation jobs #476

@juanmichelini

Problem

Three GitHub Actions builds for SWE-Bench images have been stuck for 5+ hours on the "Build and push SWE-Bench images" step, blocking evaluation jobs from running.

Affected Builds

All three builds are for SDK commit 217b218272499aa5b17335d82627e090b67cf9aa (Update orjson to 3.11.7):

Run ID       Status       Elapsed (reported)  Actual Time Stuck
22627544120  In Progress  13s                 5+ hours
22627606528  In Progress  12s                 5+ hours
22627616885  In Progress  12s                 5+ hours

Impact

Evaluation pods blocked: Multiple evaluation pods have been waiting 5+ hours for the build to complete:

  • eval-22627492274-gemini-3-1-p9tlk - swebench (5h22m waiting)
  • eval-22627517186-glm-5-k5xn4 - swebench (5h21m waiting)
  • eval-22627517186-qwen3-5-fl-n9vvw - swebench (5h21m waiting)

Logs show continuous polling:

[2026-03-03 19:52:47 UTC] Benchmarks build run 22627544120: status=in_progress, conclusion=None
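The polling above never gives up, so the evaluation pods wait indefinitely. A minimal sketch of a bounded poll loop follows, assuming the poller can be changed. The names `wait_for_build`, `BuildStuckError`, and `fetch_status` are hypothetical and not taken from the existing code; only the `in_progress` status value comes from the log line above, and `completed` is the standard GitHub Actions terminal status.

```python
import time
from typing import Callable, Optional, Tuple


class BuildStuckError(RuntimeError):
    """Raised when a build has not completed before the deadline."""


def wait_for_build(
    fetch_status: Callable[[], Tuple[str, Optional[str]]],
    max_wait_s: float,
    poll_interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the run completes, or raise once max_wait_s has elapsed.

    fetch_status is a hypothetical callable returning (status, conclusion),
    mirroring the fields printed in the polling log.
    """
    deadline = clock() + max_wait_s
    while True:
        status, conclusion = fetch_status()
        if status == "completed":
            return conclusion
        if clock() >= deadline:
            raise BuildStuckError(
                f"build still {status!r} after {max_wait_s:.0f}s; giving up"
            )
        sleep(poll_interval_s)
```

One possible `fetch_status` would shell out to `gh run view <run-id> --json status,conclusion` and parse the JSON; that wiring is left out here because the current poller's implementation is not shown in this issue.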

Expected Behavior

SWE-Bench image builds normally complete in 5-40 minutes, and even the longest recent build finished in under 70 minutes:

  • Recent successful builds: 5m40s, 8m8s, 10m28s, 24m10s, 40m38s
  • Longest recent build: 1h9m20s
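To turn these durations into a concrete "stuck" threshold, one option is to parse them and flag anything well beyond the longest recent build. This is only a sketch: `parse_duration` is a hypothetical helper, and the 2x multiplier is an assumption rather than an agreed policy.

```python
import re

# Matches strings such as "5m40s", "1h9m20s", or "45s".
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_duration(text: str) -> int:
    """Convert a duration like '1h9m20s' into seconds."""
    match = _DURATION_RE.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Durations quoted in this issue.
recent_builds = ["5m40s", "8m8s", "10m28s", "24m10s", "40m38s", "1h9m20s"]

longest_s = max(parse_duration(d) for d in recent_builds)
# Assumption: treat a build as stuck once it exceeds twice the longest recent run.
stuck_threshold_s = 2 * longest_s

print(f"longest recent build: {longest_s}s, stuck threshold: {stuck_threshold_s}s")
```

With the values above, the 5+ hour runs are far past the threshold, so a check like this would have flagged them long before the evaluation pods had waited five hours.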

Build Details

Workflow: build-swebench-images.yml

Stuck step:

- name: Build and push SWE-Bench images
  run: |
    # Double quotes so the shell expands the variables; single quotes would pass them literally.
    uv run benchmarks/swebench/build_images.py \
      --dataset "${DATASET}" \
      --split "${SPLIT}" \
      --image ghcr.io/openhands/eval-agent-server \
      --push \
      --max-workers "${MAX_WORKERS}" \
      --max-retries "${MAX_RETRIES}"

Job status shows:

* Build and push SWE-Bench images (still running after 5 hours)
* Archive build logs (queued)
* Upload build logs (queued)
* Display build summary (queued)

Possible Causes

  1. GitHub Actions runner hung/deadlocked - all 3 builds stuck suggests runner/infrastructure issues
  2. Docker BuildKit deadlock - parallel image building with --max-workers may have deadlocked
  3. Network/registry timeout - pushing to ghcr.io may be timing out silently
  4. Resource exhaustion - runner out of disk space or memory during build
  5. Silent failure - build process crashed but runner didn't detect failure
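Causes 2, 3, and 5 all look the same from the outside: a child process that never returns. A hedged sketch of how a wrapper could convert such a hang into an explicit failure is shown below. `run_with_timeout` is a hypothetical helper, and nothing here reflects how build_images.py currently invokes Docker.

```python
import subprocess
import sys
from typing import Sequence

# Exit code conventionally used by the `timeout` utility; chosen here for familiarity.
TIMEOUT_EXIT_CODE = 124


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> int:
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it hangs.

    subprocess.run kills the direct child when the timeout expires.
    """
    try:
        completed = subprocess.run(list(cmd), timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        print(f"command timed out after {timeout_s}s: {' '.join(cmd)}", file=sys.stderr)
        return TIMEOUT_EXIT_CODE
    return completed.returncode
```

One caveat: only the direct child is killed, so work handed off to a daemon (for example a Docker build running in the Docker daemon) may need separate cleanup.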

Investigation Needed

  • Check GitHub Actions runner logs (requires admin access)
  • Verify Docker BuildKit is not deadlocked
  • Check ghcr.io push operations for timeouts
  • Review disk space and memory usage on runners
  • Check if SDK commit 217b2182 introduced any dependency issues
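Because GitHub-hosted runners cannot be inspected after the fact, the disk-space check would likely have to be logged from inside the build itself. A minimal sketch, assuming a check like this is called before and during image builds; `check_free_space` and the threshold value are hypothetical.

```python
import shutil
from typing import Tuple

GIB = 2**30


def check_free_space(path: str, min_free_gib: float) -> Tuple[bool, str]:
    """Report free disk space at path and whether it meets the minimum."""
    usage = shutil.disk_usage(path)
    free_gib = usage.free / GIB
    total_gib = usage.total / GIB
    ok = free_gib >= min_free_gib
    message = (
        f"{path}: {free_gib:.1f} GiB free of {total_gib:.1f} GiB "
        f"(minimum {min_free_gib:.1f} GiB)"
    )
    return ok, message


if __name__ == "__main__":
    # Assumption: 10 GiB is an illustrative threshold, not a measured requirement.
    ok, message = check_free_space(".", min_free_gib=10.0)
    print(("OK   " if ok else "LOW  ") + message)
```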

Recommended Actions

  1. Immediate: Cancel stuck builds:

    gh run cancel 22627544120
    gh run cancel 22627606528
    gh run cancel 22627616885
  2. Short-term: Re-trigger build for SDK commit 217b2182 and monitor closely

  3. Long-term:

    • Add a timeout to the build step (e.g., timeout-minutes: 90, which leaves headroom over the longest recent build of 1h9m20s)
    • Add progress logging to build_images.py
    • Add health checks during long-running builds
    • Consider build step telemetry/monitoring
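As an illustration of the progress-logging item, the sketch below logs one line per finished image from a worker pool, so a stall becomes visible as a gap in the log. The internals of build_images.py are not shown in this issue, so `build_all` and `build_one` are hypothetical stand-ins and not the script's actual structure.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

log = logging.getLogger("build_progress")


def build_all(
    images: Iterable[str],
    build_one: Callable[[str], None],
    max_workers: int,
) -> Dict[str, str]:
    """Build images in parallel and log progress as each one finishes."""
    names = list(images)
    results: Dict[str, str] = {}
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(build_one, name): name for name in names}
        for done, future in enumerate(as_completed(futures), start=1):
            name = futures[future]
            try:
                future.result()
                results[name] = "ok"
            except Exception as exc:  # record the failure and keep going
                results[name] = f"failed: {exc}"
            log.info(
                "[%d/%d] %s -> %s (%.0fs elapsed)",
                done, len(names), name, results[name], time.monotonic() - start,
            )
    return results
```

This only makes stalls visible; it does not interrupt a hung worker, so it would complement a step-level timeout and not replace it.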

Environment

  • SDK Commit: 217b2182 (Update orjson to 3.11.7 to address CVE-2025-67221)
  • Workflow: build-swebench-images.yml
  • Runner: GitHub-hosted (specific runner unknown due to stuck status)
  • First stuck: ~2026-03-03 14:30 UTC
  • Affected evaluations: swebench benchmark

Additional Context

This is blocking the investigation of evaluation issue #287, where we are trying to determine whether recent SDK changes caused OOM issues. The stuck builds prevent us from running controlled tests with the latest SDK commit.

Labels: bug (Something isn't working)
