Harden benchmark workflow: retry builds, proactive clean, robust monitoring #1170
sbryngelson merged 3 commits into MFlowCode:master
…toring
- Wrap bench builds in nick-fields/retry with 3 attempts and automatic ./mfc.sh clean between retries
- Add proactive ./mfc.sh clean at start of all build scripts to prevent cross-compiler contamination from stale artifacts on persistent runners
- Improve monitor_slurm_job.sh with better state detection and heartbeats
- Add concurrency group to prevent duplicate bench runs per branch
- Reduce timeout from 1400 to 480 minutes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough
The pull request enhances SLURM job monitoring with state-driven polling and cleanup logic, refactors the benchmark workflow to use pull request triggers with retry-based build orchestration, and adds automatic job requeue support on preemption in the submission script.
Sequence Diagram(s)
sequenceDiagram
participant Script as monitor_slurm_job.sh
participant SLURM as SLURM Scheduler
participant FileSystem as Output File
Script->>SLURM: get_job_state(job_id)
SLURM-->>Script: squeue query
alt squeue succeeds
SLURM-->>Script: job state
else squeue fails
SLURM-->>Script: sacct fallback
end
loop Until Terminal State
Script->>SLURM: query job state
SLURM-->>Script: PENDING/RUNNING/CONFIGURING
Script->>FileSystem: tail output file (with timeout)
FileSystem-->>Script: latest output lines
Script->>Script: check is_terminal_state()
end
alt Terminal State Reached
Script->>FileSystem: wait for output quiescence
FileSystem-->>Script: output stabilized
Script->>Script: set monitor_success = 1
else Output File Never Created
Script->>SLURM: cancel job (cleanup)
SLURM-->>Script: job cancelled
end
sequenceDiagram
participant GitHub as GitHub Actions
participant Workflow as Benchmark Workflow
participant RetryMechanism as nick-fields/retry
participant BuildSystem as Build Executor
GitHub->>Workflow: trigger on pull_request event
Workflow->>Workflow: evaluate consolidated gating
alt Gate conditions met
Workflow->>RetryMechanism: invoke retry wrapper
loop Retry Attempts
RetryMechanism->>BuildSystem: execute parallel builds
BuildSystem-->>RetryMechanism: build result
alt Build fails and retries remain
RetryMechanism->>BuildSystem: cleanup worktrees
RetryMechanism->>BuildSystem: retry build
else Build succeeds
RetryMechanism-->>Workflow: success
end
end
else Gate conditions not met
Workflow->>Workflow: skip job execution
end
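The squeue-then-sacct fallback shown in the first diagram above could look roughly like the sketch below. The function name `get_job_state` appears in the review discussion, but the body, the specific `squeue`/`sacct` options, and the `UNKNOWN` fallback value are assumptions for illustration, not the script's actual contents.

```shell
#!/usr/bin/env bash
# Sketch only: query the live queue first, then fall back to accounting data.
get_job_state() {
    local job_id="$1" state

    # squeue only knows about jobs that are still in the queue.
    state=$(squeue -j "$job_id" -h -o '%T' 2>/dev/null | head -n1 | tr -d ' ')

    if [ -z "$state" ]; then
        # sacct can still report jobs that have left the queue.
        # Keep the first word so "CANCELLED by 1234" becomes "CANCELLED".
        state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d' ' -f1)
    fi

    # UNKNOWN is an assumed placeholder for "neither tool answered".
    echo "${state:-UNKNOWN}"
}
```

Treating an empty answer as `UNKNOWN` rather than as a terminal state lets the caller keep polling through transient scheduler outages.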
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Pull request overview
Hardens the CI benchmark workflow and cluster-side scripts to be more resilient on persistent/self-hosted runners and SLURM systems, reducing flaky benchmark runs and improving observability.
Changes:
- Add proactive `./mfc.sh clean` and simplify build scripts for Frontier/Frontier AMD and Phoenix bench runs.
- Update `bench.yml` triggers/authorization logic, add a concurrency group, wrap builds in `nick-fields/retry`, and reduce workflow timeout.
- Improve `.github/scripts/monitor_slurm_job.sh` with more robust SLURM state polling, heartbeats, and cleanup behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `.github/workflows/phoenix/submit-bench.sh` | Enables SLURM `--requeue` for Phoenix bench submissions. |
| `.github/workflows/phoenix/bench.sh` | Adds an upfront clean to avoid stale artifacts on persistent runners. |
| `.github/workflows/frontier/build.sh` | Adds `set -e`, proactive clean, and removes inline retry loop (now handled by workflow). |
| `.github/workflows/frontier_amd/build.sh` | Same as Frontier CCE script adjustments (clean + `set -e` + simplified build). |
| `.github/workflows/bench.yml` | Changes workflow triggers/conditions, adds concurrency grouping, wraps builds in retry, and adjusts timeouts. |
| `.github/scripts/monitor_slurm_job.sh` | Adds job-state helpers, better waiting logic, and improved streaming/heartbeat/cleanup. |
1 issue found across 6 files
Confidence score: 4/5
- This PR looks safe to merge; the only concern is a low-severity workflow inefficiency rather than a functional bug.
- In `.github/workflows/bench.yml`, using `${{ github.event_name }}` in the concurrency group can allow duplicate benchmark runs for the same ref, which may waste CI resources but shouldn't affect product behavior.
- Pay close attention to `.github/workflows/bench.yml`: concurrency grouping may not cancel overlapping benchmark runs.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/bench.yml">
<violation number="1" location=".github/workflows/bench.yml:10">
P2: Including `${{ github.event_name }}` in the concurrency group prevents `pull_request` and `pull_request_review` runs from canceling each other for the same ref, so duplicate benchmark runs can still happen. Use a single group key per ref to ensure only one run per branch is active.</violation>
</file>
Actionable comments posted: 2
🧹 Nitpick comments (3)
.github/workflows/frontier_amd/build.sh (1)
26-32: Use `=` instead of `==` inside `[ ]` for POSIX portability.
`==` inside `[ ]` is a bash extension; the portable and idiomatic form is `=`. While this script uses bash, it's a trivial correctness improvement.
♻️ Proposed fix

    -if [ "$run_bench" == "bench" ]; then
    +if [ "$run_bench" = "bench" ]; then

.github/workflows/frontier/build.sh (1)
26-32: Same `==` in `[ ]` as in `frontier_amd/build.sh`; use `=` for portability.
♻️ Proposed fix

    -if [ "$run_bench" == "bench" ]; then
    +if [ "$run_bench" = "bench" ]; then

.github/workflows/phoenix/submit-bench.sh (1)
47-47: Verify `--requeue` interaction with `monitor_slurm_job.sh` output-file tracking.
The state transitions `PREEMPTED → REQUEUED → PENDING → RUNNING` are handled correctly: `get_job_state()` treats all three as non-terminal, so the monitor stays alive. However, there is a real vulnerability: `tail -f` follows by inode (line 110), and if SLURM truncates the output file on requeue (the default behavior without `--open-mode=append`), the in-flight tail process will lose its inode reference and miss all new content from the requeued run. Additionally, if the CI runner is killed while the job is in `REQUEUED` or `PENDING` state, `monitor_slurm_job.sh`'s `cleanup()` (lines 14–16) will call `scancel`, cancelling the requeued job.
Document the expected SLURM output-file behavior for requeued jobs on Phoenix (whether `--open-mode=append` is needed or already configured), and confirm whether the `scancel`-on-abnormal-exit behavior is intentional for requeued jobs.
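If the output file does turn out to be truncated on requeue, one option raised in the comment above is to pass `--open-mode=append` alongside `--requeue`. A minimal sketch of how the submission line might be assembled follows; the helper name `build_sbatch_cmd`, the job script name, and the output pattern are hypothetical, since the real `submit-bench.sh` is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: print the sbatch command line instead of submitting anything.
build_sbatch_cmd() {
    local script="$1"
    # --requeue and --open-mode=append are real sbatch options;
    # the output file pattern below is an illustrative assumption.
    local args=(--requeue --open-mode=append --output="bench-%j.out")
    echo "sbatch ${args[*]} $script"
}

build_sbatch_cmd bench-job.sh
```

Printing the command first makes it easy to confirm the flags before changing what is actually submitted on the cluster.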
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/monitor_slurm_job.sh:
- Around line 59-66: The is_terminal_state() function currently omits the
PREEMPTED SLURM state, causing preempted jobs to be treated as non-terminal and
hang; update the case in is_terminal_state to include PREEMPTED alongside
COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE
so it returns 0 for PREEMPTED (ensuring the initial wait loop and main
monitoring loop detect it as terminal and exit appropriately with an error).
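A minimal sketch of the suggested change, assuming `is_terminal_state()` is a plain `case` over the state string. The state list is the one quoted in the comment above plus `PREEMPTED`; the actual function in the script may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: returns 0 for states after which the job will not run again.
is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE|PREEMPTED)
            return 0 ;;
        *)
            # PENDING, RUNNING, CONFIGURING, etc. keep the monitor polling.
            return 1 ;;
    esac
}

for s in RUNNING PREEMPTED; do
    if is_terminal_state "$s"; then echo "$s: terminal"; else echo "$s: still active"; fi
done
```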
In @.github/workflows/bench.yml:
- Around line 110-118: The current parallel-run uses "wait $pid1 && wait $pid2",
which short-circuits and can leave the master build orphaned; change the logic
that launches the two background builds (the lines that set pid1 and pid2 and
call matrix.build_script) to always wait for both PIDs unconditionally by
calling wait for each PID separately, capture each exit status (e.g., rc1 from
pid1 and rc2 from pid2), and then exit with failure if either rc1 or rc2 is
non-zero; ensure the on_retry_command cleaning step (./mfc.sh clean in the
master directory) only runs after both waits complete so it cannot race with a
still-running master build.
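The unconditional-wait pattern described above can be sketched as follows. The names `pid1`, `pid2`, `rc1`, and `rc2` follow the comment; the `run_both` helper and the stand-in commands are hypothetical and only make the sketch runnable outside the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: the two arguments stand in for the two parallel builds.
run_both() {
    bash -c "$1" &
    local pid1=$!
    bash -c "$2" &
    local pid2=$!

    # Wait for each PID separately so a failure of the first build
    # does not skip waiting for the second one.
    local rc1=0 rc2=0
    wait "$pid1" || rc1=$?
    wait "$pid2" || rc2=$?

    echo "rc1=$rc1 rc2=$rc2"
    # Succeed only if both builds succeeded.
    [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]
}

run_both "exit 3" "sleep 1" || echo "at least one build failed"
```

With `wait $pid1 && wait $pid2`, a failure of the first build would return immediately while the second is still running, which is the race with the retry cleanup that the comment describes.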
---
Nitpick comments:
In @.github/workflows/frontier_amd/build.sh:
- Around line 26-32: The shell conditional in build.sh uses the non-portable
test operator `==` to compare the variable run_bench; update the condition in
the `if [ "$run_bench" == "bench" ]` check to use the POSIX-compatible `=`
operator instead so the `if` branch (the loop invoking ./mfc.sh run ...) becomes
portable across /bin/sh implementations.
In @.github/workflows/frontier/build.sh:
- Around line 26-32: The shell conditional uses the non-portable operator `==`
in the test expression; change the conditional in the `if [ "$run_bench" ==
"bench" ];` line to use the portable `=` operator (i.e., `if [ "$run_bench" =
"bench" ];`), keeping the variable `run_bench` quoted and leaving the rest of
the block (the `for dir in benchmarks/*/; ...` and `else` branch invoking
`./mfc.sh`) unchanged.
In @.github/workflows/phoenix/submit-bench.sh:
- Line 47: Add documentation and a confirmation check about SLURM output-file
behavior and scancel-on-exit in the submit-bench/monitor workflow: update the
submit-bench.sh or repository CI docs to state whether Phoenix config (or our
job submission flags) uses --open-mode=append for requeued runs so tail -f
(referenced in monitor_slurm_job.sh around the tail -f at line ~110) won't lose
the file inode on REQUEUED → RUNNING transitions; if not, state that we must add
--open-mode=append to sbatch invocation in submit-bench.sh. Also clarify and
confirm whether monitor_slurm_job.sh's cleanup() (lines ~14–16) intentionally
calls scancel on abnormal CI exits while a job is REQUEUED/PENDING, and if that
behavior is undesired, document that we should avoid scancel on cleanup for
non-terminal states or add a guard that checks get_job_state() before
cancelling. Ensure references to get_job_state(), tail -f, and cleanup() are
included so reviewers can locate the affected code.
…ry timeout
- Add PREEMPTED and REVOKED to monitor_slurm_job.sh terminal states so preempted jobs don't hang the monitor loop indefinitely
- Wait for both build PIDs unconditionally to prevent orphaned processes racing with on_retry_command clean
- Drop event_name from concurrency group so PR and review events for the same branch properly cancel each other
- Reduce retry timeout to 150min so retries have room within the 480min job timeout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1170   +/-   ##
=======================================
  Coverage   44.05%   44.05%
=======================================
  Files          70       70
  Lines       20498    20498
  Branches     1990     1990
=======================================
  Hits         9030     9030
  Misses      10329    10329
  Partials     1139     1139
Resolve conflicts with MFlowCode#1148 (build caching):
- frontier/build.sh, frontier_amd/build.sh: take upstream's cache + retry logic (proactive clean would defeat caching)
- bench.yml: keep our pull_request trigger model (upstream's workflow_run Get PR Info step doesn't apply)
- phoenix/bench.sh: remove proactive clean (unnecessary overhead for fresh checkouts, and would break caching)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/scripts/monitor_slurm_job.sh (1)
195-198: ⚠️ Potential issue | 🔴 Critical
Bug: `ExitCode` regex also matches `DerivedExitCode`, causing false failures.
`scontrol show job` output contains both `ExitCode=X:Y` and `DerivedExitCode=X:Y`. The pattern `ExitCode=[0-9]+:[0-9]+` is a substring of `DerivedExitCode=…`, so `grep -oE` emits two matches. After `cut`, `exit_code` becomes a two-line string ("0:0\n0:0"), which never equals "0:0" on line 217, making every successful job report as failed.
🐛 Proposed fix: take only the first match

     scontrol_output=$(scontrol show job "$job_id" 2>/dev/null || echo "")
     if [ -n "$scontrol_output" ]; then
    -    exit_code=$(echo "$scontrol_output" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | cut -d= -f2 || echo "")
    +    exit_code=$(echo "$scontrol_output" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2 || echo "")
     fi
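The double match is easy to reproduce outside SLURM. The sketch below runs both pipelines against a hand-written sample; the sample text is a hypothetical, heavily shortened stand-in for real `scontrol show job` output.

```shell
#!/usr/bin/env bash
# Hypothetical sample: real scontrol output has many more fields.
sample='JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0'

# Original pipeline: the regex also matches inside "DerivedExitCode=...".
buggy=$(echo "$sample" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | cut -d= -f2)

# Proposed fix: keep only the first match.
fixed=$(echo "$sample" | grep -oE 'ExitCode=[0-9]+:[0-9]+' | head -n1 | cut -d= -f2)

echo "buggy has $(printf '%s\n' "$buggy" | wc -l | tr -d ' ') line(s)"
echo "fixed is: $fixed"

# The success check compares against the literal string "0:0".
[ "$buggy" = "0:0" ] || echo "buggy value fails the 0:0 comparison"
[ "$fixed" = "0:0" ] && echo "fixed value passes the 0:0 comparison"
```

Taking the first match relies on `ExitCode` appearing before `DerivedExitCode` in the output, which is an assumption about field order rather than something the pattern itself guarantees.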
🧹 Nitpick comments (1)
.github/workflows/bench.yml (1)
31-32: Gating condition is comprehensive but very dense; consider a trailing comment.
The multi-clause `if` correctly restricts self-hosted runner execution to (a) approved reviews, (b) PRs by trusted authors, or (c) manual dispatch. It reads correctly, but a brief inline YAML comment summarizing the intent would help future maintainers parse it faster.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In @.github/scripts/monitor_slurm_job.sh:
- Around line 195-198: The ExitCode extraction from scontrol_output is matching
both ExitCode and DerivedExitCode, producing multiple lines in exit_code; update
the extraction pipeline that sets exit_code (the grep/cut sequence that reads
scontrol_output) to only take the first match (for example use grep -m 1 or pipe
through head -n 1 after grep) so exit_code becomes a single "X:Y" string; keep
the variable names scontrol_output and exit_code and ensure subsequent
comparison logic still expects a single-line "0:0" value.
---
Duplicate comments:
In @.github/scripts/monitor_slurm_job.sh:
- Around line 58-66: The review note is a duplicate—there is no code change
required because is_terminal_state() already includes PREEMPTED and REVOKED;
resolve this by removing the duplicate review comment or marking it resolved in
the PR so no further action is expected on the is_terminal_state function.
---
Nitpick comments:
In @.github/workflows/bench.yml:
- Around line 31-32: Add a short trailing YAML comment to the long `if:`
expression that summarizes its intent (e.g., "run on MFlowCode/MFC when changes
detected and either PR approved, PR by trusted authors, or manual dispatch") so
future maintainers can quickly understand the gating; locate the multi-clause
`if: ${{ github.repository=='MFlowCode/MFC' &&
needs.file-changes.outputs.checkall=='true' &&
((github.event_name=='pull_request_review' &&
github.event.review.state=='approved') || (github.event_name=='pull_request' &&
(github.event.pull_request.user.login=='sbryngelson' ||
github.event.pull_request.user.login=='wilfonba')) ||
github.event_name=='workflow_dispatch') }}` and append a concise comment to that
line.
User description
Summary
- Wrap bench builds in `nick-fields/retry` with 3 attempts and automatic `./mfc.sh clean` between retries
- Add proactive `./mfc.sh clean` at start of all build scripts to prevent cross-compiler contamination from stale artifacts on persistent runners
- Improve `monitor_slurm_job.sh` with better state detection and heartbeats
Test plan
🤖 Generated with Claude Code
CodeAnt-AI Description
Harden benchmark CI: cancel orphaned cluster jobs, robust job monitoring, and retry builds with proactive clean
Impact
✅ Fewer orphaned SLURM jobs
✅ Fewer CI failures from stale build artifacts
✅ Fewer duplicate benchmark runs