Add test sharding, proactive clean, and retry logic for self-hosted CI#1171
sbryngelson wants to merge 2 commits into MFlowCode:master
Conversation
- Shard Frontier GPU tests into 2 parts for faster parallel execution
- Add proactive `./mfc.sh clean` in Phoenix test scripts to prevent cross-compiler contamination from stale build artifacts
- Add `--requeue` to Phoenix SLURM jobs for preemption recovery
- Add lint-gate job that must pass before self-hosted tests run
- Add retry logic for GitHub runner tests (retry <=5 failures)
- Add Frontier AMD test support with dedicated submit/test scripts
- Restructure self-hosted matrix with explicit cluster names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI is reviewing your PR.
📝 Walkthrough

Adds test sharding and retry orchestration to CI, updates SLURM job directives (accounts, time, partition, QOS, requeue), introduces shard propagation through submission/test scripts, adds build cleanup, and extends the test CLI with a shard option and shard-aware filtering and failure reporting.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions\n(Workflow)"
    participant Runner as "Actions Runner\n(job matrix)"
    participant Submit as "submit.sh\n(cluster-specific)"
    participant SLURM as "SLURM (sbatch)"
    participant Node as "Compute Node\n(mfc.sh)"
    participant Test as "mfc test\n(toolchain test.py)"
    participant GHArtifacts as "GH workspace\n(tests/failed_uuids.txt)"
    GH->>Runner: start job (includes shard)
    Runner->>Submit: run submit.sh (pass shard)
    Submit->>SLURM: sbatch (includes SBATCH directives, --requeue where set)
    SLURM->>Node: allocate node & run job
    Node->>Test: invoke mfc.sh / mfc test --shard (if provided)
    Test->>GHArtifacts: write tests/failed_uuids.txt if failures
    GH->>GHArtifacts: check tests/failed_uuids.txt
    alt failures ≤ threshold
        GH->>Runner: rerun only failed UUIDs (retry flow)
    else too many failures
        GH->>GH: mark job failed
    end
```
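The retry gate at the bottom of the diagram can be sketched as a small decision function. This is a hypothetical illustration, not the workflow's actual implementation; the file name and the 5-failure threshold come from the PR description, while `plan_retry` and its return convention are invented here.

```python
import os

MAX_RETRY_FAILURES = 5  # per the PR: retry only when <= 5 tests failed


def plan_retry(failed_uuids_file="tests/failed_uuids.txt"):
    """Return the UUIDs to rerun, [] if nothing failed, or None to fail the job."""
    if not os.path.exists(failed_uuids_file):
        return []  # no failure file was written: nothing to retry
    with open(failed_uuids_file) as f:
        uuids = [line.strip() for line in f if line.strip()]
    if len(uuids) > MAX_RETRY_FAILURES:
        return None  # too many failures: mark the job failed instead of retrying
    return uuids
```

A workflow step would then rerun only the returned UUIDs, or fail outright when `None` comes back.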
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Nitpicks 🔍
CodeAnt AI finished reviewing your PR.
Pull request overview
This PR enhances the self-hosted CI infrastructure with test sharding, proactive cleanup, and retry mechanisms to improve reliability and reduce execution time. It addresses cross-compiler contamination issues on persistent runners and enables faster parallel test execution on batch partition systems.
Changes:
- Add retry logic for GitHub runner tests (≤5 failures trigger automatic retest)
- Shard Frontier GPU tests into 2 parallel jobs for faster execution
- Add proactive `./mfc.sh clean` to Phoenix test scripts
- Add `--requeue` flag to Phoenix SLURM jobs for preemption recovery
- Wrap Frontier build steps in retry action with automatic cleanup
- Update Frontier SLURM configuration (account, partition, timeout, QOS)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/test.yml | Add retry logic for ≤5 test failures, add shard parameter to matrix, wrap builds in retry action, remove deprecated environment variables |
| .github/workflows/phoenix/test.sh | Add proactive ./mfc.sh clean to prevent cross-compiler contamination |
| .github/workflows/phoenix/submit.sh | Add --requeue flag for automatic preemption recovery |
| .github/workflows/frontier/test.sh | Add shard parameter handling for test splitting |
| .github/workflows/frontier/submit.sh | Update SLURM config (account, partition, timeout, QOS) and add shard parameter |
| .github/workflows/frontier_amd/test.sh | Add shard parameter handling for test splitting |
| .github/workflows/frontier_amd/submit.sh | Update SLURM config (account, partition, timeout, QOS) and add shard parameter |
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/test.yml (1)
265-274: ⚠️ Potential issue | 🟡 Minor: Log file references don't account for shard and will break if `job_slug` is fixed.

`test-${{ matrix.device }}-${{ matrix.interface }}.out` on line 267 assumes the output filename doesn't include a shard suffix. This is currently consistent with the submit scripts, but if the `job_slug` collision (flagged on `frontier_amd/submit.sh`) is fixed by incorporating the shard, these references must be updated in tandem.

Also, the artifact `name` on line 273 doesn't include the shard, which could cause upload conflicts for sharded matrix entries with the same device/interface (e.g., two `gpu-acc` frontier shards). `strategy.job-index` makes it unique, but adding the shard would improve clarity.

Proposed fix (apply after fixing job_slug in submit scripts)
```diff
   - name: Print Logs
     if: always()
-    run: cat test-${{ matrix.device }}-${{ matrix.interface }}.out
+    run: cat test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out

   - name: Archive Logs
     uses: actions/upload-artifact@v4
     if: matrix.cluster != 'phoenix'
     with:
-      name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}
+      name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}
-      path: test-${{ matrix.device }}-${{ matrix.interface }}.out
+      path: test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out
```

Note: the shard value contains `/` (e.g., `1/2`), which is invalid in filenames. The submit script slug sanitization would need to handle this (e.g., replace `/` with `-of-`), and the workflow expressions here would need to match.
/(e.g.,1/2) which is invalid in filenames. The submit script slug sanitization would need to handle this (e.g., replace/with-of-), and the workflow expressions here would need to match.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test.yml around lines 265-274: Update the Print Logs and Archive Logs steps so the logfile and artifact name include the shard-aware slug used by the submit scripts (instead of assuming `test-${{ matrix.device }}-${{ matrix.interface }}.out`). Locate the "Print Logs" and "Archive Logs" steps and change the referenced filename and artifact name to incorporate the sanitized job slug/shard token produced by the submit scripts (the same slug that replaces "/" with a safe separator such as "-of-"); ensure the workflow expressions that build the filename and the artifact "name" use that sanitized slug so filenames and artifact names remain unique and valid across sharded jobs.

.github/workflows/frontier_amd/submit.sh (1)
31-32: ⚠️ Potential issue | 🔴 Critical: Job slug does not include shard, so SLURM output files collide when sharded tests run concurrently.

When multiple shards for the same `device`/`interface` pair run on the same HPC cluster, they produce identical `job_slug` values (e.g., `test-gpu-acc` for both shard `1/2` and `2/2`), resulting in identical `output_file` names. Since both SLURM jobs execute from the same `SLURM_SUBMIT_DIR`, one job's output will silently overwrite the other's. This affects both `.github/workflows/frontier/submit.sh` and `.github/workflows/frontier_amd/submit.sh` at line 31.

Incorporate the shard into the slug:
Proposed fix
```diff
-job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3"
+shard_suffix=""
+if [ -n "$4" ]; then
+    shard_suffix="-$(echo "$4" | sed 's|/|-of-|')"
+fi
+job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3${shard_suffix}"
```

Additionally, update `.github/workflows/test.yml` lines 267 and 273 to account for the shard suffix:

- Line 267: `cat test-${{ matrix.device }}-${{ matrix.interface }}.out` → `cat test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' | replace('/', '-of-') }}.out`
- Line 273: include the shard suffix in the artifact name to match

The usage messages in both scripts (line 9) should also be updated to document the `interface` and `shard` parameters.
Verify each finding against the current code and only fix it if needed. In @.github/workflows/frontier_amd/submit.sh around lines 31 - 32, The job_slug currently built by job_slug and used for output_file omits the shard, causing name collisions; update the job_slug generation (the job_slug variable and any references to output_file) to append the shard identifier (formatting the shard like "-{shard}" and replacing "/" with "-of-" for values like "1/2") so each shard produces a unique slug; also update the script usage message (the usage text near the top that lists parameters) to document the interface and shard parameters, and update the workflow steps that read and upload artifacts (the cat command that reads test-${matrix.device}-${matrix.interface}.out and the artifact name) to include the same shard suffix formatting so artifact names and printed output match the new job_slug convention.
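For concreteness, the sanitization the review asks for can be sketched in Python even though the submit scripts are shell. This is a hypothetical illustration: `shard_suffix` and `job_slug` are invented names that mirror the shell slug logic quoted above, not functions in the repository.

```python
import os
import re


def shard_suffix(shard: str) -> str:
    """Turn a shard spec like '1/2' into a filename-safe suffix like '-1-of-2'.

    An empty shard yields an empty suffix so unsharded jobs keep their old names.
    """
    if not shard:
        return ""
    return "-" + shard.replace("/", "-of-")


def job_slug(script: str, device: str, interface: str, shard: str = "") -> str:
    # Mirrors the shell slug: basename without '.sh', non-alphanumerics -> '-'
    base = re.sub(r"[^a-zA-Z0-9]", "-", os.path.basename(script).removesuffix(".sh"))
    return f"{base}-{device}-{interface}{shard_suffix(shard)}"
```

With this, shards `1/2` and `2/2` of the same device/interface produce distinct slugs, so their SLURM output files no longer collide.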
🧹 Nitpick comments (1)
.github/workflows/frontier_amd/submit.sh (1)
8-9: Usage message is outdated: it does not document the interface or shard arguments.

The script accepts up to 4 positional arguments (`$1`=script, `$2`=device, `$3`=interface, `$4`=shard), but the usage string only mentions the first two.

Proposed fix
```diff
 usage() {
-    echo "Usage: $0 [script.sh] [cpu|gpu]"
+    echo "Usage: $0 [script.sh] [cpu|gpu] [none|acc|omp] [shard]"
 }
```
Verify each finding against the current code and only fix it if needed. In @.github/workflows/frontier_amd/submit.sh around lines 8 - 9, The usage() function's message is outdated and only mentions two arguments; update the echo in usage() to document all supported positional params ($1 script, $2 device (cpu|gpu), $3 interface, $4 shard) and any defaults or optional markers (e.g., "[interface]" "[shard]") so callers see the full signature; edit the echo inside usage() to a single clear line listing script, device, interface, and shard and optional/default semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/frontier/submit.sh:
- Around line 31-32: job_slug and output_file are colliding for parallel shards because they only use basename("$1") with $2 and $3; update the job_slug generation (and derived output_file) to include an additional unique shard identifier (for example a shard index/ID passed as another script argument or a runtime value like the process/array task id) so each shard produces a distinct job_slug and output_file; change the construction that sets job_slug and the assignment of output_file to append that unique identifier.
1 issue found across 7 files
Confidence score: 4/5
- Moderate risk only: the cleanup step in `.github/workflows/phoenix/test.sh` doesn't check the `./mfc.sh clean` exit status, so failures could allow stale artifacts to affect builds/tests.
- This is a CI reliability concern rather than a direct product bug, so it's likely safe to merge with minimal risk.
- Pay close attention to `.github/workflows/phoenix/test.sh`: ensure cleanup failures don't silently proceed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/phoenix/test.sh">
<violation number="1" location=".github/workflows/phoenix/test.sh:5">
P2: The `./mfc.sh clean` exit status is not checked. If the clean fails, the script continues and may build/test against stale or corrupted artifacts, defeating the purpose of this proactive cleanup and causing hard-to-diagnose failures.</violation>
</file>
The CI test scripts use `--shard` for splitting Frontier GPU tests across multiple jobs, and `failed_uuids.txt` for retry logic. These toolchain changes were missing from the cherry-pick. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)
102-108: Shard filtering is correct; minor readability nit on line 104.

The validation logic handles all edge cases correctly (the short-circuit `or` ensures `int()` is never called on non-digit strings), and `i % shard_count == shard_idx - 1` correctly partitions cases without overlap. The placement after all other filters and before `--percent` is the right ordering.

Optional: the compound condition on line 104 can be split into guard clauses to improve readability:
♻️ Optional readability refactor
```diff
-        if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
-            raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
+        def _bad_shard():
+            if len(parts) != 2 or not all(p.isdigit() for p in parts):
+                return True
+            n, i = int(parts[1]), int(parts[0])
+            return n < 1 or not (1 <= i <= n)
+        if _bad_shard():
+            raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
```
Verify each finding against the current code and only fix it if needed. In `@toolchain/mfc/test/test.py` around lines 102 - 108, The compound validation in the ARG("shard") block is correct but hard to read; refactor the conditional inside the if ARG("shard") is not None: block by splitting the long compound condition into explicit guard checks: first split = ARG("shard").split("/") and verify length == 2, then check that both parts are digits (using parts[0].isdigit() and parts[1].isdigit()), then parse shard_idx = int(parts[0]) and shard_count = int(parts[1]) and validate shard_count >= 1 and 1 <= shard_idx <= shard_count; on any failure raise MFCException with the same message, then compute skipped_cases and cases using shard_idx and shard_count as before.
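A standalone sketch of the `i/n` parse-and-partition behavior described above may help. This is an illustration under stated assumptions: `filter_shard` is an invented helper, `ValueError` stands in for the toolchain's `MFCException`, and the round-robin partition matches the `i % shard_count == shard_idx - 1` rule quoted from the review.

```python
def filter_shard(cases: list, shard: str) -> list:
    """Keep only the cases belonging to shard 'i/n' (1-based), round-robin."""
    parts = shard.split("/")
    # Guard clauses instead of one compound condition, per the review's nit
    if len(parts) != 2 or not all(p.isdigit() for p in parts) \
            or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
        raise ValueError(f"Invalid --shard '{shard}': expected 'i/n' with 1 <= i <= n.")
    shard_idx, shard_count = int(parts[0]), int(parts[1])
    # Round-robin assignment: case i belongs to shard (i % n) + 1
    return [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
```

Every case lands in exactly one shard, so running all `n` shards covers the full suite with no overlap.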
```python
# Write failed UUIDs to file for CI retry logic
failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
if failed_tests:
    with open(failed_uuids_path, "w") as f:
        for test_info in failed_tests:
            f.write(test_info['uuid'] + "\n")
elif os.path.exists(failed_uuids_path):
    os.remove(failed_uuids_path)
```
Stale `failed_uuids.txt` when the early-abort path fires.

When `abort_tests.is_set()` causes `test()` to raise `MFCException` at lines 192-203, execution never reaches lines 217-224. A `failed_uuids.txt` left by a previous run is not cleaned up. Depending on how the CI workflow gates the retry step, it could retry stale UUIDs from the prior run rather than (or in addition to) the current failures.

Additionally, unhandled I/O errors (permissions, disk full) in `open()`/`os.remove()` would propagate past `exit(nFAIL)`, masking the real failure count in the process exit code.
🛡️ Suggested fix: clean stale file on abort + guard I/O

```diff
+    # Clean up any stale file from a previous run when aborting early
+    if abort_tests.is_set():
+        ...  # (existing abort exception block)
+        try:
+            if os.path.exists(failed_uuids_path := os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")):
+                os.remove(failed_uuids_path)
+        except OSError:
+            pass
+        raise MFCException(...)

     # Write failed UUIDs to file for CI retry logic
     failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
-    if failed_tests:
-        with open(failed_uuids_path, "w") as f:
-            for test_info in failed_tests:
-                f.write(test_info['uuid'] + "\n")
-    elif os.path.exists(failed_uuids_path):
-        os.remove(failed_uuids_path)
+    try:
+        if failed_tests:
+            with open(failed_uuids_path, "w") as f:
+                for test_info in failed_tests:
+                    f.write(test_info['uuid'] + "\n")
+        elif os.path.exists(failed_uuids_path):
+            os.remove(failed_uuids_path)
+    except OSError:
+        pass  # Non-fatal; CI retry logic may not fire but test results are unaffected
```
+ pass # Non-fatal; CI retry logic may not fire but test results are unaffected🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@toolchain/mfc/test/test.py` around lines 217 - 224, When abort_tests.is_set()
causes test() to raise MFCException the existing cleanup that writes/removes
failed_uuids.txt (the failed_uuids_path handling around failed_tests, open(...),
os.remove(...)) is skipped; modify the exception/exit path to always attempt to
remove stale failed_uuids_path and wrap file I/O (open and os.remove) in
try/except catching OSError (or Exception) so I/O errors are logged but do not
replace the real exit code—i.e., in the MFCException handler and/or finally
block ensure you try to delete failed_uuids_path if it exists and handle/log any
OSError from open()/os.remove() instead of letting it propagate.
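The guarded write/remove pattern the fix proposes can be shown self-contained. This is a minimal sketch, not the toolchain's code: `record_failed_uuids` is an invented name, and the warning print stands in for whatever logging the project uses.

```python
import os


def record_failed_uuids(failed_uuids, path):
    """Write failed UUIDs for CI retry, or remove a stale file; never raise."""
    try:
        if failed_uuids:
            with open(path, "w") as f:
                f.writelines(u + "\n" for u in failed_uuids)
        elif os.path.exists(path):
            # No failures this run: remove any stale file so CI doesn't
            # retry UUIDs left over from a previous run.
            os.remove(path)
    except OSError as e:
        # Non-fatal: the retry step may not fire, but the process exit
        # code still reflects the real failure count.
        print(f"warning: could not update {path}: {e}")
```

Because all I/O is inside the `try`, a permissions or disk-full error degrades to a warning instead of replacing the test suite's exit status.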
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #1171      +/-   ##
==========================================
- Coverage   44.07%   44.05%   -0.02%
==========================================
  Files          70       70
  Lines       20431    20498      +67
  Branches     1974     1990      +16
==========================================
+ Hits         9004     9030      +26
- Misses      10291    10329      +38
- Partials     1136     1139       +3
```

☔ View full report in Codecov by Sentry.
User description
Summary
- Add proactive `./mfc.sh clean` in Phoenix test scripts to prevent cross-compiler contamination from stale build artifacts
- Add `--requeue` to Phoenix SLURM jobs for preemption recovery

Depends on: #1170 (for `monitor_slurm_job.sh` and build script changes)

Test plan
🤖 Generated with Claude Code
CodeAnt-AI Description
Speed up and harden CI runs for Frontier/Frontier (AMD) and Phoenix clusters
What Changed
Impact
- ✅ Faster Frontier GPU test completion (parallel shards)
- ✅ Fewer Phoenix build/test failures due to stale artifacts
- ✅ Fewer whole-suite reruns for sporadic test failures (retries only failed tests)
Summary by CodeRabbit
New Features
Improvements