Restructure CI workflows to enable parallel GPU job execution #364
base: main
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
.github/GPU_CONCURRENCY.md
Outdated
- Test B using GPUs 2,3
- Test C using GPUs 4,5,6,7

These would run one after another, wasting GPU resources. With 8 GPUs available, tests A, B, and C could all run simultaneously since they don't share any GPUs.
@copilot this looks good. Do we need multiple runners or would one runner be able to run all these in parallel?
One runner is sufficient. GitHub Actions will run multiple jobs from the same workflow run in parallel on a single runner as long as they have different concurrency group names.
With our implementation:
- Jobs with `gpu-0,1`, `gpu-2,3`, and `gpu-4,5,6,7` can all run simultaneously on the single runner
- The concurrency groups prevent GPU conflicts (e.g., two jobs both trying to use `gpu-0,1` will be serialized)
- The runner's 8 GPUs are utilized efficiently without needing additional runners
See the "Single Runner Architecture" section (lines 83-88) in the documentation for details.
@copilot remove this file
Removed in 320f0a3
@copilot your concurrency logic is broken. All tests are canceling each other. Use gh to see the CI logs.
@copilot these tests did not run. Please use gh to see which ones failed. Here are the ones I see failed:
- Iris Performance Regression Test / GEMM All-Scatter Producer-Consumer (pull_request)
- Iris Tests / Test ccl (2 ranks, git install) (pull_request)
- Iris Tests / Test ccl (4 ranks, git install) (pull_request)
- Iris Tests / Test examples (1 ranks, git install) (pull_request)
- Iris Tests / Test examples (4 ranks, git install) (pull_request)
- Iris Tests / Test examples (8 ranks, git install) (pull_request)
- Iris Tests / Test ops (2 ranks, git install) (pull_request)
- Iris Tests / Test unittests (1 ranks, git install) (pull_request)
- Iris Tests / Test unittests (2 ranks, git install) (pull_request)
- Iris Tests / Test unittests (8 ranks, git install) (pull_request)
- Iris Tests / Test x (1 ranks, git install) (pull_request)
- Iris Tests / Test x (4 ranks, git install) (pull_request)
- Iris Tests / Test x (8 ranks, git install) (pull_request)

Skipped checks:
- Iris Tests / Test ${{ matrix.test_dir }} (${{ matrix.num_ranks }} ranks, editable install) (pull_request)
- Iris Tests / Test ${{ matrix.test_dir }} (${{ matrix.num_ranks }} ranks, pip install) (pull_request)
Fixed in dcbb5eb - reverted all concurrency groups.
The concurrency feature was causing job cancellations because it's designed for serialization, not parallelization. GitHub Actions doesn't support parallel job execution on a single runner through workflow YAML.
To achieve true parallel execution with GPU isolation, you need to:
- Configure the runner for parallel jobs (runner config, not workflow YAML), OR
- Set up multiple runners on the machine, each with specific GPU access, OR
- Use an external job scheduler (SLURM, etc.) to manage GPU allocation
The workflows now run sequentially as before, which is the expected behavior for a standard single GitHub Actions runner.
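As a minimal sketch of why concurrency groups serialize rather than parallelize (the job name, group name, and script path below are hypothetical and not taken from this PR's commits): GitHub Actions allows at most one running and one pending job per concurrency group, and a newly queued job replaces any job already pending in that group. When several matrix entries share one group, the extra entries are therefore likely to be canceled rather than queued.

```yaml
# Hypothetical workflow fragment; all names are illustrative only.
jobs:
  test:
    runs-on: self-hosted
    strategy:
      matrix:
        test_dir: [examples, unittests, ccl]
    # All three matrix entries share this group, so only one runs at a time.
    # A second entry waits as pending; a third entry queued while the second
    # is still pending cancels that pending entry.
    concurrency:
      group: gpu-0-1
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4
      # Hypothetical test script, not part of this repository as far as the thread shows.
      - run: ./run_tests.sh "${{ matrix.test_dir }}"
```

Setting `cancel-in-progress: false` only protects the job that is already running; it does not keep more than one pending job alive in the group.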
@copilot restructure the yamls as needed to achieve concurrency.
Restructured in a8235a2 - workflows now use separate jobs for each GPU allocation group instead of a single matrix job.
Key changes:
- 12 separate jobs (4 GPU groups × 3 install methods) instead of 3 matrix jobs
- Jobs for different GPU sets (gpu-0-1, gpu-2-3, gpu-4-7, gpu-all) can run in parallel if the runner supports concurrent job execution
- No concurrency groups needed - parallel execution is enabled through job structure
Note: Actual parallel execution requires the self-hosted runner to be configured to accept multiple concurrent jobs. This restructuring enables it, but the runner must be configured accordingly.
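The job-per-GPU-group layout described above might look roughly like the following sketch. The job names and the five test suites come from this PR's description; the runner labels, the use of `HIP_VISIBLE_DEVICES` to restrict visible GPUs, `max-parallel: 1`, and the test script are assumptions made for illustration, not the contents of commit a8235a2.

```yaml
# Sketch only: labels, env var choice, and test command are assumptions.
jobs:
  test-git-gpu-0-1:
    runs-on: [self-hosted, gpu]          # hypothetical runner labels
    strategy:
      fail-fast: false
      max-parallel: 1                    # assumed: suites sharing GPUs 0,1 run one at a time
      matrix:
        test_dir: [examples, unittests, ccl, x, ops]
    env:
      HIP_VISIBLE_DEVICES: "0,1"         # assumed mechanism for GPU isolation
    steps:
      - uses: actions/checkout@v4
      - run: ./run_tests.sh "${{ matrix.test_dir }}"   # hypothetical script

  test-git-gpu-2-3:
    runs-on: [self-hosted, gpu]
    strategy:
      fail-fast: false
      max-parallel: 1
      matrix:
        test_dir: [examples, unittests, ccl, x, ops]
    env:
      HIP_VISIBLE_DEVICES: "2,3"
    steps:
      - uses: actions/checkout@v4
      - run: ./run_tests.sh "${{ matrix.test_dir }}"
```

Because the two jobs have no `needs` relationship, the scheduler is free to dispatch them at the same time. Whether they actually overlap depends on the host: a single runner process picks up one job at a time, so in practice this likely means registering more than one runner instance on the 8-GPU machine.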
CI jobs ran serially despite using non-overlapping GPU sets on our 8-GPU runner. This PR restructures the workflows to enable parallel execution of jobs using different GPU allocations.
Problem
Jobs using different GPU sets (e.g., GPUs 0,1 vs GPUs 2,3 vs GPUs 4,5,6,7) were running serially in a single matrix job, wasting GPU resources and CI time.
Solution
Restructured workflows to use separate jobs for each GPU allocation group instead of a single large matrix job. This enables GitHub Actions to schedule jobs independently for parallel execution when the runner supports concurrent jobs.
Changes Made - iris-tests.yml
Before: 3 jobs with large matrices (20 entries each) - all executed serially
After: 12 separate jobs grouped by GPU allocation (4 GPU groups × 3 install methods)
Job Structure:
Each job uses a matrix strategy to run 5 test suites: examples, unittests, ccl, x, ops
How Parallel Execution Works
Example Execution Flow
Git install phase (can run in parallel if runner configured for concurrent jobs):
- test-git-gpu-0-1 runs on GPUs 0,1
- test-git-gpu-2-3 runs on GPUs 2,3 (PARALLEL)
- test-git-gpu-4-7 runs on GPUs 4-7 (PARALLEL)
- test-git-gpu-all runs on all GPUs (after others if needed)

Editable and pip install phases: Same parallel pattern after previous phase completes
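One way the phase ordering could be expressed is with `needs`, sketched below. The four `test-git-*` job names are from the execution flow above; the `test-editable-gpu-0-1` job name and the placeholder `echo` steps are hypothetical, and the actual workflow in this PR may order its phases differently.

```yaml
# Sketch only: the editable-install job name and all steps are placeholders.
jobs:
  test-git-gpu-0-1:
    runs-on: self-hosted
    steps:
      - run: echo "git install tests on GPUs 0,1"
  test-git-gpu-2-3:
    runs-on: self-hosted
    steps:
      - run: echo "git install tests on GPUs 2,3"
  test-git-gpu-4-7:
    runs-on: self-hosted
    steps:
      - run: echo "git install tests on GPUs 4-7"
  test-git-gpu-all:
    runs-on: self-hosted
    # Waits for the partial-GPU jobs so it does not contend for their devices.
    needs: [test-git-gpu-0-1, test-git-gpu-2-3, test-git-gpu-4-7]
    steps:
      - run: echo "git install tests on all GPUs"

  test-editable-gpu-0-1:                 # hypothetical name for the next phase
    runs-on: self-hosted
    # Starts only after every job of the git install phase has finished.
    needs: [test-git-gpu-0-1, test-git-gpu-2-3, test-git-gpu-4-7, test-git-gpu-all]
    steps:
      - run: echo "editable install tests on GPUs 0,1"
```

By default a job with `needs` is skipped when any job it depends on fails, so a failure in the git install phase would skip the later phases unless the dependent jobs override this with an `if:` condition.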
Benefits
Runner Requirements
To fully utilize this structure:
Current Status