Add API reference for iris.ccl, iris.ops, and iris.x modules #345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Pull request overview
This pull request adds comprehensive API reference documentation for three core Iris modules (iris.ccl, iris.ops, and iris.x) that were previously undocumented. The documentation follows the established pattern from existing Triton and Gluon reference sections.
Changes:
- Added structured documentation for collective communication operations (CCL), fused GEMM+CCL operations (ops), and device-side tile-level primitives (x)
- Created overview pages with usage patterns and examples for each module
- Added autodoc-based API reference pages for operations, configuration classes, and core abstractions
- Updated the main API reference navigation to include links to the new sections
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Summary per file:

| File | Description |
|---|---|
| docs/reference/api-reference.md | Added navigation entries for the three new modules in the main API reference index |
| docs/reference/ccl/overview.md | Overview of collective communication library with usage patterns and examples |
| docs/reference/ccl/operations.md | Autodoc references for CCL collective operations (`all_reduce`, `all_gather`, `all_to_all`, `reduce_scatter`) |
| docs/reference/ccl/config.md | Autodoc references for CCL configuration classes (`Config`, `ReduceOp`) |
| docs/reference/ops/overview.md | Overview of fused GEMM+CCL operations with usage patterns and workspace management examples |
| docs/reference/ops/operations.md | Autodoc references for fused operations (`matmul_all_reduce`, `all_gather_matmul`, `matmul_all_gather`, `matmul_reduce_scatter`) and `OpsNamespace` |
| docs/reference/ops/config.md | Autodoc references for ops configuration classes (`FusedConfig`, `FusedWorkspace`) |
| docs/reference/x/overview.md | Overview of device-side tile-level primitives with algorithm selection and usage patterns |
| docs/reference/x/core.md | Autodoc references for core abstractions (`TileView`, `TensorView`, `DeviceContext`, `AllReduceConfig`) and helper functions |
| docs/reference/x/operations.md | Autodoc references for device-side collective operations (all-reduce variants, `all_gather`, `all_to_all`, `reduce_scatter`, `gather`) |
Added to docs/reference/api-reference.md:

- [iris.ccl - Collective Communication](ccl/overview.md)
  - [Operations](ccl/operations.md)
  - [Configuration](ccl/config.md)
- [iris.ops - Fused GEMM+CCL](ops/overview.md)
  - [Operations](ops/operations.md)
  - [Configuration](ops/config.md)
- [iris.x - Device-Side Primitives](x/overview.md)
  - [Core Abstractions](x/core.md)
  - [Operations](x/operations.md)
**Copilot AI** commented on Feb 3, 2026:
The new documentation sections for iris.ccl, iris.ops, and iris.x are not included in the Sphinx table of contents configuration file (docs/sphinx/_toc.yml). This means these pages will not be included in the generated documentation navigation and may not be properly built.
You need to add entries to docs/sphinx/_toc.yml under the "Reference" section, similar to how triton and gluon are structured. The new entries should be added after the gluon section, following the same pattern.
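Below is a rough sketch of what such entries could look like, assuming docs/sphinx/_toc.yml follows the sphinx-external-toc `subtrees`/`entries` schema; the exact keys, paths, and titles are assumptions and should be copied from the existing triton and gluon entries rather than from this sketch.

```yaml
# Hypothetical _toc.yml fragment, placed after the gluon entries.
# Keys and file paths are assumptions; mirror the existing structure.
- file: reference/ccl/overview
  title: iris.ccl - Collective Communication
  subtrees:
    - entries:
        - file: reference/ccl/operations
        - file: reference/ccl/config
- file: reference/ops/overview
  title: iris.ops - Fused GEMM+CCL
  subtrees:
    - entries:
        - file: reference/ops/operations
        - file: reference/ops/config
- file: reference/x/overview
  title: iris.x - Device-Side Primitives
  subtrees:
    - entries:
        - file: reference/x/core
        - file: reference/x/operations
```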
Excerpt from docs/reference/x/overview.md:

```python
    ctx.all_reduce(tile, src_view, dst_view)
```

## Core Abstractions

- **TileView**: Represents a tile's position and size in a 2D grid
- **TensorView**: Represents a tensor's memory layout (pointer, shape, strides)
- **DeviceContext**: Holds rank, world size, and heap bases for communication
- **AllReduceConfig**: Configuration for selecting all-reduce algorithms

## Usage Patterns

### Using DeviceContext (Recommended)

The `DeviceContext` provides a clean API for calling collectives:

```python
@triton.jit
def kernel(input_ptr, output_ptr, ...):
    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    src_view = iris.x.TensorView(input_ptr, M, N, stride_m, stride_n)
    dst_view = iris.x.TensorView(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Call collectives with default algorithms
    ctx.all_reduce(tile, src_view, dst_view)
    ctx.all_gather(tile, src_view, dst_view, dim=0)
    ctx.all_to_all(tile, src_view, dst_view, N_per_rank)
    ctx.reduce_scatter(tile, src_view, dst_view)
```

### Algorithm Selection

Use `AllReduceConfig` to select specific all-reduce algorithms:

```python
@triton.jit
def kernel(input_ptr, output_ptr, locks_ptr, ...):
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Use ring algorithm
    config = iris.x.AllReduceConfig("ring")
    ctx.all_reduce(tile, src_view, dst_view, config=config)

    # Use spinlock algorithm with locks
    config = iris.x.AllReduceConfig("spinlock", locks_ptr)
    tile_id = pid_m * num_tiles_n + pid_n
    ctx.all_reduce(tile, src_view, dst_view, config=config, tile_id=tile_id)
```

### Standalone Functions

You can also call operations directly without `DeviceContext`:

```python
@triton.jit
def kernel(input_ptr, output_ptr, ...):
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Call operations directly
    iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)
    iris.x.all_reduce_ring(tile, src_view, dst_view, ctx)
    iris.x.all_gather(tile, src_view, dst_view, dim, ctx)
```
**Copilot AI** commented on Feb 3, 2026:
The documentation shows `DeviceContext` methods being called (e.g., `ctx.all_reduce(tile, src_view, dst_view)`), but the actual API uses standalone functions where `ctx` is passed as the last parameter (e.g., `iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)`).
Based on the actual implementation in iris/x/core.py lines 466-489, `DeviceContext` does not have collective methods. The correct usage is to call the standalone functions from the `iris.x` module, passing `ctx` as a parameter.
The examples throughout this file (lines 30, 55-58, 90-92) need to be corrected to match the actual API.
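For illustration, a minimal sketch of how the overview example might read once corrected. The `TileView`/`TensorView`/`DeviceContext` argument orders are taken from the documentation under review and the standalone-call form from this comment, so treat the exact signatures as assumptions to verify against iris/x/core.py.

```python
import triton
import triton.language as tl
import iris

# Hypothetical corrected kernel: collectives are standalone iris.x functions
# that receive the DeviceContext as their last argument (per this comment).
@triton.jit
def kernel(input_ptr, output_ptr,
           M: tl.constexpr, N: tl.constexpr,
           stride_m: tl.constexpr, stride_n: tl.constexpr,
           rank: tl.constexpr, world_size: tl.constexpr, heap_bases,
           BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    src_view = iris.x.TensorView(input_ptr, M, N, stride_m, stride_n)
    dst_view = iris.x.TensorView(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # DeviceContext carries rank/world_size/heap_bases but has no collective
    # methods; pass it explicitly to the standalone operation.
    iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)
```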
Referenced lines from the Algorithm Selection example:

    # Use ring algorithm
    config = iris.x.AllReduceConfig("ring")
    ctx.all_reduce(tile, src_view, dst_view, config=config)

    # Use spinlock algorithm with locks
    config = iris.x.AllReduceConfig("spinlock", locks_ptr)
**Copilot AI** commented on Feb 3, 2026:
The `AllReduceConfig` initialization examples show string arguments like `AllReduceConfig("ring")`, but the actual implementation in iris/x/core.py (lines 452-462) expects an integer `variant_code` (0-4) and a `locks_ptr` parameter.
The correct usage should be:
- `AllReduceConfig(1, locks_ptr)` for ring (not `AllReduceConfig("ring")`)
- `AllReduceConfig(4, locks_ptr)` for spinlock (not `AllReduceConfig("spinlock", locks_ptr)`)
The mapping is documented in core.py lines 441-446:
- 0 = atomic
- 1 = ring
- 2 = one_shot
- 3 = two_shot
- 4 = spinlock
All examples using AllReduceConfig need to use integer codes, not strings.
Suggested change:

    -    # Use ring algorithm
    -    config = iris.x.AllReduceConfig("ring")
    -    ctx.all_reduce(tile, src_view, dst_view, config=config)
    -    # Use spinlock algorithm with locks
    -    config = iris.x.AllReduceConfig("spinlock", locks_ptr)
    +    # Use ring algorithm (variant_code = 1)
    +    config = iris.x.AllReduceConfig(1, locks_ptr)
    +    ctx.all_reduce(tile, src_view, dst_view, config=config)
    +    # Use spinlock algorithm with locks (variant_code = 4)
    +    config = iris.x.AllReduceConfig(4, locks_ptr)
### Via shmem.ops namespace (recommended)

```python
shmem = iris.iris(heap_size)
```
**Copilot AI** commented on Feb 3, 2026:
The code example uses an undefined variable `heap_size`. It should either define it (e.g., `heap_size = 2**30`) or use a literal value like in the first example (line 14: `iris.iris(heap_size=2**30)`).
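A minimal sketch of the suggested fix, defining `heap_size` before use (the 1 GiB value is just the example from this comment, not a required size):

```python
import iris

heap_size = 2 ** 30  # define the heap size before constructing the Iris context
shmem = iris.iris(heap_size)
```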
```python
import iris.ops as ops

shmem = iris.iris(heap_size)
```
**Copilot AI** commented on Feb 3, 2026:
The code example uses an undefined variable `heap_size`. It should either define it (e.g., `heap_size = 2**30`) or use a literal value like in the first example (line 14: `iris.iris(heap_size=2**30)`).
Referenced lines:

    def my_kernel(input_ptr, output_ptr, M, N,
                  rank, world_size, heap_bases,
**Copilot AI** commented on Feb 3, 2026:
The kernel signature is missing `tl.constexpr` annotations for the `M`, `N` parameters. `TensorView` requires dimensions and strides to be constexpr (as documented in iris/x/core.py lines 264-265 and 269-272).
The parameters `M` and `N` should be annotated as:

    def my_kernel(input_ptr, output_ptr, M: tl.constexpr, N: tl.constexpr,
                  rank: tl.constexpr, world_size: tl.constexpr, heap_bases,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):

Additionally, `stride_m` and `stride_n` should either be passed as constexpr parameters or computed from constexpr values. See tests/x/test_all_to_all.py lines 21-27 for the correct pattern.
Suggested change:

    -def my_kernel(input_ptr, output_ptr, M, N,
    -              rank, world_size, heap_bases,
    +def my_kernel(input_ptr, output_ptr, M: tl.constexpr, N: tl.constexpr,
    +              rank: tl.constexpr, world_size: tl.constexpr, heap_bases,
Three core API modules (`iris.ccl`, `iris.ops`, `iris.x`) were undocumented in the reference section, making discovery difficult for users.

### Changes

**Collective Communication Library (`iris.ccl`)**
- `all_reduce`, `all_gather`, `all_to_all`, `reduce_scatter`
- `Config` class with algorithm selection and tuning parameters
- `ReduceOp` enum for reduction operations

**Fused GEMM+CCL Operations (`iris.ops`)**
- `matmul_all_reduce`, `all_gather_matmul`, `matmul_all_gather`, `matmul_reduce_scatter`
- `FusedConfig` and `FusedWorkspace`
- `shmem.ops` namespace or standalone

**Device-Side Tile-Level Primitives (`iris.x`)**
- `TileView`, `TensorView`, `DeviceContext`
- `AllReduceConfig`
- `tile_layout`, `tile_ptr`, `offset_ptr`

### Documentation Structure

Each section includes usage examples and follows the existing Triton/Gluon API reference patterns.
Original prompt
`iris.ccl`, `iris.ops` and `iris.x` (#344)