Conversation

Copilot AI commented Feb 3, 2026

Three core API modules (iris.ccl, iris.ops, iris.x) were undocumented in the reference section, making discovery difficult for users.

Changes

Collective Communication Library (iris.ccl)

  • Host-side tensor-level collectives: all_reduce, all_gather, all_to_all, reduce_scatter (see the sketch after this list)
  • Configuration via Config class with algorithm selection and tuning parameters
  • ReduceOp enum for reduction operations
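
A minimal host-side sketch of how these pieces fit together. This is hypothetical: the `shmem.ccl` handle, the `shmem.zeros` tensor factory, and the exact signatures are assumptions based on the bullets above, not verified against the shipped API.

```python
import torch
import iris
import iris.ccl as ccl  # assumption: import path mirrors iris.ops

shmem = iris.iris(heap_size=2**30)          # symmetric heap, as in the docs examples
x = shmem.zeros(1024, dtype=torch.float32)  # assumption: heap-backed tensor factory

# Hypothetical call shape: tensor-level all-reduce with an explicit
# reduction op and tuning config, using the names listed above.
cfg = ccl.Config()
shmem.ccl.all_reduce(x, op=ccl.ReduceOp.SUM, config=cfg)
```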

Fused GEMM+CCL Operations (iris.ops)

  • Computation-communication overlap primitives: matmul_all_reduce, all_gather_matmul, matmul_all_gather, matmul_reduce_scatter
  • Workspace management via FusedConfig and FusedWorkspace
  • Accessible through the shmem.ops namespace or standalone (see the sketch after this list)
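
A sketch of the fused path. The `iris.iris(heap_size=2**30)` constructor, the `import iris.ops as ops` form, and the `shmem.ops` namespace all appear in the docs under review; the tensor creation and the exact `matmul_all_reduce` signature are assumptions.

```python
import torch
import iris
import iris.ops as ops  # standalone import path quoted in the docs

shmem = iris.iris(heap_size=2**30)

# Assumption: heap-backed tensor factories; shapes are illustrative.
a = shmem.zeros((256, 128), dtype=torch.float16)
b = shmem.zeros((128, 256), dtype=torch.float16)

# Hypothetical call shape: fused GEMM + all-reduce, configured via the
# workspace/config classes named above.
config = ops.FusedConfig()
c = shmem.ops.matmul_all_reduce(a, b, config=config)
```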

Device-Side Tile-Level Primitives (iris.x)

  • Fine-grained control in Triton kernels via TileView, TensorView, DeviceContext (see the sketch after this list)
  • Five all-reduce algorithms: atomic, ring, two-shot, one-shot, spinlock
  • Algorithm selection via AllReduceConfig
  • Helper functions: tile_layout, tile_ptr, offset_ptr
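
A device-side sketch that already folds in the corrections from the review below: collectives are standalone functions taking `ctx` as the last argument, dimensions and strides are `tl.constexpr`, and algorithm variants are selected by integer code. The kernel shape itself is an assumption.

```python
import triton
import triton.language as tl
import iris

@triton.jit
def all_reduce_kernel(input_ptr, output_ptr, heap_bases,
                      rank: tl.constexpr, world_size: tl.constexpr,
                      M: tl.constexpr, N: tl.constexpr,
                      stride_m: tl.constexpr, stride_n: tl.constexpr,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Tile position/size and the source/destination memory layouts
    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    src_view = iris.x.TensorView(input_ptr, M, N, stride_m, stride_n)
    dst_view = iris.x.TensorView(output_ptr, M, N, stride_m, stride_n)

    # Rank, world size, and heap bases for cross-rank addressing
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Standalone collective with ctx as the last argument (per the review)
    iris.x.all_reduce_ring(tile, src_view, dst_view, ctx)
```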

Documentation Structure

```
docs/reference/
├── ccl/
│   ├── overview.md      # Usage patterns, async operations
│   ├── operations.md    # Autodoc for CCL methods
│   └── config.md        # Config and ReduceOp classes
├── ops/
│   ├── overview.md      # Fused operations, workspace management
│   ├── operations.md    # Autodoc for ops methods
│   └── config.md        # FusedConfig and FusedWorkspace
└── x/
    ├── overview.md      # Tile-level API, algorithm selection
    ├── core.md          # Core abstractions and helpers
    └── operations.md    # Device-side collectives
```

Each section includes usage examples and follows the existing Triton/Gluon API reference patterns.

Original prompt

This section details the original issue you should resolve

<issue_title>[Documentation]: Add API reference for iris.ccl, iris.ops and iris.x</issue_title>
<issue_description>### Description of errors

Attach any links, screenshots, or additional evidence you think will be helpful.

No response</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits February 3, 2026 02:39
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Copilot AI changed the title from [WIP] Add API reference for iris.ccl, iris.ops, and iris.x to Add API reference for iris.ccl, iris.ops, and iris.x modules Feb 3, 2026
Copilot AI requested a review from neoblizz February 3, 2026 02:44
@neoblizz neoblizz marked this pull request as ready for review February 3, 2026 02:54
@neoblizz neoblizz requested a review from mawad-amd as a code owner February 3, 2026 02:54
Copilot AI review requested due to automatic review settings February 3, 2026 02:54
@neoblizz neoblizz requested a review from BKP as a code owner February 3, 2026 02:54

Copilot AI left a comment


Pull request overview

This pull request adds comprehensive API reference documentation for three core Iris modules (iris.ccl, iris.ops, and iris.x) that were previously undocumented. The documentation follows the established pattern from existing Triton and Gluon reference sections.

Changes:

  • Added structured documentation for collective communication operations (CCL), fused GEMM+CCL operations (ops), and device-side tile-level primitives (x)
  • Created overview pages with usage patterns and examples for each module
  • Added autodoc-based API reference pages for operations, configuration classes, and core abstractions
  • Updated the main API reference navigation to include links to the new sections

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| docs/reference/api-reference.md | Added navigation entries for the three new modules in the main API reference index |
| docs/reference/ccl/overview.md | Overview of collective communication library with usage patterns and examples |
| docs/reference/ccl/operations.md | Autodoc references for CCL collective operations (all_reduce, all_gather, all_to_all, reduce_scatter) |
| docs/reference/ccl/config.md | Autodoc references for CCL configuration classes (Config, ReduceOp) |
| docs/reference/ops/overview.md | Overview of fused GEMM+CCL operations with usage patterns and workspace management examples |
| docs/reference/ops/operations.md | Autodoc references for fused operations (matmul_all_reduce, all_gather_matmul, matmul_all_gather, matmul_reduce_scatter) and OpsNamespace |
| docs/reference/ops/config.md | Autodoc references for ops configuration classes (FusedConfig, FusedWorkspace) |
| docs/reference/x/overview.md | Overview of device-side tile-level primitives with algorithm selection and usage patterns |
| docs/reference/x/core.md | Autodoc references for core abstractions (TileView, TensorView, DeviceContext, AllReduceConfig) and helper functions |
| docs/reference/x/operations.md | Autodoc references for device-side collective operations (all-reduce variants, all_gather, all_to_all, reduce_scatter, gather) |

Comment on lines +19 to +27
- [iris.ccl - Collective Communication](ccl/overview.md)
  - [Operations](ccl/operations.md)
  - [Configuration](ccl/config.md)
- [iris.ops - Fused GEMM+CCL](ops/overview.md)
  - [Operations](ops/operations.md)
  - [Configuration](ops/config.md)
- [iris.x - Device-Side Primitives](x/overview.md)
  - [Core Abstractions](x/core.md)
  - [Operations](x/operations.md)
Copilot AI Feb 3, 2026


The new documentation sections for iris.ccl, iris.ops, and iris.x are not listed in the Sphinx table of contents configuration file (docs/sphinx/_toc.yml). This means these pages will not appear in the generated documentation navigation and may not be built properly.

You need to add entries to docs/sphinx/_toc.yml under the "Reference" section, mirroring how the triton and gluon sections are structured, placed after the gluon section.

Comment on lines +30 to +92
```python
ctx.all_reduce(tile, src_view, dst_view)
```

## Core Abstractions

- **TileView**: Represents a tile's position and size in a 2D grid
- **TensorView**: Represents a tensor's memory layout (pointer, shape, strides)
- **DeviceContext**: Holds rank, world size, and heap bases for communication
- **AllReduceConfig**: Configuration for selecting all-reduce algorithms

## Usage Patterns

### Using DeviceContext (Recommended)

The `DeviceContext` provides a clean API for calling collectives:

```python
@triton.jit
def kernel(input_ptr, output_ptr, ...):
    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    src_view = iris.x.TensorView(input_ptr, M, N, stride_m, stride_n)
    dst_view = iris.x.TensorView(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Call collectives with default algorithms
    ctx.all_reduce(tile, src_view, dst_view)
    ctx.all_gather(tile, src_view, dst_view, dim=0)
    ctx.all_to_all(tile, src_view, dst_view, N_per_rank)
    ctx.reduce_scatter(tile, src_view, dst_view)
```

### Algorithm Selection

Use `AllReduceConfig` to select specific all-reduce algorithms:

```python
@triton.jit
def kernel(input_ptr, output_ptr, locks_ptr, ...):
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Use ring algorithm
    config = iris.x.AllReduceConfig("ring")
    ctx.all_reduce(tile, src_view, dst_view, config=config)

    # Use spinlock algorithm with locks
    config = iris.x.AllReduceConfig("spinlock", locks_ptr)
    tile_id = pid_m * num_tiles_n + pid_n
    ctx.all_reduce(tile, src_view, dst_view, config=config, tile_id=tile_id)
```

### Standalone Functions

You can also call operations directly without `DeviceContext`:

```python
@triton.jit
def kernel(input_ptr, output_ptr, ...):
    ctx = iris.x.DeviceContext(rank, world_size, heap_bases)

    # Call operations directly
    iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)
    iris.x.all_reduce_ring(tile, src_view, dst_view, ctx)
    iris.x.all_gather(tile, src_view, dst_view, dim, ctx)
```
Copilot AI Feb 3, 2026


The documentation shows DeviceContext methods being called (e.g., ctx.all_reduce(tile, src_view, dst_view)), but the actual API uses standalone functions where ctx is passed as the last parameter (e.g., iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)).

Based on the actual implementation in iris/x/core.py lines 466-489, DeviceContext does not have collective methods. The correct usage is to call the standalone functions from iris.x module, passing ctx as a parameter.

The examples throughout this file (lines 30, 55-58, 90-92) need to be corrected to match the actual API.
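
The correction in miniature, assuming the standalone signature quoted above:

```python
# As documented (incorrect): method call on DeviceContext
# ctx.all_reduce(tile, src_view, dst_view)

# Actual API: standalone function with ctx passed as the last parameter
iris.x.all_reduce_atomic(tile, src_view, dst_view, ctx)
```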

Comment on lines +70 to +75
```python
# Use ring algorithm
config = iris.x.AllReduceConfig("ring")
ctx.all_reduce(tile, src_view, dst_view, config=config)

# Use spinlock algorithm with locks
config = iris.x.AllReduceConfig("spinlock", locks_ptr)
```
Copilot AI Feb 3, 2026


The AllReduceConfig initialization examples show string arguments like AllReduceConfig("ring"), but the actual implementation in iris/x/core.py (lines 452-462) expects an integer variant_code (0-4) and a locks_ptr parameter.

The correct usage should be:

  • AllReduceConfig(1, locks_ptr) for ring (not AllReduceConfig("ring"))
  • AllReduceConfig(4, locks_ptr) for spinlock (not AllReduceConfig("spinlock", locks_ptr))

The mapping is documented in core.py lines 441-446:

  • 0 = atomic
  • 1 = ring
  • 2 = one_shot
  • 3 = two_shot
  • 4 = spinlock

All examples using AllReduceConfig need to use integer codes, not strings.

Suggested change:

```diff
-# Use ring algorithm
-config = iris.x.AllReduceConfig("ring")
-ctx.all_reduce(tile, src_view, dst_view, config=config)
-
-# Use spinlock algorithm with locks
-config = iris.x.AllReduceConfig("spinlock", locks_ptr)
+# Use ring algorithm (variant_code = 1)
+config = iris.x.AllReduceConfig(1, locks_ptr)
+ctx.all_reduce(tile, src_view, dst_view, config=config)
+
+# Use spinlock algorithm with locks (variant_code = 4)
+config = iris.x.AllReduceConfig(4, locks_ptr)
```

### Via shmem.ops namespace (recommended)

```python
shmem = iris.iris(heap_size)
```
Copilot AI Feb 3, 2026


The code example uses an undefined variable heap_size. It should either define it (e.g., heap_size = 2**30) or use a literal value like in the first example (line 14: iris.iris(heap_size=2**30)).
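
For example:

```python
heap_size = 2**30  # 1 GiB symmetric heap, matching the first example
shmem = iris.iris(heap_size)
```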

```python
import iris.ops as ops

shmem = iris.iris(heap_size)
```
Copilot AI Feb 3, 2026


The code example uses an undefined variable heap_size. It should either define it (e.g., heap_size = 2**30) or use a literal value like in the first example (line 14: iris.iris(heap_size=2**30)).

Comment on lines +16 to +17
```python
def my_kernel(input_ptr, output_ptr, M, N,
              rank, world_size, heap_bases,
```
Copilot AI Feb 3, 2026


The kernel signature is missing tl.constexpr annotations for the M and N parameters. TensorView requires dimensions and strides to be constexpr (as documented in iris/x/core.py lines 264-265 and 269-272).

The parameters M and N should be annotated as:

```python
def my_kernel(input_ptr, output_ptr, M: tl.constexpr, N: tl.constexpr,
              rank: tl.constexpr, world_size: tl.constexpr, heap_bases,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
```

Additionally, stride_m and stride_n should either be passed as constexpr parameters or computed from constexpr values. See tests/x/test_all_to_all.py lines 21-27 for the correct pattern.

Suggested change:

```diff
-def my_kernel(input_ptr, output_ptr, M, N,
-              rank, world_size, heap_bases,
+def my_kernel(input_ptr, output_ptr, M: tl.constexpr, N: tl.constexpr,
+              rank: tl.constexpr, world_size: tl.constexpr, heap_bases,
```
