@ryanswann-amd (Collaborator)
Motivation

This PR integrates tritonBLAS's composable stage abstractions into the iris.ops fused communication+compute kernels. The goal is to leverage tritonBLAS's device-side APIs to reduce kernel code complexity.

Technical Details

Changes to iris.ops Kernels

The following iris.ops kernels now use tritonBLAS stages:

| Kernel | tritonBLAS components used |
| --- | --- |
| `matmul_all_gather.py` | `GemmContext`, `ScheduleContext`, `make_tensor_view` |
| `matmul_all_reduce.py` | `GemmContext`, `make_tensor_view`, `Tile` |
| `matmul_reduce_scatter.py` | `GemmContext`, `ScheduleContext`, `make_tensor_view` |
| `all_gather_matmul.py` | `GemmContext`, `ScheduleContext` |

Example Pattern

Before (custom GEMM implementation):

```python
# Manual accumulator init, dot products, and K-loop
acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
    a = tl.load(A_ptr + ...)
    b = tl.load(B_ptr + ...)
    acc = tl.dot(a, b, acc)
```

After (using tritonBLAS stages):

```python
# Create views and context
tensorA = make_tensor_view(A, M, K, stride_am, stride_ak)
tensorB = make_tensor_view(B, K, N, stride_bk, stride_bn)
gemm_ctx = GemmContext(BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, ...)

# Single call handles accumulator init, K-loop, and reduction
acc = gemm_ctx.reduce_axis(tensorA, tensorB, out_tile)
```
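For intuition, the reduction that `reduce_axis` encapsulates can be sketched in plain NumPy. This is an illustrative analogue only, not tritonBLAS's implementation: `blocked_matmul` is a hypothetical helper, and tile shapes and the real `reduce_axis` signature differ.

```python
import numpy as np

def blocked_matmul(A, B, block_k):
    """K-blocked GEMM: accumulate partial products over K in fixed-size
    chunks, mirroring the accumulator-init + K-loop that
    gemm_ctx.reduce_axis hides inside the kernel (sketch only)."""
    M, K = A.shape
    _, N = B.shape
    acc = np.zeros((M, N), dtype=np.float32)  # accumulator init
    for k0 in range(0, K, block_k):           # K-loop over blocks
        acc += A[:, k0:k0 + block_k] @ B[k0:k0 + block_k, :]
    return acc

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12)).astype(np.float32)
B = rng.standard_normal((12, 6)).astype(np.float32)
assert np.allclose(blocked_matmul(A, B, block_k=4), A @ B, atol=1e-4)
```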

Test Plan

Run all the tests in `tests/ops`.

Test Result

Tests passed

Submission Checklist

Copilot AI review requested due to automatic review settings February 4, 2026 23:57
@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Feb 4, 2026

Copilot AI left a comment


Pull request overview

This PR integrates tritonBLAS's composable stage abstractions into iris.ops fused communication+compute kernels to reduce code complexity and improve maintainability.

Changes:

- Replaces custom GEMM implementations with tritonBLAS's `GemmContext`, `ScheduleContext`, and `make_tensor_view` APIs
- Updates `TensorView` to store dimensions/strides as tensors instead of `constexpr` values, with a new `make_tensor_view` factory function
- Simplifies `all_reduce_two_shot` to use interleaved distribution instead of parameterized `start_tile`/`stride`
- Removes unnecessary parentheses in a lambda function
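For context on the interleaved distribution mentioned above, here is a minimal sketch of how such a tile assignment typically works; `interleaved_tiles` is a hypothetical helper, not iris's actual code. Each rank takes every `world_size`-th tile starting at its own rank index, so no explicit `start_tile`/`stride` parameters are needed.

```python
def interleaved_tiles(rank: int, world_size: int, num_tiles: int) -> list[int]:
    # Rank r owns tiles r, r + world_size, r + 2 * world_size, ...
    # (illustrative sketch of interleaved distribution, not iris's code)
    return list(range(rank, num_tiles, world_size))

# Every tile is owned by exactly one rank, with no start/stride parameters:
owned = sorted(t for r in range(4) for t in interleaved_tiles(r, 4, 10))
assert owned == list(range(10))
assert interleaved_tiles(1, 4, 10) == [1, 5, 9]
```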

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `iris/x/core.py` | Adds `make_tensor_view` factory function and updates `TensorView` to store tensor fields instead of `constexpr` |
| `iris/x/all_reduce.py` | Simplifies `all_reduce_two_shot` by removing `start_tile`/`stride` parameters and using interleaved distribution |
| `iris/x/__init__.py` | Exports the new `make_tensor_view` function |
| `iris/ops/matmul_reduce_scatter.py` | Replaces custom GEMM loop with tritonBLAS `GemmContext` and updates to use `make_tensor_view` |
| `iris/ops/matmul_all_reduce.py` | Replaces custom GEMM loop with tritonBLAS stages and restructures variant dispatch logic |
| `iris/ops/matmul_all_gather.py` | Replaces custom GEMM implementation with tritonBLAS `GemmContext` and `ScheduleContext` |
| `iris/ops/all_gather_matmul.py` | Replaces custom GEMM with tritonBLAS stages, adds `NUM_K_BLOCKS_LOCAL` parameter, and improves code organization |
| `examples/common/utils.py` | Removes unnecessary parentheses from a lambda function |

```diff
  rm, rn = out_tile.indices()

- c = acc.to(C.dtype.element_ty)
+ c = acc.to(C.type.element_ty)
```

Copilot AI Feb 4, 2026


Attribute access changed from C.dtype.element_ty to C.type.element_ty. Verify that C.type.element_ty is the correct attribute for the pointer type in Triton, as this differs from the original code pattern.

Suggested change:

```diff
- c = acc.to(C.type.element_ty)
+ c = acc.to(C.dtype.element_ty)
```


```diff
  # Convert to output dtype
- c = acc.to(C.dtype.element_ty)
+ c = acc.to(C.type.element_ty)
```

Copilot AI Feb 4, 2026


Attribute access changed from C.dtype.element_ty to C.type.element_ty. Verify that C.type.element_ty is the correct attribute for the pointer type in Triton, as this differs from the original code pattern.

Suggested change:

```diff
- c = acc.to(C.type.element_ty)
+ c = acc.to(C.dtype.element_ty)
```

Comment on lines +102 to +103
```python
# Promote tile_k to tensor (TileView expects tl.tensor for pid_n)
tile_k = pid_m * 0 + k_offset // BLOCK_SIZE_K
```

Copilot AI Feb 4, 2026


The expression pid_m * 0 + k_offset // BLOCK_SIZE_K uses an unusual pattern to convert to tensor. Consider using a more explicit method or adding a comment explaining why this approach is necessary for TileView.

Suggested change:

```diff
- # Promote tile_k to tensor (TileView expects tl.tensor for pid_n)
- tile_k = pid_m * 0 + k_offset // BLOCK_SIZE_K
+ # Promote scalar tile_k to tensor (TileView expects tl.tensor for pid_n)
+ tile_k = tl.full_like(pid_m, k_offset // BLOCK_SIZE_K)
```

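The equivalence behind this suggestion is easy to check outside Triton. In a NumPy analogue (NumPy stands in for `tl` here; all names and values are made up for illustration), multiplying an existing tensor by zero and adding a scalar broadcasts the scalar to that tensor's shape and dtype, which is what a `full_like`-style call expresses directly:

```python
import numpy as np

pid_m = np.arange(4)              # stand-in for a Triton tensor of tile indices
k_offset, BLOCK_SIZE_K = 256, 64  # illustrative values

# The "unusual" promotion pattern: broadcast a scalar via an existing tensor.
tile_k = pid_m * 0 + k_offset // BLOCK_SIZE_K

# The more explicit spelling suggested by the review.
tile_k_explicit = np.full_like(pid_m, k_offset // BLOCK_SIZE_K)

assert np.array_equal(tile_k, tile_k_explicit)
assert tile_k.tolist() == [4, 4, 4, 4]
```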
Comment on lines +126 to +127
```python
# Promote tile_k to tensor (TileView expects tl.tensor for pid_n)
tile_k = pid_m * 0 + k_offset // BLOCK_SIZE_K
```

Copilot AI Feb 4, 2026


The expression pid_m * 0 + k_offset // BLOCK_SIZE_K uses an unusual pattern to convert to tensor. Consider using a more explicit method or adding a comment explaining why this approach is necessary for TileView.

Suggested change:

```diff
- # Promote tile_k to tensor (TileView expects tl.tensor for pid_n)
- tile_k = pid_m * 0 + k_offset // BLOCK_SIZE_K
+ # Promote scalar K-tile index to tensor matching pid_m's shape for TileView
+ tile_k_scalar = k_offset // BLOCK_SIZE_K
+ tile_k = tl.full_like(pid_m, tile_k_scalar)
```

@mawad-amd (Collaborator)

Thanks, Ryan! You will need to change the SHA for the tests to run. Please check this: #350

```python
# Create DeviceContext and TensorView for gather operations
ctx = iris.x.DeviceContext(cur_rank, world_size, heap_bases)
src_view = iris.x.TensorView(A_sharded, M, K_local, stride_am, stride_ak)
iris_ctx = iris.x.DeviceContext(cur_rank, world_size, heap_bases)
```
Collaborator

There will be a small conflict here once you sync with main:

```python
ctx = iris.DeviceContext.initialize(context_tensor, cur_rank, world_size)
```

@mawad-amd (Collaborator) left a comment


This looks very neat. Thanks, Ryan!

@mawad-amd mawad-amd merged commit ff6ef71 into main Feb 7, 2026
72 checks passed
@mawad-amd mawad-amd deleted the ryaswann/use_tblas_stages branch February 7, 2026 14:07
