
Conversation


Copilot AI commented Feb 7, 2026

Benchmarking code across examples/ and benchmark/ reimplements the same warmup loops, timing, statistics, and printing, roughly 100 lines of boilerplate per file. This PR adds iris.bench, shared benchmarking infrastructure that eliminates the duplication and standardizes measurements.

Changes

Core module (iris/bench.py)

  • BenchmarkResult: dataclass storing mean/p50 (median)/p99/min/max timings with JSON export (see the sketch after this list)
  • BenchmarkRunner: context manager for parameter sweeps with barrier support
  • @benchmark: decorator for simple function benchmarking
  • Utilities: torch_dtype_from_str(), compute_bandwidth_gbps()
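
A minimal sketch of what these pieces might look like; the field names and utility signatures below are assumptions for illustration, not confirmed by this PR:

from dataclasses import dataclass, asdict
import json

import torch

@dataclass
class BenchmarkResult:
    # Statistics named in the PR description; the exact field names are illustrative.
    name: str
    mean_ms: float
    p50_ms: float
    p99_ms: float
    min_ms: float
    max_ms: float

    def to_json(self) -> str:
        # JSON export mentioned above.
        return json.dumps(asdict(self))

def torch_dtype_from_str(name: str) -> torch.dtype:
    # Hypothetical mapping from common string names to torch dtypes.
    return {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}[name]

def compute_bandwidth_gbps(num_bytes: int, time_ms: float) -> float:
    # Hypothetical signature: bytes moved divided by elapsed seconds, in GB/s.
    return num_bytes / (time_ms * 1e-3) / 1e9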

Integration

  • Exposed via iris.bench in __init__.py
  • Internally uses the existing iris.do_bench for timing (see the sketch after this list)
  • Backward compatible—existing benchmarks unchanged
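
Illustrative only: a rough sketch of how BenchmarkRunner.run might delegate timing to the existing iris.do_bench. The internals and the exact do_bench signature are assumptions, inferred from the "Before" snippet in the Usage section below; context-manager plumbing is omitted:

import iris

class BenchmarkRunner:
    def __init__(self, name, barrier_fn=None):
        self.name = name
        self.barrier_fn = barrier_fn

    def run(self, fn, warmup=5, iters=50, params=None):
        bound = lambda: fn(**(params or {}))
        for _ in range(warmup):          # untimed warmup iterations
            bound()
        if self.barrier_fn is not None:  # synchronize ranks before timing
            self.barrier_fn()
        # iris.do_bench performs the timed repetitions and reports milliseconds.
        return iris.do_bench(bound, self.barrier_fn, n_repeat=iters)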

Testing & Documentation

  • test_bench.py: full suite (GPU required)
  • test_bench_basic.py: unit tests (no GPU)
  • API reference, migration guide, examples

Usage

Before (~100 lines):

parser.add_argument("-w", "--num_warmup", type=int, default=1)
# ... dtype conversion, manual warmup, timing, stats
triton_ms = iris.do_bench(run_exp, barrier, n_repeat=args["num_experiments"])
print(f"Time: {triton_ms:.4f} ms")

After (~50 lines):

from iris.bench import BenchmarkRunner

runner = BenchmarkRunner(name="gemm", barrier_fn=shmem.barrier)
result = runner.run(fn=operation, warmup=5, iters=50, params={"size": 1024})
result.print_summary()  # automatic mean/p50/p99
runner.save_json("results.json")  # CI-ready

Enables consistent CI performance tracking and reduces maintenance burden by centralizing benchmark infrastructure.

Original prompt

This section describes the original issue you should resolve

<issue_title>Unified benchmarking harness</issue_title>
<issue_description>Benchmarking code in Iris is currently scattered across benchmark/ and examples/, with each script re-implementing the same logic (warmup loops, synchronization, timing, averaging, printing). Over time this has led to copy-pasted code, inconsistent measurement patterns, and benchmarks that are hard to reuse or automate.

It would be useful to introduce a small, shared benchmarking harness (e.g. iris.bench) that standardizes:

  • warmup and iteration handling
  • timing and synchronization
  • basic statistics (mean / p50 / p99)
  • parameter sweeps
  • structured result output (e.g. JSON or dict)

This would allow both examples/ and benchmark/ to share the same timing infrastructure, while keeping example code focused on semantics rather than measurement boilerplate.

Example (sketch):

from iris.bench import benchmark

@benchmark(name="gemm_all_scatter", warmup=5, iters=50)
def run(size, world_size):
    # setup tensors
    # launch Iris kernel
    kernel(...)

Internally you can use iris.do_bench and any code we have. Such a harness would significantly reduce duplicated code, improve maintainability, and make it easier to add consistent benchmarks and eventually integrate CI performance tracking.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits February 7, 2026 20:52
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Copilot AI changed the title [WIP] Add unified benchmarking harness for Iris Add unified benchmarking harness (iris.bench) Feb 7, 2026
Copilot AI requested a review from mawad-amd February 7, 2026 20:57

# Simple decorator-based benchmarking
@benchmark(name="my_kernel", warmup=5, iters=50)
def run_kernel():
Collaborator

@copilot the decorator way is the only way we need. Remove everything else. Also, it is safe to assume that the bench harness will construct the iris instance and pass it to the user benchmark function. When using the decorator, the user will also need to annotate the parts of the code that are pre-setup (e.g. tensor allocation), per-run preamble (e.g. resetting flags), and the code to actually benchmark (the kernel launch).

- Remove BenchmarkRunner class - keep only @benchmark decorator
- Decorator now constructs iris instance and passes shmem to user function
- Add @setup, @preamble, @measure annotations for code organization
- Update all documentation to reflect decorator-only approach
- Simplify examples to demonstrate new pattern
- Update tests to match new API

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
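
A sketch of the decorator-only pattern requested above. The names @setup, @preamble, and @measure come from the commit summary; how they attach to the benchmarked function (shown here as nested functions) and the exact signatures are assumptions, not confirmed by this thread:

from iris.bench import benchmark, setup, preamble, measure

@benchmark(name="gemm_all_scatter", warmup=5, iters=50)
def run(shmem):  # the harness constructs the Iris instance and passes it in
    @setup
    def allocate():
        # one-time pre-setup, e.g. tensor allocation via shmem
        ...

    @preamble
    def reset():
        # per-iteration preamble, e.g. resetting flags
        ...

    @measure
    def launch():
        # the code that is actually timed, e.g. the kernel launch
        ...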

Development

Successfully merging this pull request may close these issues.

Unified benchmarking harness
