Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor
Pull request overview
This PR introduces an optional handwritten Tensor Core GEMM path (targeting Ada) and threads it through training/runtime configuration so users can switch between cuBLASLt and the custom kernel.
Changes:
- Add `EMatmulBackend` plumbing across C++ training code, CLI, and Python bindings/config to select cuBLASLt vs the custom GEMM.
- Implement a custom TN GEMM kernel (`gemm_mma_tn`) for BF16/FP8→BF16 and dispatch it from `matmul`.
- Add/adjust tests and CI coverage (new GEMM unit test; RoPE test updated to match fp16 freqs).
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| train.cpp | Adds --custom-matmul CLI flag. |
| src/utilities/sol.cpp | Updates matmul calls for new backend argument. |
| src/training/model.h | Adds MatmulBackend to run state. |
| src/testing/test_utils.h | Adds float→fp16 helper used by tests. |
| src/testing/test-rope.cu | Updates RoPE test to use fp16 freqs. |
| src/testing/test-gemm.cpp | New unit tests comparing custom GEMM vs cuBLAS. |
| src/models/llama_model.h | Adds UseCustomMatmul option. |
| src/models/llama_model.cpp | Threads backend selection through all matmul call sites; initializes backend from options. |
| src/kernels/tensor_core_utils.cuh | New low-level helpers for ldmatrix/mma fragments. |
| src/kernels/matmul.cpp | Adds backend-aware dispatch and hooks custom GEMM. |
| src/kernels/kernels.h | Adds EMatmulBackend and extends matmul APIs with backend parameter. |
| src/kernels/kernels.cpp | Threads backend through Tensor-based matmul wrapper. |
| src/kernels/gemm_mma.cu | Implements the custom GEMM kernel and launcher. |
| src/binding/python/training.py | Adds custom_matmul to TrainingConfig. |
| src/binding/python/tests/run.py | Wires config → LLamaOptions; adds argparse flag. |
| src/binding/kernel_binding.cpp | Extends Python matmul binding with a backend parameter. |
| src/binding/binding.cpp | Exposes use_custom_matmul on LLamaOptions in Python. |
| scripts/train.py | Adds custom_matmul toggle and passes it into options. |
| CMakeLists.txt | Builds new kernel and new unit test. |
| .github/workflows/wheel.yml | Adds a Modal CI job exercising --custom-matmul. |
Contributor
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
src/testing/test-rope.cu:215
- The bf16 RoPE test now sends `half` (fp16) freqs to the GPU, but the CPU baseline still quantizes `h_freqs_f` with `round_bf16`. This makes the reference computation use different frequency precision than the kernel and can cause incorrect comparisons. Update the CPU baseline to quantize/emulate freqs in fp16 (or keep freqs in fp32) to match the kernel's `half* freqs_cis` contract.
```cpp
// Prepare freqs and quantize to fp16 (kernel expects fp16 freqs)
std::vector<float> h_freqs_f(size_freqs);
precompute_freqs_cis(h_freqs_f.data(), HD, T, 10000.0f);
std::vector<half> h_freqs_fp16 = to_fp16(h_freqs_f);
// CPU baseline with bf16 emulation: quantize inputs/freqs to bf16, do math in float, quantize outputs
std::vector<float> h_inp_q = round_bf16(h_inp_f);
std::vector<float> h_freqs_q = round_bf16(h_freqs_f);
```
0.5% speedup for 1.5B on 4x4090. YMMV for other models and cards.