[MLAS][Kleidiai] Sve Gemm and IMatmul Integration #27643

Open
JonathanC-ARM wants to merge 5 commits into microsoft:main from JonathanC-ARM:jonclo01/sve_kleidi_integration

Conversation

@JonathanC-ARM
Contributor

Description

Adds initial Arm SVE enablement for the KleidiAI MLAS backend, including SVE
ukernel wiring, SGEMM dispatch/packing support, and an SVE convolution path
with runtime selection (prefer SME/SME2 when available, otherwise use SVE).
Also updates the KleidiAI dependency to a newer release (KleidiAI 1.22) to
pick up the required SVE kernels.
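
The runtime selection order described above can be illustrated with a minimal sketch. The names `CpuFeatures`, `KernelPath`, and `SelectKernelPath` are hypothetical and exist only for this illustration; the PR's actual dispatch lives in the MLAS platform code and may be structured differently.

```cpp
// Hypothetical capability flags, standing in for whatever the MLAS
// platform layer actually queries at startup.
struct CpuFeatures {
    bool has_sme2 = false;
    bool has_sme = false;
    bool has_sve = false;
};

// Hypothetical label for the backend chosen at runtime.
enum class KernelPath { Sme2, Sme, Sve, Generic };

// Prefer SME2/SME when available, otherwise SVE, otherwise the generic
// MLAS implementation (assumed ordering, taken from the description).
inline KernelPath SelectKernelPath(const CpuFeatures& cpu) {
    if (cpu.has_sme2) return KernelPath::Sme2;
    if (cpu.has_sme) return KernelPath::Sme;
    if (cpu.has_sve) return KernelPath::Sve;
    return KernelPath::Generic;
}
```

Under this sketch, a CPU reporting both SME and SVE takes the SME path, and the SVE path is reached only when neither SME flag is set.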

Motivation and Context

Enables KleidiAI acceleration on Arm systems that expose SVE but not SME/SME2,
reducing fallback to the generic MLAS implementations and broadening hardware
coverage. This is an initial bring-up focused on correctness and integration,
with some configuration limitations (e.g., SVE SGEMM currently targets non-
transposed inputs; SVE conv has capability constraints).

Laan33 and others added 4 commits March 13, 2026 13:12
Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com
Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com
Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com

mlas: Correct checks for early-exit fast path in sgemm_kleidiai.cpp

Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com

mlas: update ApplyAlphaBeta2D comment to reflect new control flow

Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com

mlas: add test case for batched K==0 in sgemm_kleidiai.cpp

Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
@hariharans29 hariharans29 requested a review from Copilot March 13, 2026 17:45
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Adds initial Arm SVE enablement for the KleidiAI MLAS backend, wiring in SVE SGEMM packing/dispatch and an SVE IMATMUL-based convolution path with runtime selection (prefer SME/SME2 when available, otherwise SVE). Also bumps the KleidiAI dependency to a version that provides the required SVE kernels.

Changes:

  • Extend MLAS platform dispatch to select KleidiAI overrides on SVE-only CPUs (and include SME2 in SME selection).
  • Add SVE variants for KleidiAI SGEMM (pack + batched GEMM dispatch) and convolution (SVE IMATMUL indirection path).
  • Expand FGEMM unit test coverage for batched short-path shapes and non-trivial alpha/beta cases.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/unittest/test_fgemm_fixture.h | Adds small batched FGEMM short-path coverage with varied alpha/beta. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Removes TransA gating so KleidiAI override can decide support; minor cleanup. |
| onnxruntime/core/mlas/lib/platform.cpp | Enables KleidiAI override selection for SME2 and SVE-only runtime. |
| onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp | Adds SVE SGEMM packer + batched GEMM path; refactors alpha/beta application helpers. |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Introduces UseSVE runtime feature flag. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Adds SVE IMATMUL convolution path and runtime selection between SME/SME2 vs SVE. |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.h | Adds SVE matmul/imatmul wrapper typedefs + getter declarations; small comment edits. |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.cpp | Registers SVE kernels and implements SVE getter functions. |
| cmake/deps.txt | Updates KleidiAI dependency to v1.22.0. |


Comment on lines +646 to +656
// Match SME alpha/beta behavior: apply beta per-batch when alpha==0 or K==0.
if (Data->alpha == 0.0f || K == 0) {
    if (BatchSize == 1) {
        ApplyBetaToC(Data->C, Data->ldc, M, N, Data->beta);
    } else {
        for (size_t batch = 0; batch < BatchSize; ++batch) {
            ApplyBetaToC(Data[batch].C, Data[batch].ldc, M, N, Data[batch].beta);
        }
    }
    return true;
}
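
For readers following the early-exit logic, here is a minimal, self-contained sketch of what a degenerate GEMM (alpha == 0 or K == 0) reduces to: C = beta * C for each batch entry. `ScaleCByBeta`, `GemmBatchItem`, and `TryDegenerateGemm` are hypothetical stand-ins, not the PR's `ApplyBetaToC` or MLAS data structures. The sketch also makes two assumptions of its own: it follows the BLAS convention that beta == 0 overwrites C with zeros, and it takes the early exit only when every batch entry has alpha == 0, which is a more conservative condition than the snippet above.

```cpp
#include <cstddef>

// Hypothetical stand-in for a beta-scaling helper: C = beta * C over an
// M x N block whose rows are ldc floats apart.
inline void ScaleCByBeta(float* C, std::size_t ldc, std::size_t M,
                         std::size_t N, float beta) {
    if (beta == 1.0f) return;  // C is left unchanged
    for (std::size_t m = 0; m < M; ++m) {
        float* row = C + m * ldc;
        for (std::size_t n = 0; n < N; ++n) {
            // beta == 0 overwrites instead of multiplying, so NaN/Inf
            // already present in C do not propagate (BLAS convention).
            row[n] = (beta == 0.0f) ? 0.0f : row[n] * beta;
        }
    }
}

// Hypothetical per-batch parameters (only the fields this sketch needs).
struct GemmBatchItem {
    float* C;
    std::size_t ldc;
    float alpha;
    float beta;
};

// Returns true when the whole batch was handled without running a kernel.
inline bool TryDegenerateGemm(const GemmBatchItem* data, std::size_t batch_size,
                              std::size_t M, std::size_t N, std::size_t K) {
    bool degenerate = (K == 0);
    if (!degenerate) {
        degenerate = true;
        for (std::size_t b = 0; b < batch_size; ++b) {
            if (data[b].alpha != 0.0f) { degenerate = false; break; }
        }
    }
    if (!degenerate) return false;
    // One loop covers batch_size == 1 as well; no special case is needed.
    for (std::size_t b = 0; b < batch_size; ++b) {
        ScaleCByBeta(data[b].C, data[b].ldc, M, N, data[b].beta);
    }
    return true;
}
```

Elements beyond column N in each row (the padding implied by ldc > N) are never touched in this sketch.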
Comment on lines +736 to +739
const size_t tile_elems = TileSizeM * TileSizeN;
g_kai_tls.output_tile.resize(tile_elems);
out_tile = g_kai_tls.output_tile.data();
out_row_stride_bytes = TileSizeN * sizeof(float);
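
The scratch-tile pattern in this snippet can be sketched in isolation as follows. `TileScratch`, `g_scratch`, `TileView`, `AcquireOutputTile`, and `CopyTileToC` are hypothetical names for this illustration; only the idea of a thread-local, resizable output tile with a byte row stride is taken from the snippet above, and the copy-back step is an assumption about how such a tile might be consumed.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-thread scratch storage, reused across calls.
struct TileScratch {
    std::vector<float> output_tile;
};
static thread_local TileScratch g_scratch;

// Pointer to the tile plus the distance between rows, in bytes.
struct TileView {
    float* data;
    std::size_t row_stride_bytes;
};

// Sizes the thread-local buffer for a tile_m x tile_n tile. resize() keeps
// existing elements, so stale values from an earlier call may remain; a
// kernel must write every element that is later read.
inline TileView AcquireOutputTile(std::size_t tile_m, std::size_t tile_n) {
    g_scratch.output_tile.resize(tile_m * tile_n);
    return {g_scratch.output_tile.data(), tile_n * sizeof(float)};
}

// Copies the valid rows x cols region of the tile into C (row stride ldc,
// in floats). Assumed follow-up step for partial edge tiles.
inline void CopyTileToC(const TileView& tile, std::size_t rows,
                        std::size_t cols, float* C, std::size_t ldc) {
    const std::size_t tile_ld = tile.row_stride_bytes / sizeof(float);
    for (std::size_t r = 0; r < rows; ++r) {
        std::memcpy(C + r * ldc, tile.data + r * tile_ld, cols * sizeof(float));
    }
}
```

Because the buffer is thread-local, concurrent worker threads do not share it, but the pointer returned by `AcquireOutputTile` is invalidated by the next call on the same thread that grows the vector.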
Comment on lines 23 to +24
#include "kai/ukernels/matmul/pack/kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme.h"
#include "kai/ukernels/matmul/pack/kai_rhs_pack_kxn_x32p4vlx1b_x32_x32_sve.h"
Comment on lines +71 to +73
const KaiF32SveIMatmulKernel GetKleidiAISveImatmulUKernel();

const KaiF32SveKernel GetKleidiAISveSGemmUKernel();

#include "kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4c32p/kai_matmul_clamp_f32_qai8dxp_qsi4c32p_interface.h"

// matmul interfaces