[MLAS][Kleidiai] Sve Gemm and IMatmul Integration #27643
Open
JonathanC-ARM wants to merge 5 commits into microsoft:main
Conversation
Signed-off-by: Cathal Lawlor cathal.lawlor@arm.com
- mlas: Correct checks for early-exit fast path in sgemm_kleidiai.cpp
- mlas: update ApplyAlphaBeta2D comment to reflect new control flow
- mlas: add test case for batched K==0 in sgemm_kleidiai.cpp
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Member
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Contributor
Pull request overview
Adds initial Arm SVE enablement for the KleidiAI MLAS backend, wiring in SVE SGEMM packing/dispatch and an SVE IMATMUL-based convolution path with runtime selection (prefer SME/SME2 when available, otherwise SVE). Also bumps the KleidiAI dependency to a version that provides the required SVE kernels.
Changes:
- Extend MLAS platform dispatch to select KleidiAI overrides on SVE-only CPUs (and include SME2 in SME selection).
- Add SVE variants for KleidiAI SGEMM (pack + batched GEMM dispatch) and convolution (SVE IMATMUL indirection path).
- Expand FGEMM unit test coverage for batched short-path shapes and non-trivial alpha/beta cases.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_fgemm_fixture.h | Adds small batched FGEMM short-path coverage with varied alpha/beta. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Removes TransA gating so KleidiAI override can decide support; minor cleanup. |
| onnxruntime/core/mlas/lib/platform.cpp | Enables KleidiAI override selection for SME2 and SVE-only runtime. |
| onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp | Adds SVE SGEMM packer + batched GEMM path; refactors alpha/beta application helpers. |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Introduces UseSVE runtime feature flag. |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Adds SVE IMATMUL convolution path and runtime selection between SME/SME2 vs SVE. |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.h | Adds SVE matmul/imatmul wrapper typedefs + getter declarations; small comment edits. |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.cpp | Registers SVE kernels and implements SVE getter functions. |
| cmake/deps.txt | Updates KleidiAI dependency to v1.22.0. |
Comment on lines +646 to +656:

```cpp
// Match SME alpha/beta behavior: apply beta per-batch when alpha==0 or K==0.
if (Data->alpha == 0.0f || K == 0) {
    if (BatchSize == 1) {
        ApplyBetaToC(Data->C, Data->ldc, M, N, Data->beta);
    } else {
        for (size_t batch = 0; batch < BatchSize; ++batch) {
            ApplyBetaToC(Data[batch].C, Data[batch].ldc, M, N, Data[batch].beta);
        }
    }
    return true;
}
```
Comment on lines +736 to +739:

```cpp
const size_t tile_elems = TileSizeM * TileSizeN;
g_kai_tls.output_tile.resize(tile_elems);
out_tile = g_kai_tls.output_tile.data();
out_row_stride_bytes = TileSizeN * sizeof(float);
```
Comment on lines 23 to +24:

```cpp
#include "kai/ukernels/matmul/pack/kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme.h"
#include "kai/ukernels/matmul/pack/kai_rhs_pack_kxn_x32p4vlx1b_x32_x32_sve.h"
```
Comment on lines +71 to +73:

```cpp
const KaiF32SveIMatmulKernel GetKleidiAISveImatmulUKernel();

const KaiF32SveKernel GetKleidiAISveSGemmUKernel();

#include "kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4c32p/kai_matmul_clamp_f32_qai8dxp_qsi4c32p_interface.h"

// matmul interfaces
```
Description
Adds initial Arm SVE enablement for the KleidiAI MLAS backend, including SVE
ukernel wiring, SGEMM dispatch/packing support, and an SVE convolution path
with runtime selection (prefer SME/SME2 when available, otherwise use SVE).
Also updates the KleidiAI dependency to v1.22.0 to pick up the required SVE
kernels.
Motivation and Context
Enables KleidiAI acceleration on Arm systems that expose SVE but not SME/SME2,
reducing fallback to the generic MLAS implementations and broadening hardware
coverage. This is an initial bring-up focused on correctness and integration,
with some configuration limitations (e.g., SVE SGEMM currently targets
non-transposed inputs; SVE conv has capability constraints).