Optimize gf N-vect dot product SVE functions #367

Closed

AWSjswinney wants to merge 2 commits into intel:master from AWSjswinney:jswinney/2025-10-14-sve2-pr

Conversation

@AWSjswinney
Contributor

aarch64/erasure_code: SVE intrinsics implementation

Replace hand-written SVE assembly implementations of n-vector dot
product functions with optimized SVE intrinsics that provide better
performance through 4x loop unrolling.

Key improvements over the original SVE assembly:

  • 4x unrolled loops processing 64 bytes per iteration (vs single vector)
  • Unified implementation supports 1-7 vector operations by using the
    compiler to generate each version.
  • The compiler also generates SVE2 versions of the same functions which
    make use of the EOR3 instruction.

The implementation maintains the existing nibble-based Galois Field
multiplication with 32-byte lookup tables while adding significant
performance optimizations.

reverts: aedcd37
This change also reverts the above commit, which configured systems with
an SVE vector width of 128 bits to use the NEON path instead. NEON was
faster because it had more unrolling, but the SVE path now has the same
level of unrolling, and the availability of SVE2 makes that path faster
still on systems that support it.

aarch64: Optimize SVE encode functions to use peak-performance vector combinations

Update both ec_encode_data_sve() and ec_encode_data_sve2() to use optimal
4 and 5 vector combinations based on benchmark results showing these
achieve the highest performance.

Key optimizations:

  • Loop over 4-vector operations when rows > 7 (peak performance)
  • Use 4+3 combination for 7 vectors instead of single 7-vector call
  • Use 4+2 combination for 6 vectors instead of single 6-vector call
  • Keep 5-vector for 5 vectors (second-best performance)
  • Applies to both SVE and SVE2 variants for consistent optimization

This leverages the benchmark findings that 4 and 5 vector operations
achieve 40+ GB/s performance, significantly better than 6-7 vector
operations which drop to 30-36 GB/s.

@AWSjswinney
Contributor Author

This is intended to replace #349.

Performance data is attached. This change significantly improves performance on Graviton4 (Neoverse V2), with most benchmarks gaining +28-32% over my previous proposal in #349. I tested GCC 10-15; while there is some variation across compiler versions, the benefit outweighs the variability. (I didn't include that data to avoid posting too much detail.)

The second set of plots shows the performance gain from the baseline (before any of my changes). Graviton4 gets +40-97%.

The final set of plots shows the impact on Graviton3 (Neoverse V1). There are some regressions in some of the tests, but I would argue the simpler implementation and the gains in other tests make it worth it.

11b9d9f7-db9e-4acf-a508-f31717aa7f27.pdf

aarch64/erasure_code: SVE intrinsics implementation

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
aarch64: Optimize SVE encode functions to use peak-performance vector combinations


Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
@AWSjswinney force-pushed the jswinney/2025-10-14-sve2-pr branch from 2859288 to b01e834 on October 16, 2025 at 18:51
@AWSjswinney
Contributor Author

I haven't been able to reproduce the failure on macOS, but I pushed a change which I'm hoping will do the trick.

@pablodelara
Contributor

This is intended to replace #349. […]

So #349 can be closed? Thanks

@AWSjswinney
Contributor Author

So #349 can be closed? Thanks

Yes. I just closed it!

@pablodelara
Contributor

@liuqinfei could you review this PR?

@liuqinfei
Contributor

@liuqinfei could you review this PR?

of course.

@liuqinfei
Contributor

The EC dot product code has been simplified considerably, which is a positive change, but it is now more abstract and takes some effort to follow. Additionally, I need to validate the performance impact thoroughly before approving the changes.

@AWSjswinney
Contributor Author

I made the changes without fully understanding how it works. Perhaps we could add some comments to make it easier to follow?

@pablodelara
Contributor

I made the changes without fully understanding how it works. Perhaps we could add some comments to make it easier to follow?

Maybe @docularxu can help with this.

@cielavenir
Contributor

cielavenir commented Jan 24, 2026

@pablodelara #303 (comment) (my featSME2_CI branch) could help, except for the smstart overhead.

@cielavenir mentioned this pull request on Jan 24, 2026
@cielavenir
Contributor

it's now more abstract and takes effort to follow

I suggest introducing a .h, e.g. https://github.com/cielavenir/isa-l/blob/b2080801220ca00d2e6ad0fc73489f30bb75a523/erasure_code/aarch64/gf_nvect_dot_prod_sve.c#L64

@pablodelara
Contributor

I would like @liuqinfei's review again on this

@cielavenir
Contributor

I suggest introducing .h

I prepared https://github.com/cielavenir/isa-l/commits/jswinney/2025-10-14-sve2-pr/, which contains only this suggestion of mine.

@liuqinfei
Contributor

I’ve reviewed the AWSjswinney:jswinney/2025-10-14-sve2-pr branch and it looks good to me. @cielavenir could you please integrate the SME-related code into PR #303?

@cielavenir
Contributor

Done, cf. #303 (comment) @liuqinfei

@cielavenir
Contributor

@liuqinfei Should I make another pull request from https://github.com/cielavenir/isa-l/commits/jswinney/2025-10-14-sve2-pr/ ?

@pablodelara
Contributor

@liuqinfei is this PR OK to merge as is??

@liuqinfei
Contributor

@liuqinfei is this PR OK to merge as is??

Yes.

@pablodelara
Contributor

This is merged now, thanks!
