Optimize gf N-vect dot product SVE functions #367

Closed

AWSjswinney wants to merge 2 commits into intel:master from AWSjswinney:jswinney/2025-10-14-sve2-pr

Conversation

@AWSjswinney
Contributor

aarch64/erasure_code: SVE intrinsics implementation

Replace hand-written SVE assembly implementations of n-vector dot
product functions with optimized SVE intrinsics that provide better
performance through 4x loop unrolling.

Key improvements over the original SVE assembly:

  • 4x unrolled loops processing 64 bytes per iteration (vs single vector)
  • Unified implementation supports 1-7 vector operations by using the
    compiler to generate each version.
  • The compiler also generates SVE2 versions of the same functions which
    make use of the EOR3 instruction.

The implementation maintains the existing nibble-based Galois Field
multiplication with 32-byte lookup tables while adding significant
performance optimizations.

reverts: aedcd37
This change also reverts the above commit, which configured systems with
an SVE vector width of 128 bits to use the NEON path instead. NEON was
faster because it had more unrolling, but the SVE path now has the same
level of unrolling, and the availability of SVE2 makes that path faster
still on systems that support it.

aarch64: Optimize SVE encode functions to use peak-performance vector combinations

Update both ec_encode_data_sve() and ec_encode_data_sve2() to use optimal
4 and 5 vector combinations based on benchmark results showing these
achieve the highest performance.

Key optimizations:

  • Loop over 4-vector operations when rows > 7 (peak performance)
  • Use 4+3 combination for 7 vectors instead of single 7-vector call
  • Use 4+2 combination for 6 vectors instead of single 6-vector call
  • Keep 5-vector for 5 vectors (second-best performance)
  • Applies to both SVE and SVE2 variants for consistent optimization

This leverages the benchmark findings that 4 and 5 vector operations
achieve 40+ GB/s performance, significantly better than 6-7 vector
operations which drop to 30-36 GB/s.

@AWSjswinney
Contributor Author

This is intended to replace #349.

Performance data is attached. This change significantly improves performance on Graviton4 (Neoverse V2), with most benchmarks gaining +28-32% over my previous proposal in #349. I tested GCC 10-15; while there is some variation across compiler versions, the benefit outweighs the variability. (I didn't include that data to avoid posting too much detail.)

The second set of plots shows the performance gain from the baseline (before any of my changes). Graviton4 gets +40-97%.

The final set of plots shows the impact on Graviton3 (Neoverse V1). There are some regressions in some of the tests, but I would argue the simpler implementation and the gains in other tests make it worth it.

11b9d9f7-db9e-4acf-a508-f31717aa7f27.pdf

aarch64/erasure_code: SVE intrinsics implementation

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
aarch64: Optimize SVE encode functions to use peak-performance vector combinations


Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
@AWSjswinney force-pushed the jswinney/2025-10-14-sve2-pr branch from 2859288 to b01e834 on October 16, 2025 at 18:51
@AWSjswinney
Contributor Author

I haven't been able to reproduce the failure on macOS, but I pushed a change which I'm hoping will do the trick.

@pablodelara
Contributor

This is intended to replace #349. […]

So #349 can be closed? Thanks

@AWSjswinney
Contributor Author

So #349 can be closed? Thanks

Yes. I just closed it!

@pablodelara
Contributor

@liuqinfei could you review this PR?

@liuqinfei
Contributor

@liuqinfei could you review this PR?

of course.

@liuqinfei
Contributor

The EC dot product code has been simplified considerably, which is a positive change, but it is now more abstract and takes some effort to follow. Additionally, I need to validate the performance impact thoroughly before approving the changes.

@AWSjswinney
Contributor Author

I made the changes without fully understanding how it works. Perhaps we could add some comments to make it easier to follow?

@pablodelara
Contributor

I made the changes without fully understanding how it works. Perhaps we could add some comments to make it easier to follow?

Maybe @docularxu can help with this.

@cielavenir
Contributor

cielavenir commented Jan 24, 2026

@pablodelara #303 (comment) (my featSME2_CI branch) could help, except for the smstart overhead.

@cielavenir mentioned this pull request on Jan 24, 2026
@cielavenir
Contributor

it's now more abstract and takes effort to follow

I suggest introducing a .h, e.g. https://github.com/cielavenir/isa-l/blob/b2080801220ca00d2e6ad0fc73489f30bb75a523/erasure_code/aarch64/gf_nvect_dot_prod_sve.c#L64

@pablodelara
Contributor

I would like @liuqinfei's review again on this

@cielavenir
Contributor

I suggest introducing .h

I prepared https://github.com/cielavenir/isa-l/commits/jswinney/2025-10-14-sve2-pr/, which contains only this suggestion of mine.

@liuqinfei
Contributor

I’ve reviewed the AWSjswinney:jswinney/2025-10-14-sve2-pr branch and it looks good to me. @cielavenir could you please integrate the SME-related code into PR #303?

@cielavenir
Contributor

Done, cf. #303 (comment) @liuqinfei

@cielavenir
Contributor

@liuqinfei Should I make another pull request from https://github.com/cielavenir/isa-l/commits/jswinney/2025-10-14-sve2-pr/ ?

@pablodelara
Contributor

@liuqinfei is this PR OK to merge as is??

@liuqinfei
Contributor

@liuqinfei is this PR OK to merge as is??

Yes.

@pablodelara
Contributor

This is merged now, thanks!
