Use +sme for Apple by cielavenir · Pull Request #303 · intel/isa-l

cielavenir · 2024-11-09T01:02:49Z

Today I went to biccamera ( 😂 ) and checked hw.optional. Then I found FEAT_SME but not FEAT_SVE. ¹

This means that for Apple, the +sve code has to be compiled with +sme instead.
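For anyone who wants to repeat the check programmatically, here is a minimal sketch of reading such flags with `sysctlbyname`. The full key names (`hw.optional.arm.FEAT_SME`, `hw.optional.arm.FEAT_SVE`) and the helper names are assumptions for illustration; compare them against the output of `sysctl hw.optional` on the target machine. On non-Apple platforms the helper simply reports 0.

```c
#include <stddef.h>
#include <stdio.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Return 1 if the sysctl key exists and holds a nonzero integer, else 0.
 * On non-Apple platforms this sketch always returns 0. */
int
sysctl_flag_enabled(const char *key)
{
#if defined(__APPLE__)
        int value = 0;
        size_t size = sizeof(value);

        if (sysctlbyname(key, &value, &size, NULL, 0) != 0)
                return 0; /* key absent, e.g. a feature the OS does not report */
        return value != 0;
#else
        (void) key;
        return 0;
#endif
}

/* Key names below are assumptions, not taken from the PR. */
void
print_vector_features(void)
{
        printf("FEAT_SME: %d\n", sysctl_flag_enabled("hw.optional.arm.FEAT_SME"));
        printf("FEAT_SVE: %d\n", sysctl_flag_enabled("hw.optional.arm.FEAT_SVE"));
}
```

A missing key and a key set to 0 are both reported as 0 here, which matches the observation that FEAT_SVE is simply not listed.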

This is potentially quite a breaking change, so I'd like it to be tested by those who have an M4 Mac.

Call for tester(s): if you have an M4 Mac, please try running the tests on your machine~

¹ Why does Apple call something without FEAT_SVE Armv9?

cielavenir · 2024-11-09T01:04:21Z

erasure_code/aarch64/gf_vect_mad_sve.S

 .Lloopsve_vl:
 	whilelo	p0.b, x_pos, x_len
-	b.none	.return_pass
+	b.eq	.return_pass


According to https://llvm.org/doxygen/AArch64AsmParser_8cpp_source.html, b.none is the same as b.eq, and the b.none spelling is only recognized when +sve is specified.
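As a reference for similar replacements, the sketch below maps the SVE condition aliases to the base AArch64 condition codes they encode. Only the `none` to `eq` pair is confirmed in this thread; the other rows are the alias table as I understand it from the Arm architecture documentation, so treat them as an assumption to verify. The function and struct names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct cond_alias {
        const char *sve_name;  /* alias used after an SVE predicate test */
        const char *base_name; /* base condition code with the same encoding */
};

/* Rows other than none/eq are not confirmed by the PR discussion. */
static const struct cond_alias sve_cond_aliases[] = {
        { "none", "eq" },  { "any", "ne" },   { "nlast", "hs" }, { "last", "lo" },
        { "first", "mi" }, { "nfrst", "pl" }, { "pmore", "hi" }, { "plast", "ls" },
        { "tcont", "ge" }, { "tstop", "lt" },
};

/* Return the base condition for an SVE alias, or NULL if it is not an alias. */
const char *
sve_cond_to_base(const char *sve_name)
{
        size_t n = sizeof(sve_cond_aliases) / sizeof(sve_cond_aliases[0]);

        for (size_t i = 0; i < n; i++)
                if (strcmp(sve_cond_aliases[i].sve_name, sve_name) == 0)
                        return sve_cond_aliases[i].base_name;
        return NULL;
}
```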

pablodelara · 2024-11-12T16:12:40Z

@liuqinfei could you look into this issue? Thanks again! ;)

liuqinfei · 2024-11-18T08:55:34Z

> @liuqinfei could you look into this issue? Thanks again! ;)

In fact, I don't have an Apple computer that supports SVE on hand, so I can't verify this patch. Maybe you can provide your verification results on machines with and without SVE, @cielavenir?

cielavenir · 2024-11-18T09:38:00Z

I don't have one either.

I only checked that it compiles.

So we need to call for tester(s) with an M4 Mac; otherwise we need to wait for the next GitHub runner (not image) update.

pablodelara · 2024-12-09T17:24:35Z

We are looking into releasing 2.31.1 as soon as next week, with just bug fixes. If we have someone that can test this, then we can include it in the next release.

pablodelara · 2024-12-16T12:23:12Z

Let's hold this PR for next release, once more testing is done

tipabu · 2025-04-24T19:01:52Z

I've got an M4 MacBook Air -- doesn't seem to work for me:

tburke@2025-air isa-l % git clean -fdx && ./autogen.sh && ./configure --prefix ~/.local/ && make test
...
  CPPAS    mem/aarch64/mem_zero_detect_neon.lo
  CPPAS    mem/aarch64/mem_multibinary_arm.lo
  CC       mem/aarch64/mem_aarch64_dispatcher.lo
  CCLD     libisal.la
copying selected object files to avoid basename conflicts...
  CCLD     erasure_code/gf_vect_mul_base_test
erasure_code/gf_vect_mul_base_test
gf_vect_mul_base_test:
Random tests  done: Pass
Completed run: erasure_code/gf_vect_mul_base_test
  CC       erasure_code/gf_vect_dot_prod_base_test.o
  CCLD     erasure_code/gf_vect_dot_prod_base_test
erasure_code/gf_vect_dot_prod_base_test
gf_vect_dot_prod_base: 250x8192 done all: Pass
Completed run: erasure_code/gf_vect_dot_prod_base_test
  CC       erasure_code/gf_vect_dot_prod_test.o
  CCLD     erasure_code/gf_vect_dot_prod_test
erasure_code/gf_vect_dot_prod_test
make: *** [erasure_code/gf_vect_dot_prod_test.run] Illegal instruction: 4

Oddly enough, running via lldb (to try to get a better handle on where things went wrong) doesn't trip the same error:

tburke@2025-air isa-l % ./libtool --mode=execute lldb -o run erasure_code/gf_vect_dot_prod_test
(lldb) target create "/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test"
Current executable set to '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64).
(lldb) run
gf_vect_dot_prod: 16x8192 done all: Pass
Process 7194 launched: '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64)
Process 7194 exited with status = 0 (0x00000000)

On master, make test has everything pass.

cielavenir · 2025-04-27T13:01:25Z

@tipabu thank you for testing. Maybe the current code being accepted with +sme is an assembler bug...

pablodelara · 2025-04-28T19:47:50Z

@tipabu, what about "master" branch?

tipabu · 2025-04-28T20:19:54Z

@pablodelara, on master (91da2ad add RISCV CI) all tests pass and perf suite runs fine.

pablodelara · 2025-04-29T10:29:28Z

Thanks @tipabu. So it looks like this PR is not needed...

cielavenir · 2025-04-29T12:48:24Z

@tipabu actually according to https://qiita.com/zacky1972/items/b7b5dd456fe021b30eb2, I need to wrap the function with smstart sm and smstop sm. I implemented that. If you have time could you try again?
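For readers unfamiliar with streaming mode, here is a C-level sketch of the same idea. It is not the approach taken in this PR, which wraps hand-written assembly with `smstart sm` / `smstop sm`. It uses the ACLE `__arm_locally_streaming` keyword, which asks the compiler to enter streaming mode on function entry and leave it on return. The feature-test macro `__ARM_FEATURE_LOCALLY_STREAMING` and the function `xor_into` are assumptions for illustration; where the macro is not defined, the code compiles as an ordinary function.

```c
#include <stddef.h>
#include <stdint.h>

/* Use the ACLE keyword only when the compiler advertises it; elsewhere the
 * macro expands to nothing and xor_into is a plain C function. */
#if defined(__ARM_FEATURE_LOCALLY_STREAMING)
#define LOCALLY_STREAMING __arm_locally_streaming
#else
#define LOCALLY_STREAMING
#endif

/* Hypothetical kernel: XOR src into dst, byte by byte. With the keyword
 * active, the compiler brackets the body with the streaming-mode switch. */
LOCALLY_STREAMING void
xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
        for (size_t i = 0; i < len; i++)
                dst[i] ^= src[i];
}
```

Callers see a normal function either way, which is the property the hand-written wrapper also has to provide.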

(compilation is tested in https://github.com/cielavenir/isa-l/actions/runs/14731321078)

tipabu · 2025-04-29T16:01:07Z

@cielavenir Tests now pass! And looking at someone's investigation, we shouldn't need to worry about losing sve checks for Macs; no Apple silicon supports it.

> So it looks like this PR is not needed...

@pablodelara Only insofar as Macs were always getting the neon implementation.

cielavenir · 2025-04-29T23:04:07Z

great thank you~

pablodelara · 2025-04-30T10:05:04Z

@cielavenir can you clean up the commits (so there is no "Merge branch 'master'"...)? Good opportunity to rebase against latest 'master' branch

cielavenir · 2025-04-30T12:11:02Z

@pablodelara rebased.

tipabu · 2025-04-30T16:37:22Z

So one concern: This seems to be slightly slower than master. On this branch:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062483 usecs, bandwidth 49464 MB in 3.0625 sec = 16151.90 MB/s
erasure_code_decode_warm: runtime =    3001748 usecs, bandwidth 59388 MB in 3.0017 sec = 19784.75 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Whereas on master:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3039658 usecs, bandwidth 49832 MB in 3.0397 sec = 16394.04 MB/s
erasure_code_decode_warm: runtime =    3027461 usecs, bandwidth 65886 MB in 3.0275 sec = 21763.11 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Multiple runs had similar results (±100MB/s on encode, ±200MB/s on decode, give or take).
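To put the two runs above on a common scale, a small helper for the relative change is enough. The function name is hypothetical; the inputs in the comments are the MB/s figures quoted above.

```c
/* Relative change of `value` against `baseline`, in percent.
 * Negative means `value` is slower than `baseline`. */
double
percent_change(double baseline, double value)
{
        return (value - baseline) / baseline * 100.0;
}

/* Figures from the runs above (master as baseline):
 *   encode: percent_change(16394.04, 16151.90) is about -1.48
 *   decode: percent_change(21763.11, 19784.75) is about -9.09
 */
```

Given the stated run-to-run spread, the decode gap is well outside the noise, while the encode gap is only a couple of times larger than it.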

cielavenir · 2025-05-01T15:49:42Z

@tipabu on the https://github.com/cielavenir/isa-l/tree/featSME_CI branch, I changed the code to call smstart/smstop only in the dispatched function. Calling smstart in each subroutine called by ec_encode_data_sve could add overhead.

If this is faster, I will rebase the featSME branch again.
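The structural difference between the two variants can be shown with a toy model. Nothing here touches real hardware state: a counter stands in for each executed smstart/smstop pair, and all names are hypothetical.

```c
#include <stddef.h>

/* Counts simulated streaming-mode entries (one per smstart/smstop pair). */
static unsigned long mode_switches;

static void enter_streaming(void) { mode_switches++; }
static void exit_streaming(void) { }

/* Stand-in for one per-vector subroutine. */
static void kernel(void) { }

/* Variant 1: every subroutine toggles streaming mode itself. */
unsigned long
switches_per_subroutine(size_t calls)
{
        mode_switches = 0;
        for (size_t i = 0; i < calls; i++) {
                enter_streaming();
                kernel();
                exit_streaming();
        }
        return mode_switches;
}

/* Variant 2: only the top-level (dispatched) function toggles once. */
unsigned long
switches_hoisted(size_t calls)
{
        mode_switches = 0;
        enter_streaming();
        for (size_t i = 0; i < calls; i++)
                kernel();
        exit_streaming();
        return mode_switches;
}
```

The model only counts switches; whether the real cost of a switch is large enough to matter is exactly what the benchmark has to answer.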

tipabu · 2025-05-01T16:57:35Z

@cielavenir I see roughly the same on cielavenir@aad8c5c:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062620 usecs, bandwidth 49503 MB in 3.0626 sec = 16163.74 MB/s
erasure_code_decode_warm: runtime =    3006801 usecs, bandwidth 59369 MB in 3.0068 sec = 19745.01 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

cielavenir · 2025-05-02T05:39:29Z

Now I'm not sure whether this is smstart overhead or the SME implementation is simply not faster than the NEON implementation.

pablodelara · 2025-08-29T14:33:06Z

@liuqinfei what do you think of this PR?

liuqinfei · 2025-08-30T00:56:34Z

> @liuqinfei what do you think of this PR?

I recommend re-evaluating with the ratio configurations 10 + 1 / 4 + 2 / 8 + 3. If benchmarking confirms no performance gain on the SME branch, I propose deferring the merge of this patch pending further validation.

pablodelara · 2026-01-19T12:36:35Z

> > @liuqinfei what do you think of this PR?
>
> I recommend re-evaluating with the ratio configurations 10 + 1 / 4 + 2 / 8 + 3. If benchmarking confirms no performance gain on the SME branch, I propose deferring the merge of this patch pending further validation.

@tipabu @cielavenir, could you check this? If no gain, we should close this PR then.

cielavenir · 2026-01-24T15:36:02Z

@tipabu I pushed a branch named featSME2_CI. https://github.com/cielavenir/isa-l/actions/runs/21317365685

Could you test it with ec_encode_data dispatching sme2/sme/none?
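One way to make the three cases comparable is a selection function with an explicit cap, sketched below. This is not the dispatcher in the branch: the enum, the function, and the `cap` parameter are hypothetical, and "none" is modelled as falling back to the existing non-SME path.

```c
/* Implementation tiers, ordered from least to most capable. */
enum ec_impl { EC_IMPL_BASE = 0, EC_IMPL_SME = 1, EC_IMPL_SME2 = 2 };

/* Pick the best tier the hardware supports, but never above `cap`.
 * Passing a lower cap forces a slower tier for benchmarking. */
enum ec_impl
select_ec_impl(int has_sme, int has_sme2, enum ec_impl cap)
{
        enum ec_impl best = EC_IMPL_BASE;

        if (has_sme)
                best = EC_IMPL_SME;
        if (has_sme && has_sme2) /* SME2 is only meaningful on top of SME */
                best = EC_IMPL_SME2;
        return best > cap ? cap : best;
}
```

With something like this, the same binary could be benchmarked three times by passing `EC_IMPL_SME2`, `EC_IMPL_SME`, and `EC_IMPL_BASE` as the cap.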

[edit] If sme2 and sme perform the same, #367 might have a build-target issue, so we would need to wait for it.

tipabu · 2026-01-26T20:55:31Z

Locally, compilation fails for cielavenir@5189a30, ending with:

  CC       erasure_code/aarch64/gf_nvect_dot_prod_sve.lo
fatal error: error in backend: Don't know how to legalize this scalable vector type
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 17.0.0 (clang-1700.6.3.2)
Target: arm64-apple-darwin25.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /var/folders/ql/rh2f23bn56g8809nfw_c8qdm0000gn/T/gf_nvect_dot_prod_sve-960886.c
clang: note: diagnostic msg: /var/folders/ql/rh2f23bn56g8809nfw_c8qdm0000gn/T/gf_nvect_dot_prod_sve-960886.sh
clang: note: diagnostic msg: Crash backtrace is located in
clang: note: diagnostic msg: /Users/tburke/Library/Logs/DiagnosticReports/clang_<YYYY-MM-DD-HHMMSS>_<hostname>.crash
clang: note: diagnostic msg: (choose the .crash file that corresponds to your crash)
clang: note: diagnostic msg:

********************
make[1]: *** [erasure_code/aarch64/gf_nvect_dot_prod_sve.lo] Error 1
make: *** [all] Error 2

Curious that it seems to work on GHA... maybe it comes down to CI using macos-14 and macos-15-intel, but no macos-15?

cielavenir · 2026-01-27T16:54:48Z

@tipabu

could you try my featSME2_CI26 branch now? https://github.com/cielavenir/isa-l/actions/runs/21405820788
just in case, could you also try erasure_code_perf_macOS binary from https://github.com/cielavenir/isa-l/actions/runs/21383486025 ?

tipabu · 2026-01-27T22:37:35Z

Thanks @cielavenir, compilation works now.

> Could you test it with ec_encode_data dispatching sme2/sme/none?

I'm not sure how to control that, could you give some more guidance?

At any rate, I can compare whatever the default behavior is:

| Scheme | Errors | featSME2_CI26 encode (MB/s) | featSME2_CI26 decode (MB/s) | master encode (MB/s) | master decode (MB/s) |
|---|---|---|---|---|---|
| 8 + 6 | 1 | 16541 | 49096 | 15953 | 48298 |
| 8 + 6 | 4 | 16543 | 22595 | 15824 | 21209 |
| 10 + 1 | 1 | 47784 | 48365 | 46967 | 46990 |
| 4 + 2 | 1 | 40370 | 52700 | 38310 | 53199 |
| 4 + 2 | 2 | 40367 | 40322 | 38339 | 38302 |
| 8 + 3 | 1 | 26576 | 49225 | 24800 | 47391 |
| 8 + 3 | 3 | 26620 | 26787 | 24511 | 24457 |

(Again, done as an average of ten runs.)

This looks more promising. 👍

cielavenir · 2026-01-27T23:29:56Z

thank you for checking @tipabu

@pablodelara now #367 is required for this pull request

cielavenir · 2026-01-28T03:17:01Z

@tipabu just in case could you test featSME2_CI26_noSME2 ? https://github.com/cielavenir/isa-l/actions/runs/21421947349

tipabu · 2026-01-28T17:59:58Z

Threw in jswinney/2025-10-14-sve2-pr from #367 as another data point, too.

| Scheme | Errors | master `f0c2efa` encode (MB/s) | master `f0c2efa` decode (MB/s) | jswinney/2025-10-14-sve2-pr `b01e834` encode (MB/s) | jswinney/2025-10-14-sve2-pr `b01e834` decode (MB/s) | featSME2_CI26 `b208080` encode (MB/s) | featSME2_CI26 `b208080` decode (MB/s) | featSME2_CI26_noSME2 `971725a` encode (MB/s) | featSME2_CI26_noSME2 `971725a` decode (MB/s) |
|---|---|---|---|---|---|---|---|---|---|
| 8 + 6 | 1 | 15973 | 48427 | 15979 | 48398 | 16577 | 49208 | 16574 | 49225 |
| 8 + 6 | 4 | 15956 | 21478 | 15977 | 21476 | 16590 | 22657 | 16555 | 22616 |
| 10 + 1 | 1 | 47321 | 47323 | 47320 | 47109 | 47816 | 48413 | 47805 | 48385 |
| 4 + 2 | 1 | 38642 | 53721 | 38619 | 53681 | 40344 | 52769 | 40401 | 52770 |
| 4 + 2 | 2 | 38603 | 38618 | 38646 | 38660 | 40423 | 40377 | 40393 | 40358 |
| 8 + 3 | 1 | 25306 | 48371 | 25314 | 48321 | 26628 | 49315 | 26646 | 49355 |
| 8 + 3 | 3 | 25307 | 25312 | 25293 | 25328 | 26610 | 26782 | 26638 | 26824 |

Looks pretty comparable to featSME2_CI26.

cielavenir · 2026-01-29T07:09:11Z

I now know that SME has the eor3 instruction, so the SME2 and SME versions would not differ much, but I'm keeping both for now.

I also cleaned up the featSME2_CI branch, which now uses the macos-15 runner. https://github.com/cielavenir/isa-l/actions/runs/21468869590

Cleanup on my side is now done, and #367 is a real blocker.

cielavenir · 2026-01-30T09:22:03Z

@liuqinfei I merged featSME2 into featSME. The only drawback is that the commits are tangled now.

If that is a concern: I allow edits and access to secrets by maintainers, so would you mind helping with the rebase? (But please do not mix up the commit authorship of AWSjswinney, which is what makes me hesitant to rebase on my own.)

pablodelara · 2026-01-30T09:38:34Z

Definitely, this needs rebase so only original commits are part of the PR.

cielavenir · 2026-01-30T09:58:53Z

But rebasing requires merging #367 separately

pablodelara · 2026-01-30T11:47:23Z

> But rebasing requires merging #367 separately

OK, let's wait for @liuqinfei to confirm #367 is OK to merge; once it is merged, you can rebase this.

pablodelara · 2026-01-30T17:24:11Z

You are OK to rebase against master now.

cielavenir · 2026-01-31T11:49:21Z

@pablodelara done rebasing.

( please tell me if I should make another pull request for 258fbed )

pablodelara

Can you sign off the second commit? Thanks!

pablodelara · 2026-02-02T08:52:17Z

erasure_code/aarch64/ec_aarch64_dispatcher.c

 #elif defined(__APPLE__)
-        if (sysctlEnabled(SYSCTL_SVE_KEY))
-                return gf_vect_dot_prod_sve;
+        // Due to smstart, should not dispatch SME


These lines can be eliminated completely (same below)

eliminated.

@pablodelara but if someone asks about SME here, please explain to them that the __arm_streaming ABI is incompatible with the dispatcher, because the git history for this part is lost. I cannot support this.

pablodelara · 2026-02-02T09:06:34Z

> @pablodelara done rebasing.
>
> ( please tell me if I should make another pull request for 258fbed )

If it has nothing to do with SME, then open a new PR.

pablodelara · 2026-02-09T11:35:18Z

@cielavenir could you look at the final comments? I want all optimizations/new features to be merged by the end of this week as we are releasing the library by the end of the month.

cielavenir · 2026-02-10T13:55:39Z

@pablodelara #393

cielavenir · 2026-02-11T14:58:20Z

I rebased again, but sorry, this will be the last time from my side. I'm exhausted with repeated rebasing, sorry.

cielavenir · 2026-02-11T14:59:00Z

I hope better solutions eg squash-merge can be configured so that the manual rebasing is not required.

pablodelara · 2026-02-11T16:46:04Z

@liuqinfei could you review this PR? thanks

erasure_code/aarch64/ec_aarch64_highlevel_func.c

Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>

pablodelara · 2026-02-16T17:57:28Z

Last call for this PR to be integrated in v2.32, thanks. @liuqinfei can you review it please? Thanks!

cielavenir · 2026-02-16T23:14:36Z

@liuqinfei yes please. the rebasing state is the cleanest ever.

cielavenir force-pushed the featSME branch from b4f4570 to b504f2e Compare November 9, 2024 10:22

pablodelara added bug bugfix and removed bug labels Nov 26, 2024

cielavenir marked this pull request as draft April 28, 2025 23:59

cielavenir marked this pull request as ready for review April 29, 2025 23:06

cielavenir force-pushed the featSME branch from ead56c9 to 742787c Compare April 30, 2025 12:10

cielavenir mentioned this pull request Jan 24, 2026

Optimize gf N-vect dot product SVE functions #367

Closed

cielavenir force-pushed the featSME branch from 571e90a to 452c2cd Compare January 31, 2026 11:47

pablodelara reviewed Feb 2, 2026

cielavenir force-pushed the featSME branch 2 times, most recently from 62f6b29 to 75367b8 Compare February 11, 2026 14:55

pablodelara reviewed Feb 13, 2026

Setup Apple SME dispatch

5973708

Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>

cielavenir force-pushed the featSME branch from 75367b8 to 5973708 Compare February 14, 2026 13:52
