Use +sme for Apple #303

Open
cielavenir wants to merge 1 commit into intel:master from cielavenir:featSME

Conversation

@cielavenir
Contributor

@cielavenir cielavenir commented Nov 9, 2024

Today I went to biccamera ( 😂 ) and checked hw.optional. I found FEAT_SME but not FEAT_SVE. 1

This means that on Apple, the +sve code has to be compiled with +sme instead.

This is potentially quite a breaking change, so I'd like it to be tested by those who have an M4 Mac.

Call for tester(s): if you have an M4 Mac, please try running the tests on your machine~~

Footnotes

  1. Why does Apple call something without FEAT_SVE Armv9?
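The hw.optional check described above can also be done programmatically. The following is a minimal sketch, not the PR's code: `cpu_feature_enabled` is a hypothetical helper, and the key string shown in the usage note is an assumption based on the FEAT_SME name mentioned in the comment. On non-Apple platforms the helper simply reports 0.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: returns 1 if the named sysctl key exists and is
 * nonzero, 0 otherwise. Outside macOS there is no hw.optional tree, so
 * the answer is always 0. */
static int cpu_feature_enabled(const char *key)
{
    assert(key != NULL);
#if defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);
    /* A missing key (for example, a feature the machine does not report)
     * makes sysctlbyname() fail; treat that as "not supported". */
    if (sysctlbyname(key, &value, &len, NULL, 0) != 0)
        return 0;
    return value != 0;
#else
    (void)key;
    return 0;
#endif
}
```

A caller might probe `cpu_feature_enabled("hw.optional.arm.FEAT_SME")` and fall back to the NEON path when it returns 0.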

.Lloopsve_vl:
whilelo p0.b, x_pos, x_len
b.none .return_pass
b.eq .return_pass
Contributor Author

According to https://llvm.org/doxygen/AArch64AsmParser_8cpp_source.html , b.none is the same as b.eq when +sve is specified.

@pablodelara
Contributor

@liuqinfei could you look into this issue? Thanks again! ;)

@liuqinfei
Contributor

> @liuqinfei could you look into this issue? Thanks again! ;)

In fact, I don't have an Apple computer that supports SVE on hand, so I can't verify this patch. Maybe you can supply your own verification on machines with and without SVE. @cielavenir

@cielavenir
Contributor Author

I don't have one either.

I only checked compilation.

So we need to call for tester(s) with an M4 Mac; otherwise we need to wait for the next GitHub runner (not image) update.

@pablodelara
Contributor

We are looking into releasing 2.31.1 as soon as next week, with just bug fixes. If we have someone that can test this, then we can include it in the next release.

@pablodelara
Contributor

Let's hold this PR for the next release, once more testing is done.

@tipabu
Contributor

tipabu commented Apr 24, 2025

I've got an M4 MacBook Air -- doesn't seem to work for me:

tburke@2025-air isa-l % git clean -fdx && ./autogen.sh && ./configure --prefix ~/.local/ && make test
...
  CPPAS    mem/aarch64/mem_zero_detect_neon.lo
  CPPAS    mem/aarch64/mem_multibinary_arm.lo
  CC       mem/aarch64/mem_aarch64_dispatcher.lo
  CCLD     libisal.la
copying selected object files to avoid basename conflicts...
  CCLD     erasure_code/gf_vect_mul_base_test
erasure_code/gf_vect_mul_base_test
gf_vect_mul_base_test:
Random tests  done: Pass
Completed run: erasure_code/gf_vect_mul_base_test
  CC       erasure_code/gf_vect_dot_prod_base_test.o
  CCLD     erasure_code/gf_vect_dot_prod_base_test
erasure_code/gf_vect_dot_prod_base_test
gf_vect_dot_prod_base: 250x8192 done all: Pass
Completed run: erasure_code/gf_vect_dot_prod_base_test
  CC       erasure_code/gf_vect_dot_prod_test.o
  CCLD     erasure_code/gf_vect_dot_prod_test
erasure_code/gf_vect_dot_prod_test
make: *** [erasure_code/gf_vect_dot_prod_test.run] Illegal instruction: 4

Oddly enough, running via lldb (to try to get a better handle on where things went wrong) doesn't trip the same error:

tburke@2025-air isa-l % ./libtool --mode=execute lldb -o run erasure_code/gf_vect_dot_prod_test
(lldb) target create "/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test"
Current executable set to '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64).
(lldb) run
gf_vect_dot_prod: 16x8192 done all: Pass
Process 7194 launched: '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64)
Process 7194 exited with status = 0 (0x00000000)

On master, make test has everything pass.

@cielavenir
Contributor Author

@tipabu thank you for testing. Maybe the current code being accepted with +sme is an assembler bug....

@pablodelara
Contributor

@tipabu, what about "master" branch?

@tipabu
Contributor

tipabu commented Apr 28, 2025

@pablodelara, on master (91da2ad add RISCV CI) all tests pass and perf suite runs fine.

@cielavenir cielavenir marked this pull request as draft April 28, 2025 23:59
@pablodelara
Contributor

Thanks @tipabu. So it looks like this PR is not needed...

@cielavenir
Contributor Author

@tipabu actually, according to https://qiita.com/zacky1972/items/b7b5dd456fe021b30eb2, I need to wrap the function with smstart sm and smstop sm. I implemented that. If you have time, could you try again?

(compilation is tested in https://github.com/cielavenir/isa-l/actions/runs/14731321078)

@tipabu
Contributor

tipabu commented Apr 29, 2025

@cielavenir Tests now pass! And looking at someone's investigation, we shouldn't need to worry about losing sve checks for Macs; no Apple silicon supports it.

> So it looks like this PR is not needed...

@pablodelara Only insofar as Macs were always getting the neon implementation.

@cielavenir
Contributor Author

great thank you~

@cielavenir cielavenir marked this pull request as ready for review April 29, 2025 23:06
@pablodelara
Contributor

@cielavenir can you clean up the commits (so there is no "Merge branch 'master'"...)? Good opportunity to rebase against latest 'master' branch

@cielavenir
Contributor Author

@pablodelara rebased.

@tipabu
Contributor

tipabu commented Apr 30, 2025

So one concern: This seems to be slightly slower than master. On this branch:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062483 usecs, bandwidth 49464 MB in 3.0625 sec = 16151.90 MB/s
erasure_code_decode_warm: runtime =    3001748 usecs, bandwidth 59388 MB in 3.0017 sec = 19784.75 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Whereas on master:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3039658 usecs, bandwidth 49832 MB in 3.0397 sec = 16394.04 MB/s
erasure_code_decode_warm: runtime =    3027461 usecs, bandwidth 65886 MB in 3.0275 sec = 21763.11 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Multiple runs had similar results (±100MB/s on encode, ±200MB/s on decode, give or take).
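To put the two runs above side by side, here is a small sketch that computes the relative change and checks it against the quoted run-to-run variation. The helper names are hypothetical, and treating the ± figures as a simple spread threshold is an assumption made for illustration, not a statistical test.

```c
#include <assert.h>

/* Percent change of `branch` relative to `base`; negative means slower. */
static double pct_change(double base, double branch)
{
    assert(base > 0.0);
    return (branch - base) / base * 100.0;
}

/* Returns 1 if the gap between two throughputs (MB/s) is larger than
 * the given spread (MB/s), 0 otherwise. */
static int exceeds_spread(double base, double branch, double spread)
{
    double gap = branch - base;
    if (gap < 0.0)
        gap = -gap;
    return gap > spread;
}
```

With the figures quoted above, `pct_change(16394.04, 16151.90)` is roughly -1.5% for encode and `pct_change(21763.11, 19784.75)` is roughly -9.1% for decode; both gaps are larger than the ±100 MB/s and ±200 MB/s variation reported.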

@cielavenir
Contributor Author

@tipabu on the https://github.com/cielavenir/isa-l/tree/featSME_CI branch, I changed it to call smstart/smstop only in the dispatched function. Putting smstart in each subroutine called by ec_encode_data_sve could cause an overhead issue.

If this is faster, I will rebase the featSME branch again.

@tipabu
Contributor

tipabu commented May 1, 2025

@cielavenir I see roughly the same on cielavenir@aad8c5c:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062620 usecs, bandwidth 49503 MB in 3.0626 sec = 16163.74 MB/s
erasure_code_decode_warm: runtime =    3006801 usecs, bandwidth 59369 MB in 3.0068 sec = 19745.01 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

@cielavenir
Contributor Author

Now I'm not sure whether this is smstart overhead or the SME implementation is simply not faster than the NEON implementation..

@pablodelara
Contributor

@liuqinfei what do you think of this PR?

@liuqinfei
Contributor

> @liuqinfei what do you think of this PR?

I recommend re-evaluating with the ratio configurations 10 + 1 / 4 + 2 / 8 + 3. If benchmarking confirms no performance gains in the SME branch, I propose deferring the merge of this patch pending further validation.

@pablodelara
Contributor

> @liuqinfei what do you think of this PR?
>
> I recommend re-evaluating with the ratio configurations 10 + 1 / 4 + 2 / 8 + 3. If benchmarking confirms no performance gains in the SME branch, I propose deferring the merge of this patch pending further validation.

@tipabu @cielavenir, could you check this? If no gain, we should close this PR then.

@cielavenir
Contributor Author

cielavenir commented Jan 24, 2026

@tipabu I pushed a branch named featSME2_CI. https://github.com/cielavenir/isa-l/actions/runs/21317365685

Could you test it with ec_encode_data dispatching sme2/sme/none?

[edit] if sme2/sme performances are the same, #367 might have build target issue so we need to wait for them

@tipabu
Contributor

tipabu commented Jan 26, 2026

Locally, compilation fails for cielavenir@5189a30, ending with:

  CC       erasure_code/aarch64/gf_nvect_dot_prod_sve.lo
fatal error: error in backend: Don't know how to legalize this scalable vector type
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 17.0.0 (clang-1700.6.3.2)
Target: arm64-apple-darwin25.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /var/folders/ql/rh2f23bn56g8809nfw_c8qdm0000gn/T/gf_nvect_dot_prod_sve-960886.c
clang: note: diagnostic msg: /var/folders/ql/rh2f23bn56g8809nfw_c8qdm0000gn/T/gf_nvect_dot_prod_sve-960886.sh
clang: note: diagnostic msg: Crash backtrace is located in
clang: note: diagnostic msg: /Users/tburke/Library/Logs/DiagnosticReports/clang_<YYYY-MM-DD-HHMMSS>_<hostname>.crash
clang: note: diagnostic msg: (choose the .crash file that corresponds to your crash)
clang: note: diagnostic msg:

********************
make[1]: *** [erasure_code/aarch64/gf_nvect_dot_prod_sve.lo] Error 1
make: *** [all] Error 2

Curious when it seems to work for GHA... maybe it comes down to CI using macos-14 and macos-15-intel, but no macos-15?

@cielavenir
Contributor Author

@tipabu

  1. could you try my featSME2_CI26 branch now? https://github.com/cielavenir/isa-l/actions/runs/21405820788
  2. just in case, could you also try erasure_code_perf_macOS binary from https://github.com/cielavenir/isa-l/actions/runs/21383486025 ?

@tipabu
Contributor

tipabu commented Jan 27, 2026

Thanks @cielavenir, compilation works now.

> Could you test it with ec_encode_data dispatching sme2/sme/none?

I'm not sure how to control that, could you give some more guidance?

At any rate, I can compare whatever the default behavior is:

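| Scheme | Errors | featSME2_CI26 encode (MB/s) | featSME2_CI26 decode (MB/s) | master encode (MB/s) | master decode (MB/s) |
|--------|--------|-----------------------------|-----------------------------|----------------------|----------------------|
| 8 + 6  | 1      | 16541                       | 49096                       | 15953                | 48298                |
| 8 + 6  | 4      | 16543                       | 22595                       | 15824                | 21209                |
| 10 + 1 | 1      | 47784                       | 48365                       | 46967                | 46990                |
| 4 + 2  | 1      | 40370                       | 52700                       | 38310                | 53199                |
| 4 + 2  | 2      | 40367                       | 40322                       | 38339                | 38302                |
| 8 + 3  | 1      | 26576                       | 49225                       | 24800                | 47391                |
| 8 + 3  | 3      | 26620                       | 26787                       | 24511                | 24457                |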

(Again, done as an average of ten runs.)

This looks more promising. 👍

@cielavenir
Copy link
Contributor Author

thank you for checking @tipabu

@pablodelara now #367 is required for this pull request

@cielavenir
Contributor Author

@tipabu just in case could you test featSME2_CI26_noSME2 ? https://github.com/cielavenir/isa-l/actions/runs/21421947349

@tipabu
Contributor

tipabu commented Jan 28, 2026

Threw in jswinney/2025-10-14-sve2-pr from #367 as another data point, too.

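All values are in MB/s. Branches and commits: master (f0c2efa), jswinney/2025-10-14-sve2-pr (b01e834), featSME2_CI26 (b208080), featSME2_CI26_noSME2 (971725a).

| Scheme | Errors | master encode | master decode | sve2-pr encode | sve2-pr decode | featSME2_CI26 encode | featSME2_CI26 decode | noSME2 encode | noSME2 decode |
|--------|--------|---------------|---------------|----------------|----------------|----------------------|----------------------|---------------|---------------|
| 8 + 6  | 1      | 15973         | 48427         | 15979          | 48398          | 16577                | 49208                | 16574         | 49225         |
| 8 + 6  | 4      | 15956         | 21478         | 15977          | 21476          | 16590                | 22657                | 16555         | 22616         |
| 10 + 1 | 1      | 47321         | 47323         | 47320          | 47109          | 47816                | 48413                | 47805         | 48385         |
| 4 + 2  | 1      | 38642         | 53721         | 38619          | 53681          | 40344                | 52769                | 40401         | 52770         |
| 4 + 2  | 2      | 38603         | 38618         | 38646          | 38660          | 40423                | 40377                | 40393         | 40358         |
| 8 + 3  | 1      | 25306         | 48371         | 25314          | 48321          | 26628                | 49315                | 26646         | 49355         |
| 8 + 3  | 3      | 25307         | 25312         | 25293          | 25328          | 26610                | 26782                | 26638         | 26824         |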

Looks pretty comparable to featSME2_CI26.

@cielavenir
Contributor Author

I now know that SME has the eor3 instruction, so the SME2 and SME versions would not differ much, but I'm keeping both for now.

I also cleaned up the featSME2_CI branch, which now uses the macos-15 runner. https://github.com/cielavenir/isa-l/actions/runs/21468869590

Cleanup from my side is now done, and #367 is a real blocker.

@cielavenir
Contributor Author

@liuqinfei I merged featSME2 into featSME. The only drawback is that the commits are tangled now.

If you have concerns there: since I allow edits and access to secrets by maintainers, would you mind helping with the rebase? (But please do not mix up the commit author of AWSjswinney, which is what makes me hesitant to rebase on my own.)

@pablodelara
Contributor

Definitely, this needs rebase so only original commits are part of the PR.

@cielavenir
Contributor Author

But rebasing requires merging #367 separately

@pablodelara
Contributor

> But rebasing requires merging #367 separately

OK, let's wait for @liuqinfei to confirm #367 is OK to merge; once it is merged, you can rebase this.

@pablodelara
Contributor

You are OK to rebase against master now.

@cielavenir
Contributor Author

@pablodelara done rebasing.

( please tell me if I should make another pull request for 258fbed )

Contributor

@pablodelara pablodelara left a comment

Can you sign off the second commit? Thanks!

#elif defined(__APPLE__)
if (sysctlEnabled(SYSCTL_SVE_KEY))
return gf_vect_dot_prod_sve;
// Due to smstart, should not dispatch SME
Contributor

These lines can be eliminated completely (same below)

Contributor Author

eliminated.

Contributor Author

@pablodelara but if someone asks about SME here, please explain to them that the __arm_streaming ABI is incompatible with the dispatcher, because the git history about this part is now lost. I cannot support this.
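For readers who land here without the removed lines, the dispatch decision under discussion can be sketched roughly as follows. This is an illustration only: the flag names, the `kernel_fn` type, and the stub kernels are all hypothetical and are not isa-l's actual dispatcher code. It only shows the ordering in which an SME flag is deliberately ignored for this kind of function, as the review snippet above did.

```c
#include <assert.h>

/* Hypothetical feature bits; the real dispatcher queries the OS instead. */
enum { CPU_NEON = 1u << 0, CPU_SVE = 1u << 1, CPU_SME = 1u << 2 };

/* Hypothetical kernel type; each stub just names its implementation. */
typedef const char *(*kernel_fn)(void);

static const char *kernel_base(void) { return "base"; }
static const char *kernel_neon(void) { return "neon"; }
static const char *kernel_sve(void)  { return "sve"; }

/* Pick an implementation from the detected features. */
static kernel_fn select_kernel(unsigned features)
{
    /* Reject bits this sketch does not know about. */
    assert((features & ~(unsigned)(CPU_NEON | CPU_SVE | CPU_SME)) == 0);

    if (features & CPU_SVE)
        return kernel_sve;
    /* CPU_SME is intentionally not dispatched here: per the comment
     * above, a streaming-mode kernel does not fit this dispatcher. */
    if (features & CPU_NEON)
        return kernel_neon;
    return kernel_base;
}
```

Under these assumptions, a machine reporting NEON and SME but no SVE still gets the NEON kernel from this selector.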

@pablodelara
Contributor

> @pablodelara done rebasing.
>
> ( please tell me if I should make another pull request for 258fbed )

If it has nothing to do with SME, then open a new PR.

@pablodelara
Contributor

@cielavenir could you look at the final comments? I want all optimizations/new features to be merged by the end of this week as we are releasing the library by the end of the month.

@cielavenir
Contributor Author

@pablodelara #393

@cielavenir cielavenir force-pushed the featSME branch 2 times, most recently from 62f6b29 to 75367b8 Compare February 11, 2026 14:55
@cielavenir
Contributor Author

I rebased again, but sorry, this will be the last time from my side. I'm exhausted with repeated rebasing, sorry.

@cielavenir
Contributor Author

I hope a better solution, e.g. squash-merge, can be configured so that manual rebasing is not required.

@pablodelara
Contributor

@liuqinfei could you review this PR? thanks

Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>
@pablodelara
Contributor

Last call for this PR to be integrated in v2.32, thanks. @liuqinfei can you review it please? Thanks!

@cielavenir
Contributor Author

cielavenir commented Feb 16, 2026

@liuqinfei yes please. the rebasing state is the cleanest ever.
