jhorstmann/simd-prefix-sum

Prefix Sum Calculation using SIMD Instructions

Implementations of prefix sum using SSE2, AVX2, and AVX512 intrinsics, and Rust's nightly portable SIMD API.

The algorithm is based on "Prefix Sum with SIMD", extended to use larger SIMD registers.
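As a rough illustration of the in-register technique, the following is a minimal sketch of a 128-bit SSE2 prefix sum next to a scalar reference. It is not taken from this repository: the function names, the `u32` element type, and the wrapping-add overflow behaviour are all assumptions made for the example. Each vector of four lanes is summed with two shift-and-add steps, and the running total is carried into the next vector by broadcasting the last lane.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: in-place inclusive prefix sum (wrapping on overflow).
pub fn prefix_sum_scalar(data: &mut [u32]) {
    let mut acc = 0u32;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Hypothetical SSE2 sketch, processing four u32 lanes per iteration.
#[cfg(target_arch = "x86_64")]
pub fn prefix_sum_sse2(data: &mut [u32]) {
    let mut chunks = data.chunks_exact_mut(4);
    // SSE2 is part of the x86_64 baseline, so no runtime check is needed here.
    let mut acc = unsafe {
        let mut carry = _mm_setzero_si128();
        for chunk in &mut chunks {
            let ptr = chunk.as_mut_ptr() as *mut __m128i;
            let mut v = _mm_loadu_si128(ptr as *const __m128i);
            // [a, b, c, d] + [0, a, b, c] = [a, a+b, b+c, c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<4>(v));
            // ... + [0, 0, a, a+b] = [a, a+b, a+b+c, a+b+c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<8>(v));
            // Add the running total of all previous vectors.
            v = _mm_add_epi32(v, carry);
            _mm_storeu_si128(ptr, v);
            // Broadcast the last lane as the carry for the next vector.
            carry = _mm_shuffle_epi32::<0b11_11_11_11>(v);
        }
        _mm_cvtsi128_si32(carry) as u32
    };
    // Handle the remaining (fewer than four) elements with scalar code.
    for x in chunks.into_remainder() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Fallback so the sketch also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn prefix_sum_sse2(data: &mut [u32]) {
    prefix_sum_scalar(data)
}
```

Calling `prefix_sum_sse2` on a slice should give the same result as `prefix_sum_scalar`. Wider registers follow the same idea but need additional steps to move partial sums across 128-bit lanes.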

Run the benchmarks using:

$ cargo bench

This automatically detects the supported SIMD instruction sets and runs only the applicable benchmarks.
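Runtime detection of this kind is typically done with the standard library's `is_x86_feature_detected!` macro. The sketch below shows one way to select applicable kernels; the function name `applicable_kernels`, the returned labels, and the choice of `avx512f` as the AVX512 check are assumptions for illustration, not the repository's actual benchmark code.

```rust
/// Hypothetical helper: lists the kernel families the current CPU can run.
pub fn applicable_kernels() -> Vec<&'static str> {
    // The scalar implementation is always available.
    let mut kernels = vec!["scalar"];

    // The detection macro only exists on x86 targets, so guard it with cfg.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse2") {
            kernels.push("sse2");
        }
        if is_x86_feature_detected!("avx2") {
            kernels.push("avx2");
        }
        // Assumption: AVX512 foundation support is enough for this example.
        if is_x86_feature_detected!("avx512f") {
            kernels.push("avx512");
        }
    }

    kernels
}
```

A benchmark harness could iterate over the returned list and register only those benchmarks, so the same binary runs safely on CPUs without AVX2 or AVX512.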

The portable SIMD API requires a nightly Rust toolchain and must be enabled via the `nightly` feature. For best performance with this API, compile for an explicit target CPU:

$ RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features nightly

Benchmark results on an i9-11900KB:

Using the above command:

bench_prefix_sum/prefix_sum_scalar
                        time:   [831.73 ns 833.67 ns 836.00 ns]
                        thrpt:  [36.504 GiB/s 36.606 GiB/s 36.692 GiB/s]
bench_prefix_sum/prefix_sum_sse2_128
                        time:   [506.86 ns 509.23 ns 511.85 ns]
                        thrpt:  [59.622 GiB/s 59.929 GiB/s 60.209 GiB/s]
bench_prefix_sum/prefix_sum_avx2_256
                        time:   [492.74 ns 493.71 ns 494.82 ns]
                        thrpt:  [61.674 GiB/s 61.813 GiB/s 61.934 GiB/s]
bench_prefix_sum/prefix_sum_avx512_128
                        time:   [509.67 ns 511.96 ns 514.56 ns]
                        thrpt:  [59.309 GiB/s 59.609 GiB/s 59.877 GiB/s]
bench_prefix_sum/prefix_sum_avx512_256
                        time:   [462.02 ns 462.48 ns 463.02 ns]
                        thrpt:  [65.910 GiB/s 65.986 GiB/s 66.053 GiB/s]
bench_prefix_sum/prefix_sum_avx512_512
                        time:   [303.24 ns 303.64 ns 304.04 ns]
                        thrpt:  [100.37 GiB/s 100.50 GiB/s 100.64 GiB/s]
bench_prefix_sum/prefix_sum_portable_128
                        time:   [482.13 ns 483.21 ns 484.51 ns]
                        thrpt:  [62.987 GiB/s 63.156 GiB/s 63.297 GiB/s]
bench_prefix_sum/prefix_sum_portable_512
                        time:   [302.97 ns 303.31 ns 303.71 ns]
                        thrpt:  [100.48 GiB/s 100.61 GiB/s 100.73 GiB/s]

With 512-bit registers, performance appears to be limited only by cache bandwidth.
