Implementation of prefix sum using SSE2, AVX2, and AVX512 intrinsics, as well as Rust's nightly portable SIMD API.
The algorithm is based on Prefix Sum with SIMD, extended to use larger SIMD registers.
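To illustrate the general idea, here is a minimal sketch of an in-register prefix sum with 128-bit SSE2 intrinsics. It is not the code from this repository: the function names, the `u32` element type, and the wrapping addition are assumptions made for the example. Each 4-lane vector is summed with two shift-and-add steps, and the last lane is carried over to the next vector.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: in-place inclusive prefix sum (wrapping on overflow).
pub fn prefix_sum_scalar(data: &mut [u32]) {
    let mut acc = 0u32;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Hypothetical SSE2 variant: handles four u32 lanes per iteration.
#[cfg(target_arch = "x86_64")]
pub fn prefix_sum_sse2(data: &mut [u32]) {
    // Split into a part whose length is a multiple of 4 and a scalar tail.
    let split = data.len() / 4 * 4;
    let (head, tail) = data.split_at_mut(split);

    // SSE2 is part of the x86_64 baseline, so no runtime detection is needed.
    unsafe {
        let mut carry = _mm_setzero_si128();
        for chunk in head.chunks_exact_mut(4) {
            let ptr = chunk.as_mut_ptr() as *mut __m128i;
            let mut v = _mm_loadu_si128(ptr as *const __m128i);
            // [a, b, c, d] -> [a, a+b, b+c, c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<4>(v));
            // -> [a, a+b, a+b+c, a+b+c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<8>(v));
            // Add the running total of all previous chunks.
            v = _mm_add_epi32(v, carry);
            _mm_storeu_si128(ptr, v);
            // Broadcast the highest lane as the carry for the next chunk.
            carry = _mm_shuffle_epi32::<0xFF>(v);
        }
    }

    // Finish the remaining 0..=3 elements with scalar code.
    let mut acc = head.last().copied().unwrap_or(0);
    for x in tail.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Uses the SSE2 sketch on x86_64 and the scalar version elsewhere.
pub fn prefix_sum(data: &mut [u32]) {
    #[cfg(target_arch = "x86_64")]
    {
        prefix_sum_sse2(data);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        prefix_sum_scalar(data);
    }
}
```

Wider registers follow the same pattern, but AVX2 and AVX512 byte shifts operate within 128-bit lanes, so an extra step is needed to propagate sums across lanes.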
Run the benchmarks using:
$ cargo bench
This automatically detects the supported SIMD instruction sets and runs only the applicable benchmarks.
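As a rough sketch of how such runtime detection can be done in Rust, the standard library's `is_x86_feature_detected!` macro reports which instruction sets the current CPU supports. The function name `detected_simd_levels` and the choice of `avx512f` as the AVX512 feature to check are assumptions for illustration, not necessarily what the benchmarks in this repository do.

```rust
/// Returns the instruction sets from this list that the current CPU supports.
/// "scalar" is always present as the fallback.
pub fn detected_simd_levels() -> Vec<&'static str> {
    let mut levels = vec!["scalar"];

    // The detection macro only exists on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse2") {
            levels.push("sse2");
        }
        if is_x86_feature_detected!("avx2") {
            levels.push("avx2");
        }
        // Assumption: AVX512 Foundation is the feature that matters here.
        if is_x86_feature_detected!("avx512f") {
            levels.push("avx512f");
        }
    }

    levels
}
```

A benchmark harness could iterate over the returned list and register only the matching variants.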
The portable SIMD API requires a nightly Rust version and needs to be enabled via the nightly feature.
To get the best performance with this API, compile for an explicit target-cpu:
RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features nightly
Results using the above command:
bench_prefix_sum/prefix_sum_scalar
time: [831.73 ns 833.67 ns 836.00 ns]
thrpt: [36.504 GiB/s 36.606 GiB/s 36.692 GiB/s]
bench_prefix_sum/prefix_sum_sse2_128
time: [506.86 ns 509.23 ns 511.85 ns]
thrpt: [59.622 GiB/s 59.929 GiB/s 60.209 GiB/s]
bench_prefix_sum/prefix_sum_avx2_256
time: [492.74 ns 493.71 ns 494.82 ns]
thrpt: [61.674 GiB/s 61.813 GiB/s 61.934 GiB/s]
bench_prefix_sum/prefix_sum_avx512_128
time: [509.67 ns 511.96 ns 514.56 ns]
thrpt: [59.309 GiB/s 59.609 GiB/s 59.877 GiB/s]
bench_prefix_sum/prefix_sum_avx512_256
time: [462.02 ns 462.48 ns 463.02 ns]
thrpt: [65.910 GiB/s 65.986 GiB/s 66.053 GiB/s]
bench_prefix_sum/prefix_sum_avx512_512
time: [303.24 ns 303.64 ns 304.04 ns]
thrpt: [100.37 GiB/s 100.50 GiB/s 100.64 GiB/s]
bench_prefix_sum/prefix_sum_portable_128
time: [482.13 ns 483.21 ns 484.51 ns]
thrpt: [62.987 GiB/s 63.156 GiB/s 63.297 GiB/s]
bench_prefix_sum/prefix_sum_portable_512
time: [302.97 ns 303.31 ns 303.71 ns]
thrpt: [100.48 GiB/s 100.61 GiB/s 100.73 GiB/s]
With 512-bit registers, performance seems to be limited only by cache bandwidth.
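As a sanity check on the numbers above, multiplying a reported throughput by its reported time gives the number of bytes processed per iteration. The helper below is a hypothetical name used only for this arithmetic. For both the scalar and the 512-bit results it comes out at about 32 KiB, a working set small enough that it plausibly stays cache-resident, which is consistent with the observation about cache bandwidth.

```rust
/// Bytes processed per iteration implied by a throughput in GiB/s
/// and a time per iteration in nanoseconds.
pub fn implied_bytes(gib_per_s: f64, nanos: f64) -> f64 {
    let bytes_per_second = gib_per_s * (1u64 << 30) as f64;
    bytes_per_second * nanos * 1e-9
}
```

For example, `implied_bytes(100.50, 303.64)` and `implied_bytes(36.606, 833.67)` both evaluate to roughly 32768.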