Prefix Sum Calculation using SIMD Instructions

Implementations of prefix sum using SSE2, AVX2, and AVX512 intrinsics, as well as Rust's nightly portable SIMD API.

The algorithm is based on "Prefix Sum with SIMD" but extended to use larger SIMD registers.
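As a rough illustration of the general technique, the sketch below shows a scalar prefix sum next to a 128-bit SSE2 version. It is not taken from this repository's `src`: the function names (`prefix_sum_scalar`, `prefix_sum_sse2`, `prefix_sum`), the `u32` element type, and the wrapping arithmetic are all assumptions made for the example. The SSE2 variant computes the prefix sum inside each register with two shift-and-add steps, then adds a running carry broadcast from the last lane of the previous block.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: data[i] becomes data[0] + ... + data[i] (wrapping).
fn prefix_sum_scalar(data: &mut [u32]) {
    let mut acc = 0u32;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Hypothetical SSE2 variant: four u32 lanes per step.
#[cfg(target_arch = "x86_64")]
fn prefix_sum_sse2(data: &mut [u32]) {
    let n = data.len() / 4 * 4;
    let (head, tail) = data.split_at_mut(n);
    // SAFETY: SSE2 is part of the x86_64 baseline, and every unaligned
    // load/store touches exactly one in-bounds chunk of four u32 values.
    unsafe {
        let mut carry = _mm_setzero_si128();
        for chunk in head.chunks_exact_mut(4) {
            let ptr = chunk.as_mut_ptr() as *mut __m128i;
            let mut v = _mm_loadu_si128(ptr as *const __m128i);
            // [a, b, c, d] -> [a, a+b, b+c, c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<4>(v));
            // -> [a, a+b, a+b+c, a+b+c+d]
            v = _mm_add_epi32(v, _mm_slli_si128::<8>(v));
            // Add the total of all previous blocks to every lane.
            v = _mm_add_epi32(v, carry);
            _mm_storeu_si128(ptr, v);
            // Broadcast the last lane as the carry for the next block.
            carry = _mm_shuffle_epi32::<0b11_11_11_11>(v);
        }
    }
    // Finish the remaining 0..=3 elements with the scalar loop.
    let mut acc = head.last().copied().unwrap_or(0);
    for x in tail.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// Uses the SSE2 sketch on x86_64 and the scalar loop elsewhere.
fn prefix_sum(data: &mut [u32]) {
    #[cfg(target_arch = "x86_64")]
    prefix_sum_sse2(data);
    #[cfg(not(target_arch = "x86_64"))]
    prefix_sum_scalar(data);
}
```

Calling `prefix_sum` on `[1, 2, 3, 4]` leaves `[1, 3, 6, 10]` in the slice. Wider registers follow the same pattern with more lanes per block, although shifting across 128-bit lane boundaries in AVX2 and AVX512 needs extra permute steps that this sketch does not cover.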

Run the benchmarks using:

$ cargo bench

This automatically detects the supported SIMD instruction sets and runs only the applicable benchmarks.
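Runtime detection of this kind is typically done with the standard library's `is_x86_feature_detected!` macro. The sketch below is an assumption about how such a check could look, not the repository's benchmark code: `supported_variants` is a hypothetical helper, and the exact AVX512 feature flags the real benchmarks test for are not stated in this README (`avx512f` is used here as a stand-in).

```rust
/// Hypothetical helper: lists the benchmark variants the current CPU could run.
fn supported_variants() -> Vec<&'static str> {
    // The scalar version runs everywhere.
    let mut variants = vec!["scalar"];

    // The detection macro only exists on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse2") {
            variants.push("sse2_128");
        }
        if is_x86_feature_detected!("avx2") {
            variants.push("avx2_256");
        }
        // Assumption: the real code may require further AVX512 subsets.
        if is_x86_feature_detected!("avx512f") {
            variants.push("avx512_512");
        }
    }

    variants
}
```

A benchmark harness could iterate over the returned names and register only those variants, so that unsupported instructions are never executed.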

The portable SIMD API requires a nightly Rust version and must be enabled via the `nightly` feature. For best performance with this API, compile for an explicit target CPU:

$ RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features nightly

Benchmark results on an i9-11900KB, using the above command:

bench_prefix_sum/prefix_sum_scalar
                        time:   [831.73 ns 833.67 ns 836.00 ns]
                        thrpt:  [36.504 GiB/s 36.606 GiB/s 36.692 GiB/s]
bench_prefix_sum/prefix_sum_sse2_128
                        time:   [506.86 ns 509.23 ns 511.85 ns]
                        thrpt:  [59.622 GiB/s 59.929 GiB/s 60.209 GiB/s]
bench_prefix_sum/prefix_sum_avx2_256
                        time:   [492.74 ns 493.71 ns 494.82 ns]
                        thrpt:  [61.674 GiB/s 61.813 GiB/s 61.934 GiB/s]
bench_prefix_sum/prefix_sum_avx512_128
                        time:   [509.67 ns 511.96 ns 514.56 ns]
                        thrpt:  [59.309 GiB/s 59.609 GiB/s 59.877 GiB/s]
bench_prefix_sum/prefix_sum_avx512_256
                        time:   [462.02 ns 462.48 ns 463.02 ns]
                        thrpt:  [65.910 GiB/s 65.986 GiB/s 66.053 GiB/s]
bench_prefix_sum/prefix_sum_avx512_512
                        time:   [303.24 ns 303.64 ns 304.04 ns]
                        thrpt:  [100.37 GiB/s 100.50 GiB/s 100.64 GiB/s]
bench_prefix_sum/prefix_sum_portable_128
                        time:   [482.13 ns 483.21 ns 484.51 ns]
                        thrpt:  [62.987 GiB/s 63.156 GiB/s 63.297 GiB/s]
bench_prefix_sum/prefix_sum_portable_512
                        time:   [302.97 ns 303.31 ns 303.71 ns]
                        thrpt:  [100.48 GiB/s 100.61 GiB/s 100.73 GiB/s]

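With 512-bit registers, both the AVX512 and the portable SIMD variants reach about 100 GiB/s, roughly 2.7x the scalar baseline. At that point, performance seems to be limited only by cache bandwidth.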
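The throughput figures can be cross-checked against the timings. The README does not state the input size; the numbers above are consistent with a 32 KiB buffer per iteration, which is an inference from the reported values and should be treated as an assumption. The helper name `throughput_gib_s` is likewise hypothetical.

```rust
/// Converts a per-iteration byte count and time into GiB/s.
fn throughput_gib_s(bytes: u64, nanos: f64) -> f64 {
    let bytes_per_sec = bytes as f64 / (nanos * 1e-9);
    bytes_per_sec / (1u64 << 30) as f64
}

/// Assumed buffer size (32 KiB), inferred from the reported numbers.
const ASSUMED_BYTES: u64 = 32 * 1024;
```

Under that assumption, `throughput_gib_s(ASSUMED_BYTES, 833.67)` gives about 36.6 GiB/s for the scalar version and `throughput_gib_s(ASSUMED_BYTES, 303.64)` gives about 100.5 GiB/s for `prefix_sum_avx512_512`, matching the median values reported above.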
