This repository implements single-precision adaptive FIR filters for real and complex analytic signals with SIMD acceleration (AVX2/AVX-512/NEON), optional CUDA GPU support, and OptMathKernels integration for Raspberry Pi 5.
- Algorithms: LMS, NLMS, Block LMS, Normalized Block LMS
- Signal Types: Real (`float`) and complex (`std::complex<float>`)
- FFT Acceleration: Automatic overlap-save convolution for filters > 32 taps via FFTW3
- SIMD Optimization: AVX2+FMA and AVX-512 (x86_64), ARM NEON (Raspberry Pi 4/5)
- OptMathKernels Integration: Enhanced NEON kernels via OptMathKernels
- GPU Acceleration: Optional NVIDIA CUDA support via cuFFT
- FFTW Wisdom Caching: Persistent FFT plan optimization in `~/.adapt_fftw_wisdom`
- Parameter Validation: Runtime checks on all configuration parameters
- Thread Safety: Mutex-protected FFTW plan creation for multi-threaded use
- GNU Radio Integration: Ready-to-use `sync_block` wrappers
- Comprehensive Test Suite: 22 tests covering all algorithms, types, edge cases, and numerical properties
| Platform | SIMD | GPU | OptMathKernels | Notes |
|---|---|---|---|---|
| x86_64 Linux | AVX2+FMA, AVX-512 | CUDA | - | Full optimization |
| Raspberry Pi 5 | NEON | - | Supported | Cortex-A76 tuned |
| Raspberry Pi 4 | NEON | - | Supported | Cortex-A72 tuned |
| Generic ARM64 | NEON | - | Supported | ARMv8-A baseline |
- LMS - Least Mean Squares (sample-wise weight update)
- NLMS - Normalized LMS (scale-invariant, regularized adaptation)
- Block LMS - Block-based update (weights updated every M samples)
- Normalized Block LMS - Block NLMS with per-block energy normalization
For block algorithms with filter length > 32, the implementation automatically uses an FFT overlap-save path backed by FFTW3 single precision (fftw3f). FFT sizes are chosen with small prime factors (2, 3, 5, 7) for optimal FFTW performance. For filter length <= 32, a time-domain block path is used.
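For illustration, here is a minimal sketch of that size-selection rule (the idea behind `smooth_fft.hpp`; the helper names and exact logic are assumptions, not the library's source):

```cpp
#include <cstddef>

// True if n has no prime factor larger than 7 (a "7-smooth" number).
inline bool is_7_smooth(std::size_t n) {
    if (n == 0) return false;
    for (std::size_t p : {2u, 3u, 5u, 7u})
        while (n % p == 0) n /= p;
    return n == 1;
}

// Smallest 7-smooth size >= target (hypothetical helper).
inline std::size_t next_smooth_size(std::size_t target) {
    std::size_t n = (target == 0) ? 1 : target;
    while (!is_7_smooth(n)) ++n;
    return n;
}
```

For example, `next_smooth_size(1000)` returns 1000 (2^3 * 5^3 is 7-smooth), whereas a prime like 1009 would be skipped over.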
Filtering uses the conjugate-weights convention, and the time-domain LMS/NLMS weight updates take the standard forms under that convention (spelled out below).
The FFT block update is implemented to be algebraically consistent with these conventions (including conjugations).
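In display form, with the a-priori error e[n] (the standard sample-wise definitions for this convention):

```math
\begin{aligned}
y[n] &= \sum_{k=0}^{M-1} \overline{w_k}\,x[n-k], \qquad e[n] = d[n] - y[n] \\
\text{LMS:}\quad w_k &\leftarrow w_k + \mu\,\overline{e[n]}\,x[n-k] \\
\text{NLMS:}\quad w_k &\leftarrow w_k + \frac{\mu}{\varepsilon + \sum_{j=0}^{M-1}\lvert x[n-j]\rvert^{2}}\,\overline{e[n]}\,x[n-k]
\end{aligned}
```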
Required:
- C++20 compiler (GCC 10+, Clang 12+)
- CMake 3.18+
- FFTW3 single-precision (`libfftw3-dev` or `fftw3f`)
Optional:
- FFTW3 threads (`libfftw3-dev` includes this on most systems)
- NVIDIA CUDA Toolkit 11.0+ (for GPU acceleration)
- OptMathKernels (for enhanced ARM NEON performance)
```bash
# Install dependencies
sudo apt update
sudo apt install -y build-essential cmake pkg-config libfftw3-dev

# Build with default options (AVX2 auto-detected)
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build
```

Raspberry Pi 5:

```bash
sudo apt update
sudo apt install -y build-essential cmake pkg-config libfftw3-dev
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_TARGET_PI5=ON
cmake --build build -j4
ctest --test-dir build
```

Raspberry Pi 4:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_TARGET_PI4=ON
cmake --build build -j4
```

With OptMathKernels:

```bash
# First build and install OptMathKernels
cd /path/to/OptimizedKernelsForRaspberryPi5_NvidiaCUDA
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DOPTMATH_USE_NEON=ON \
      -DCMAKE_INSTALL_PREFIX=$(pwd)/../install
make -j4 && make install

# Then build AdaptiveFiltering with OptMathKernels
cd /path/to/AdaptiveFiltering
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DADAPT_TARGET_PI5=ON \
      -DADAPT_USE_OPTMATH=ON \
      -DADAPT_OPTMATH_PATH=/path/to/OptimizedKernelsForRaspberryPi5_NvidiaCUDA/install
cmake --build build -j4
```

With CUDA:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DADAPT_USE_CUDA=ON \
      -DADAPT_CUDA_ARCH=86   # Adjust for your GPU (75=Turing, 86=Ampere, 89=Ada)
cmake --build build -j$(nproc)
```

With AVX-512:

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_USE_AVX512=ON
cmake --build build -j$(nproc)
```

| Option | Default | Description |
|---|---|---|
| `ADAPT_FILTERS_BUILD_TESTS` | `ON` | Build unit tests |
| `ADAPT_FILTERS_BUILD_EXAMPLES` | `ON` | Build example programs |
| `ADAPT_FILTERS_BUILD_BENCHMARKS` | `OFF` | Build benchmark suite |
| `ADAPT_USE_AVX2` | `ON` | Enable AVX2+FMA optimizations (x86_64) |
| `ADAPT_USE_AVX512` | `OFF` | Enable AVX-512 optimizations (x86_64) |
| `ADAPT_USE_NEON` | `ON` | Enable ARM NEON optimizations |
| `ADAPT_TARGET_PI5` | `OFF` | Optimize for Raspberry Pi 5 (Cortex-A76) |
| `ADAPT_TARGET_PI4` | `OFF` | Optimize for Raspberry Pi 4 (Cortex-A72) |
| `ADAPT_USE_CUDA` | `OFF` | Enable NVIDIA CUDA acceleration |
| `ADAPT_CUDA_ARCH` | `75` | CUDA compute capability |
| `ADAPT_USE_OPTMATH` | `OFF` | Enable OptMathKernels integration |
| `ADAPT_OPTMATH_PATH` | `""` | Path to OptMathKernels installation |
```cpp
#include <adapt/adaptive_fir.hpp>
#include <vector>

int main() {
    // Create a 64-tap NLMS filter
    adapt::Params params;
    params.mu  = 0.01f;  // Step size (must be >= 0)
    params.eps = 1e-6f;  // Regularization (must be > 0)
    adapt::AdaptiveFIR<float> filter(64, adapt::Algorithm::NLMS, params);

    // Process signals
    std::vector<float> input(1024);    // Input signal
    std::vector<float> desired(1024);  // Desired/reference signal
    std::vector<float> output(1024);   // Filter output
    // ... fill input and desired ...

    filter.process(
        adapt::Span<const float>(input.data(), input.size()),
        adapt::Span<const float>(desired.data(), desired.size()),
        adapt::Span<float>(output.data(), output.size())
    );

    // Get adapted weights
    const auto& weights = filter.weights();
    return 0;
}
```

Complex signals use the same API:

```cpp
#include <adapt/adaptive_fir.hpp>
#include <complex>
#include <vector>

using cf32 = std::complex<float>;

adapt::AdaptiveFIR<cf32> filter(128, adapt::Algorithm::BLOCK_NLMS);

std::vector<cf32> x(4096), d(4096), y(4096);
// ... fill x and d ...
filter.process(
    adapt::Span<const cf32>(x.data(), x.size()),
    adapt::Span<const cf32>(d.data(), d.size()),
    adapt::Span<cf32>(y.data(), y.size())
);
```

Runtime parameter control:

```cpp
// Parameters can be changed at runtime (validated on set)
filter.set_mu(0.005f); // Reduce step size (throws if < 0)
filter.set_eps(1e-8f); // Adjust regularization (throws if <= 0)
// Reset filter to initial state (clears weights, history, and FFT overlap)
filter.reset_state();
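// Setters validate their arguments: invalid values throw std::runtime_error.
// (Illustrative aside, not part of the original example: a negative step
// size is rejected by the setter's validation.)
try {
    filter.set_mu(-0.1f);
} catch (const std::runtime_error& e) {
    // rejected: mu must be >= 0
}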
// Set specific weights
std::vector<float> new_weights(64, 0.0f);
new_weights[0] = 1.0f;
filter.set_weights(new_weights);
```

GPU acceleration (CUDA builds only):

```cpp
#ifdef ADAPT_HAVE_CUDA
#include <adapt/cuda/adaptive_fir_cuda.hpp>
#include <iostream>

// GPU-accelerated filter (automatically falls back to CPU for small filters)
adapt::cuda::AdaptiveFIRCuda<float> gpu_filter(256, adapt::Algorithm::BLOCK_LMS);

// Check if GPU is being used
if (gpu_filter.is_using_gpu()) {
    std::cout << "Using GPU acceleration\n";
}

// Same API as CPU version
gpu_filter.process(x_span, d_span, y_span);
#endif
```

The main filter class:

```cpp
template <typename T>  // T = float or std::complex<float>
class AdaptiveFIR {
public:
    // Constructor - throws on invalid params (filter_len==0, mu<0, eps<=0, max_nfft==0)
    AdaptiveFIR(std::size_t filter_len, Algorithm alg, Params p = {});

    // Properties
    std::size_t filter_len() const;
    Algorithm algorithm() const;
    const std::vector<T>& weights() const;

    // Parameters (validated on set - throws std::runtime_error on invalid values)
    float mu() const;
    float eps() const;
    void set_mu(float mu);    // throws if mu < 0
    void set_eps(float eps);  // throws if eps <= 0

    // Weight management
    void set_weights(const std::vector<T>& w);  // throws if w.size() != filter_len
    void reset_state();  // resets weights to zero, clears history and FFT state

    // Main processing function
    // x: input signal, d: desired/reference signal, y_out: optional filter output
    // throws if x.size() != d.size() or y_out non-empty with wrong size
    void process(Span<const T> x, Span<const T> d, Span<T> y_out = {});
};
```

```cpp
enum class Algorithm {
    LMS,         // Sample-wise LMS
    NLMS,        // Sample-wise Normalized LMS
    BLOCK_LMS,   // Block LMS (time-domain for M<=32, FFT for M>32)
    BLOCK_NLMS   // Block Normalized LMS
};
```

```cpp
struct Params {
    float mu = 0.01f;              // Step size (must be >= 0)
    float eps = 1e-6f;             // Regularization (must be > 0)
    std::size_t max_nfft = 65536;  // Maximum FFT size cap (must be > 0)

    void validate() const;  // throws std::runtime_error on invalid values
};
```

The library includes hand-optimized SIMD kernels for critical inner-loop operations:
| Operation | x86_64 AVX2 | x86_64 AVX-512 | ARM NEON | Generic |
|---|---|---|---|---|
| Real dot product | 8 floats/iter | 16 floats/iter | 4 floats/iter | 1 float/iter |
| Complex multiply | 4 complex/iter | 8 complex/iter | 2 complex/iter | 1 complex/iter |
| Weight update (FMA) | 8 floats/iter | 16 floats/iter | 4 floats/iter | 1 float/iter |
| FFT freq-domain ops | Vectorized | Vectorized | Vectorized | Scalar |
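For example, the 8-floats-per-iteration AVX2 real dot product in the table could be realized roughly like this (an illustrative sketch, not the library's `kernel_avx2.hpp` source):

```cpp
#include <immintrin.h>
#include <cstddef>

// AVX2+FMA dot product: 8 floats per iteration, scalar tail for the rest.
float dot_f32_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb (fused)
    }
    // Horizontal sum of the 8 accumulator lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float result = _mm_cvtss_f32(s);
    for (; i < n; ++i) result += a[i] * b[i];  // scalar tail
    return result;
}
```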
When OptMathKernels is enabled on ARM, the NEON kernels are replaced with further-optimized implementations from that library for fmac, scale, and complex multiply-accumulate operations.
The kernel dispatcher selects the best available backend at compile time, in priority order:

1. OptMathKernels (if `ADAPT_USE_OPTMATH=ON` and ARM NEON is available)
2. AVX-512 (if `ADAPT_USE_AVX512=ON` and the CPU supports it)
3. AVX2+FMA (if `ADAPT_USE_AVX2=ON` and the CPU supports it)
4. ARM NEON (always available on AArch64)
5. Generic (portable C++ fallback)
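A minimal sketch of that priority scheme (the macro and function names here are hypothetical, shown only to convey the shape of the dispatch):

```cpp
#include <cstddef>

// Hypothetical compile-time dispatch: each branch forwards to the matching
// backend kernel; only the generic fallback is spelled out here.
inline float dot_f32(const float* a, const float* b, std::size_t n) {
#if defined(ADAPT_HAVE_OPTMATH)        // hypothetical macro names
    return optmath_dot_f32(a, b, n);
#elif defined(ADAPT_HAVE_AVX512)
    return dot_f32_avx512(a, b, n);
#elif defined(ADAPT_HAVE_AVX2)
    return dot_f32_avx2(a, b, n);
#elif defined(ADAPT_HAVE_NEON)
    return dot_f32_neon(a, b, n);
#else
    float s = 0.0f;                    // generic portable fallback
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
#endif
}
```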
| Scenario | Recommended Algorithm |
|---|---|
| Real-time, sample-by-sample | LMS or NLMS |
| Unknown signal scaling | NLMS or BLOCK_NLMS |
| Large filter (>32 taps) | BLOCK_LMS or BLOCK_NLMS |
| High throughput, batched | BLOCK_LMS or BLOCK_NLMS |
| GPU available, very large filter | CUDA variant with BLOCK_* |
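For example, using the API shown in Quick Start (the filter names here are illustrative):

```cpp
#include <adapt/adaptive_fir.hpp>

// Large batched workload: Block NLMS, which takes the FFT path for M > 32
adapt::AdaptiveFIR<float> echo_canceller(256, adapt::Algorithm::BLOCK_NLMS);

// Low-latency, sample-by-sample adaptation with unknown input scaling: NLMS
adapt::AdaptiveFIR<float> tracker(16, adapt::Algorithm::NLMS);
```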
```
AdaptiveFiltering/
├── include/adapt/
│ ├── adaptive_fir.hpp # Main adaptive filter class
│ ├── span.hpp # Minimal span utility
│ ├── traits.hpp # Type traits (is_complex, scalar_type, conj_if_needed, abs2)
│ ├── smooth_fft.hpp # FFT size selection (2,3,5,7-smooth numbers)
│ ├── fftw_wrap.hpp # FFTW3 RAII wrapper with wisdom caching
│ ├── kernels/
│ │ ├── kernels.hpp # Unified kernel dispatcher
│ │ ├── kernel_config.hpp # CPU feature detection (CPUID / NEON)
│ │ ├── kernel_generic.hpp # Portable C++ fallback kernels
│ │ ├── kernel_avx2.hpp # AVX2+FMA optimized kernels
│ │ ├── kernel_avx512.hpp # AVX-512 optimized kernels
│ │ ├── kernel_neon.hpp # ARM NEON optimized kernels
│ │ └── kernel_optmath.hpp # OptMathKernels bridge (enhanced NEON)
│ └── cuda/
│ ├── cuda_fft.hpp # cuFFT wrapper
│ └── adaptive_fir_cuda.hpp # GPU-accelerated filter
├── src/
│ ├── adaptive_fir.cpp # Template instantiation for linkage
│ ├── fftw_wrap.cpp # FFTW implementation with mutex + wisdom
│ └── cuda/ # CUDA kernel implementations
├── tests/ # 22 unit tests (see Testing section)
├── examples/
│ ├── example_system_identification.cpp
│ └── example_noise_canceller.cpp
├── gnuradio_wrappers/ # GNU Radio sync_block wrappers
├── cmake/
│ └── adapt_config.hpp.in # Generated configuration header
└── CMakeLists.txt
```
GNU Radio-compatible `sync_block` wrappers are included:

- `adaptive_fir_ff` - Float input/output
- `adaptive_fir_cc` - Complex input/output

Located in `gnuradio_wrappers/`, these are designed to be dropped into a GNU Radio OOT module and linked against `adapt_filters`.
- Two input ports: signal (`x`) and reference (`d`)
- Selectable output: filtered signal (`y`) or error (`e`)
- Runtime parameter updates via setters
- Algorithm switching while preserving weights
- FFTW plan creation/destruction: Protected by mutex (thread-safe)
- FFT execution: Thread-safe (same plan can be used from multiple threads with different data)
- Filter instances: NOT thread-safe (use separate instances per thread)
- FFTW wisdom: Loaded once at startup, saved on plan creation
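A sketch of the per-thread-instance pattern this implies (illustrative):

```cpp
#include <adapt/adaptive_fir.hpp>
#include <thread>
#include <vector>

void process_channel(int channel) {
    // Filter instances are not thread-safe: give each thread its own.
    // Concurrent construction is fine, since FFTW plan creation is
    // mutex-protected inside the library.
    adapt::AdaptiveFIR<float> filter(64, adapt::Algorithm::NLMS);
    (void)channel;
    // ... feed this channel's samples through filter.process(...) ...
}

int main() {
    std::vector<std::thread> pool;
    for (int c = 0; c < 4; ++c) pool.emplace_back(process_channel, c);
    for (auto& t : pool) t.join();
}
```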
Two example programs are included:
- `example_system_identification` - Identifies an unknown 96-tap system using Block NLMS with FFT acceleration
- `example_noise_canceller` - Complex-valued adaptive noise cancellation with a 48-tap NLMS filter
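In condensed form, the system-identification flow looks like this (an illustrative sketch, not the shipped example's source):

```cpp
#include <adapt/adaptive_fir.hpp>
#include <cstddef>
#include <random>
#include <vector>

int main() {
    const std::size_t M = 96, N = 20000;
    std::mt19937 rng(42);
    std::normal_distribution<float> gauss(0.0f, 1.0f);

    // Unknown 96-tap system to identify
    std::vector<float> h(M);
    for (auto& v : h) v = gauss(rng);

    // Excite it with white noise: d = h * x
    std::vector<float> x(N), d(N, 0.0f);
    for (auto& v : x) v = gauss(rng);
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t k = 0; k < M && k <= n; ++k)
            d[n] += h[k] * x[n - k];

    // Block NLMS with M > 32 uses the FFT overlap-save path
    adapt::AdaptiveFIR<float> filter(M, adapt::Algorithm::BLOCK_NLMS);
    filter.process(adapt::Span<const float>(x.data(), x.size()),
                   adapt::Span<const float>(d.data(), d.size()));

    // filter.weights() now approximates h
}
```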
Build and run:
```bash
./build/example_system_identification
./build/example_noise_canceller
```

The test suite contains 22 tests organized into categories:
| Test | Description |
|---|---|
| `test_lms_system_id` | LMS convergence on system identification |
| `test_nlms_scale_invariance` | NLMS robustness to signal scaling |
| `test_block_fft_matches_direct` | FFT overlap-save matches time-domain |
| `test_block_nlms_complex_converges` | Complex Block NLMS convergence |
| `test_lms_complex` | Complex LMS: system ID, channel tracking, output accuracy |
| Test | Description |
|---|---|
| `test_all_algorithms_float` | All 4 algorithms x multiple filter lengths (float) with NMSE thresholds |
| `test_all_algorithms_complex` | All 4 algorithms x multiple filter lengths (complex) with NMSE thresholds |
| `test_large_filters` | M=256, M=512 filters (FFT path stress test) |
| Test | Description |
|---|---|
| `test_param_validation` | 12 checks: invalid mu, eps, sizes, empty input |
| `test_edge_cases` | M=1, M=2, single sample, M=32/33 threshold boundary |
| `test_numerical_stability` | Large/small amplitude, zero input, impulse, alternating signals |
| `test_convergence_rate` | mu comparison, NLMS vs LMS, monotonic improvement |
| Test | Description |
|---|---|
| `test_weight_management` | 10 checks: init, get/set, FFT sync, reset, frozen weights |
| `test_streaming_consistency` | Chunked vs monolithic processing equivalence |
| `test_fft_path_correctness` | Frozen convolution accuracy at M=64/100/128, partial blocks |
| `test_noise_cancellation` | Float ANC, complex ANC, echo cancellation with ERLE measurement |
| Test | Description |
|---|---|
| `test_simd_kernels` | All SIMD kernel ops (dot, fmac, mul, conj_mul) |
| `test_kernel_consistency` | Optimized kernels vs generic reference across 27 sizes |
| `test_smooth_fft` | FFT size selection: smoothness, minimality, constraints |
| `test_fftw_wrap` | FFTW roundtrip, Parseval's theorem, DFT accuracy, move semantics |
| `test_span` | Span utility: constructors, access, const, zero-size |
| `test_traits` | Type traits: is_complex, scalar_type, conj_if_needed, abs2 |
```bash
# Run all tests
ctest --test-dir build

# Run with verbose output
ctest --test-dir build -V

# Run specific test
./build/test_noise_cancellation
```

All 22 tests pass on Raspberry Pi 5 (Cortex-A76, ARM NEON, OptMathKernels enabled):
```
1/22 test_lms_system_id .................. Passed 0.00 sec
2/22 test_nlms_scale_invariance .......... Passed 0.00 sec
3/22 test_block_fft_matches_direct ....... Passed 0.00 sec
4/22 test_block_nlms_complex_converges ... Passed 0.02 sec
5/22 test_simd_kernels ................... Passed 0.00 sec
6/22 test_param_validation ............... Passed 0.00 sec
7/22 test_edge_cases ..................... Passed 0.01 sec
8/22 test_all_algorithms_float ........... Passed 0.06 sec
9/22 test_all_algorithms_complex ......... Passed 0.10 sec
10/22 test_large_filters .................. Passed 0.22 sec
11/22 test_weight_management .............. Passed 0.00 sec
12/22 test_streaming_consistency .......... Passed 0.00 sec
13/22 test_numerical_stability ............ Passed 0.01 sec
14/22 test_convergence_rate ............... Passed 0.01 sec
15/22 test_fft_path_correctness ........... Passed 0.01 sec
16/22 test_noise_cancellation ............. Passed 0.03 sec
17/22 test_smooth_fft ..................... Passed 0.00 sec
18/22 test_fftw_wrap ...................... Passed 0.01 sec
19/22 test_kernel_consistency ............. Passed 0.01 sec
20/22 test_lms_complex .................... Passed 0.01 sec
21/22 test_span ........................... Passed 0.00 sec
22/22 test_traits ......................... Passed 0.00 sec
100% tests passed, 0 tests failed out of 22
Total Test time (real) = 0.54 sec
```
<details>
<summary>Detailed test output</summary>

```
test_simd_kernels:
dot_product_f32: PASS
dot_product_cf32: PASS
sum_squares_f32: PASS
sum_norm_cf32: PASS
fmac_f32: PASS
mul_cf32: PASS
conj_mul_cf32: PASS
test_param_validation:
filter_len=0: PASS
negative mu: PASS
eps=0: PASS
negative eps: PASS
max_nfft=0: PASS
set_mu negative: PASS
set_eps zero: PASS
set_weights mismatch: PASS
x/d size mismatch: PASS
y_out size mismatch: PASS
mu=0 valid: PASS
empty input: PASS
test_edge_cases:
M=1 LMS: PASS
M=1 complex NLMS: PASS
M=2 LMS: PASS
single sample processing: PASS
M=33 Block NLMS FFT path: PASS
M=32 Block LMS time domain: PASS
test_all_algorithms_float:
LMS M=8: PASS (NMSE=0.0000 < 0.15)
LMS M=16: PASS (NMSE=0.0000 < 0.20)
NLMS M=8: PASS (NMSE=0.0000 < 0.10)
NLMS M=16: PASS (NMSE=0.0001 < 0.15)
NLMS M=32: PASS (NMSE=0.0001 < 0.20)
BLOCK_LMS M=16: PASS (NMSE=0.0001 < 0.25)
BLOCK_LMS M=32: PASS (NMSE=0.0001 < 0.30)
BLOCK_NLMS M=16: PASS (NMSE=0.0000 < 0.20)
BLOCK_NLMS M=32: PASS (NMSE=0.0006 < 0.25)
BLOCK_LMS M=64: PASS (NMSE=0.0000 < 0.35)
BLOCK_NLMS M=64: PASS (NMSE=0.0001 < 0.30)
BLOCK_NLMS M=128: PASS (NMSE=0.0006 < 0.35)
test_all_algorithms_complex:
LMS M=8: PASS (NMSE=0.0000 < 0.20)
LMS M=16: PASS (NMSE=0.0000 < 0.25)
NLMS M=8: PASS (NMSE=0.0002 < 0.15)
NLMS M=16: PASS (NMSE=0.0001 < 0.20)
NLMS M=32: PASS (NMSE=0.0001 < 0.25)
BLOCK_LMS M=16: PASS (NMSE=0.0001 < 0.30)
BLOCK_LMS M=64: PASS (NMSE=0.0000 < 0.35)
BLOCK_NLMS M=16: PASS (NMSE=0.0000 < 0.25)
BLOCK_NLMS M=64: PASS (NMSE=0.0002 < 0.30)
BLOCK_NLMS M=96: PASS (NMSE=0.0001 < 0.35)
test_large_filters:
M=256 float BLOCK_NLMS: PASS (NMSE=0.0006)
M=256 complex BLOCK_NLMS: PASS (NMSE=0.0001)
M=512 float BLOCK_NLMS: PASS (NMSE=0.0001)
test_weight_management:
initial weights zero: PASS
set/get weights roundtrip: PASS
set_weights FFT path: PASS
reset_state clears weights: PASS
frozen weights (mu=0): PASS
complex frozen weights: PASS
complex set_weights FFT roundtrip: PASS
filter_len accessor: PASS
algorithm accessor: PASS
mu/eps accessors: PASS
test_streaming_consistency:
LMS float chunk=1: PASS
LMS float chunk=7: PASS
LMS float chunk=50: PASS
NLMS float chunk=1: PASS
NLMS float chunk=13: PASS
LMS complex chunk=1: PASS
NLMS complex chunk=11: PASS
BLOCK_NLMS float chunk=100 vs 500: PASS
test_numerical_stability:
large amplitude NLMS: PASS
small amplitude NLMS: PASS
zero input signal: PASS
impulse response: PASS
complex large amplitude: PASS
alternating amplitude: PASS
block FFT large amplitude complex: PASS
test_convergence_rate:
NLMS higher mu converges faster: (mu=0.05: 0.000005, mu=0.5: 0.000129) PASS
NLMS converges faster than LMS: PASS (LMS: 0.0000, NLMS: 0.0001)
monotonic improvement: PASS
more data better convergence: PASS (N=200: 0.7409, N=10000: 0.0000)
test_fft_path_correctness:
float M=64 FFT frozen convolution: PASS (max_err=0.000000)
complex M=64 FFT frozen convolution: PASS (max_err=0.000000)
float M=128 FFT frozen convolution: PASS (max_err=0.000000)
float M=100 FFT frozen convolution: PASS (max_err=0.000000)
partial block (N < Nfft): PASS (max_err=0.000000)
multiple partial blocks: PASS (max_err=0.000000)
test_noise_cancellation:
float ANC (correlated noise removal): PASS (SNR: -7.9 -> 4.8)
complex ANC: PASS (steady-state SNR: 6.8 dB)
echo cancellation scenario: PASS (ERLE: 59.1 dB)
test_kernel_consistency:
dot_product_f32 vs generic: PASS
sum_squares_f32 vs generic: PASS
fmac_f32 vs generic: PASS
dot_product_cf32 vs generic: PASS
sum_norm_cf32 vs generic: PASS
mul_cf32 vs generic: PASS
conj_mul_cf32 vs generic: PASS
scale_inplace_cf32 vs generic: PASS
fmac_cf32 vs generic: PASS
template dispatchers: PASS
CPU feature detection: PASS
test_lms_complex:
complex LMS sysid M=16: PASS (NMSE=0.0000)
complex NLMS tracking: PASS (NMSE=0.0003 tracking h2)
complex LMS output accuracy: PASS
test_fftw_wrap:
forward-inverse roundtrip: PASS
Parseval's theorem: PASS
DC signal DFT: PASS
single frequency DFT: PASS
various FFT sizes: PASS
non-power-of-2 size: PASS
move semantics: PASS
test_smooth_fft:
basic selection: PASS
powers of 2: PASS
non-smooth roundup: PASS
all results smooth: PASS
result is minimal: PASS
max_n constraint: PASS
large numbers: PASS
target=0: PASS
test_span:
default constructor: PASS
pointer+size constructor: PASS
element access: PASS
const span: PASS
span from vector: PASS
zero-size span: PASS
test_traits:
is_complex: PASS
scalar_type: PASS
conj_if_needed: PASS
abs2: PASS
```

</details>
MIT License - see LICENSE for details.
Contributions are welcome! Please ensure:
- All tests pass (`ctest --test-dir build`)
- New features include appropriate tests
- Code follows existing style conventions
- FFTW3 for high-performance FFT
- OptMathKernels for optimized ARM NEON kernels
- NVIDIA for cuFFT and CUDA toolkit