n4hy/AdaptiveFiltering
Adaptive FIR Filters (LMS, NLMS, Block LMS, Normalized Block LMS)

This repository implements single-precision adaptive FIR filters for real and complex analytic signals with SIMD acceleration (AVX2/AVX-512/NEON), optional CUDA GPU support, and OptMathKernels integration for Raspberry Pi 5.

Features

  • Algorithms: LMS, NLMS, Block LMS, Normalized Block LMS
  • Signal Types: Real (float) and complex (std::complex<float>)
  • FFT Acceleration: Automatic overlap-save convolution for filters > 32 taps via FFTW3
  • SIMD Optimization: AVX2+FMA and AVX-512 (x86_64), ARM NEON (Raspberry Pi 4/5)
  • OptMathKernels Integration: Enhanced NEON kernels via OptMathKernels
  • GPU Acceleration: Optional NVIDIA CUDA support via cuFFT
  • FFTW Wisdom Caching: Persistent FFT plan optimization in ~/.adapt_fftw_wisdom
  • Parameter Validation: Runtime checks on all configuration parameters
  • Thread Safety: Mutex-protected FFTW plan creation for multi-threaded use
  • GNU Radio Integration: Ready-to-use sync_block wrappers
  • Comprehensive Test Suite: 22 tests covering all algorithms, types, edge cases, and numerical properties

Supported Platforms

| Platform | SIMD | GPU | OptMathKernels | Notes |
|---|---|---|---|---|
| x86_64 Linux | AVX2+FMA, AVX-512 | CUDA | - | Full optimization |
| Raspberry Pi 5 | NEON | - | Supported | Cortex-A76 tuned |
| Raspberry Pi 4 | NEON | - | Supported | Cortex-A72 tuned |
| Generic ARM64 | NEON | - | Supported | ARMv8-A baseline |

Algorithms

  • LMS - Least Mean Squares (sample-wise weight update)
  • NLMS - Normalized LMS (scale-invariant, regularized adaptation)
  • Block LMS - Block-based update (weights updated every M samples)
  • Normalized Block LMS - Block NLMS with per-block energy normalization

For block algorithms with filter length > 32, the implementation automatically uses an FFT overlap-save path backed by FFTW3 single precision (fftw3f). FFT sizes are chosen with small prime factors (2, 3, 5, 7) for optimal FFTW performance. For filter length <= 32, a time-domain block path is used.

Complex Convention

Filtering uses the conjugate-weights convention:

$$y[n] = \sum_{k=0}^{M-1} \overline{w[k]} \cdot x[n-k]$$

Time-domain LMS/NLMS update:

$$w[k] \leftarrow w[k] + \mu \cdot \frac{\overline{e[n]} \cdot x[n-k]}{|x|^2 + \varepsilon}$$

The FFT block update is algebraically consistent with these conventions, including all conjugations.

Installation

Dependencies

Required:

  • C++20 compiler (GCC 10+, Clang 12+)
  • CMake 3.18+
  • FFTW3 single-precision (libfftw3-dev or fftw3f)

Optional:

  • FFTW3 threads (libfftw3-dev includes this on most systems)
  • NVIDIA CUDA Toolkit 11.0+ (for GPU acceleration)
  • OptMathKernels (for enhanced ARM NEON performance)

Build (Linux - x86_64)

# Install dependencies
sudo apt update
sudo apt install -y build-essential cmake pkg-config libfftw3-dev

# Build with default options (AVX2 auto-detected)
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build

Build (Raspberry Pi 5)

sudo apt update
sudo apt install -y build-essential cmake pkg-config libfftw3-dev

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_TARGET_PI5=ON
cmake --build build -j4
ctest --test-dir build

Build (Raspberry Pi 4)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_TARGET_PI4=ON
cmake --build build -j4

Build with OptMathKernels (ARM)

# First build and install OptMathKernels
cd /path/to/OptimizedKernelsForRaspberryPi5_NvidiaCUDA
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DOPTMATH_USE_NEON=ON \
      -DCMAKE_INSTALL_PREFIX=$(pwd)/../install
make -j4 && make install

# Then build AdaptiveFiltering with OptMathKernels
cd /path/to/AdaptiveFiltering
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DADAPT_TARGET_PI5=ON \
      -DADAPT_USE_OPTMATH=ON \
      -DADAPT_OPTMATH_PATH=/path/to/OptimizedKernelsForRaspberryPi5_NvidiaCUDA/install
cmake --build build -j4

Build with CUDA (x86_64 + NVIDIA GPU)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DADAPT_USE_CUDA=ON \
      -DADAPT_CUDA_ARCH=86  # Adjust for your GPU (75=Turing, 86=Ampere, 89=Ada)
cmake --build build -j$(nproc)

Build with AVX-512 (x86_64)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DADAPT_USE_AVX512=ON
cmake --build build -j$(nproc)

CMake Options

| Option | Default | Description |
|---|---|---|
| ADAPT_FILTERS_BUILD_TESTS | ON | Build unit tests |
| ADAPT_FILTERS_BUILD_EXAMPLES | ON | Build example programs |
| ADAPT_FILTERS_BUILD_BENCHMARKS | OFF | Build benchmark suite |
| ADAPT_USE_AVX2 | ON | Enable AVX2+FMA optimizations (x86_64) |
| ADAPT_USE_AVX512 | OFF | Enable AVX-512 optimizations (x86_64) |
| ADAPT_USE_NEON | ON | Enable ARM NEON optimizations |
| ADAPT_TARGET_PI5 | OFF | Optimize for Raspberry Pi 5 (Cortex-A76) |
| ADAPT_TARGET_PI4 | OFF | Optimize for Raspberry Pi 4 (Cortex-A72) |
| ADAPT_USE_CUDA | OFF | Enable NVIDIA CUDA acceleration |
| ADAPT_CUDA_ARCH | 75 | CUDA compute capability |
| ADAPT_USE_OPTMATH | OFF | Enable OptMathKernels integration |
| ADAPT_OPTMATH_PATH | "" | Path to OptMathKernels installation |

Usage

Basic Example

#include <adapt/adaptive_fir.hpp>
#include <vector>

int main() {
    // Create a 64-tap NLMS filter
    adapt::Params params;
    params.mu = 0.01f;   // Step size (must be >= 0)
    params.eps = 1e-6f;  // Regularization (must be > 0)

    adapt::AdaptiveFIR<float> filter(64, adapt::Algorithm::NLMS, params);

    // Process signals
    std::vector<float> input(1024);    // Input signal
    std::vector<float> desired(1024);  // Desired/reference signal
    std::vector<float> output(1024);   // Filter output

    // ... fill input and desired ...

    filter.process(
        adapt::Span<const float>(input.data(), input.size()),
        adapt::Span<const float>(desired.data(), desired.size()),
        adapt::Span<float>(output.data(), output.size())
    );

    // Get adapted weights
    const auto& weights = filter.weights();

    return 0;
}

Complex Signals

#include <adapt/adaptive_fir.hpp>
#include <complex>

using cf32 = std::complex<float>;

adapt::AdaptiveFIR<cf32> filter(128, adapt::Algorithm::BLOCK_NLMS);

std::vector<cf32> x(4096), d(4096), y(4096);
// ... fill x and d ...

filter.process(
    adapt::Span<const cf32>(x.data(), x.size()),
    adapt::Span<const cf32>(d.data(), d.size()),
    adapt::Span<cf32>(y.data(), y.size())
);

Runtime Parameter Updates

// Parameters can be changed at runtime (validated on set)
filter.set_mu(0.005f);   // Reduce step size (throws if < 0)
filter.set_eps(1e-8f);   // Adjust regularization (throws if <= 0)

// Reset filter to initial state (clears weights, history, and FFT overlap)
filter.reset_state();

// Set specific weights
std::vector<float> new_weights(64, 0.0f);
new_weights[0] = 1.0f;
filter.set_weights(new_weights);

CUDA-Accelerated Filter

#ifdef ADAPT_HAVE_CUDA
#include <adapt/cuda/adaptive_fir_cuda.hpp>

// GPU-accelerated filter (automatically falls back to CPU for small filters)
adapt::cuda::AdaptiveFIRCuda<float> gpu_filter(256, adapt::Algorithm::BLOCK_LMS);

// Check if GPU is being used
if (gpu_filter.is_using_gpu()) {
    std::cout << "Using GPU acceleration\n";
}

// Same API as CPU version
gpu_filter.process(x_span, d_span, y_span);
#endif

API Reference

AdaptiveFIR<T> Class

template <typename T>  // T = float or std::complex<float>
class AdaptiveFIR {
public:
    // Constructor - throws on invalid params (filter_len==0, mu<0, eps<=0, max_nfft==0)
    AdaptiveFIR(std::size_t filter_len, Algorithm alg, Params p = {});

    // Properties
    std::size_t filter_len() const;
    Algorithm algorithm() const;
    const std::vector<T>& weights() const;

    // Parameters (validated on set - throws std::runtime_error on invalid values)
    float mu() const;
    float eps() const;
    void set_mu(float mu);    // throws if mu < 0
    void set_eps(float eps);  // throws if eps <= 0

    // Weight management
    void set_weights(const std::vector<T>& w);  // throws if w.size() != filter_len
    void reset_state();  // resets weights to zero, clears history and FFT state

    // Main processing function
    // x: input signal, d: desired/reference signal, y_out: optional filter output
    // throws if x.size() != d.size() or y_out non-empty with wrong size
    void process(Span<const T> x, Span<const T> d, Span<T> y_out = {});
};

Algorithm Enum

enum class Algorithm {
    LMS,        // Sample-wise LMS
    NLMS,       // Sample-wise Normalized LMS
    BLOCK_LMS,  // Block LMS (time-domain for M<=32, FFT for M>32)
    BLOCK_NLMS  // Block Normalized LMS
};

Params Struct

struct Params {
    float mu = 0.01f;              // Step size (must be >= 0)
    float eps = 1e-6f;             // Regularization (must be > 0)
    std::size_t max_nfft = 65536;  // Maximum FFT size cap (must be > 0)

    void validate() const;  // throws std::runtime_error on invalid values
};

Performance

SIMD Acceleration

The library includes hand-optimized SIMD kernels for critical inner-loop operations:

| Operation | x86_64 AVX2 | x86_64 AVX-512 | ARM NEON | Generic |
|---|---|---|---|---|
| Real dot product | 8 floats/iter | 16 floats/iter | 4 floats/iter | 1 float/iter |
| Complex multiply | 4 complex/iter | 8 complex/iter | 2 complex/iter | 1 complex/iter |
| Weight update (FMA) | 8 floats/iter | 16 floats/iter | 4 floats/iter | 1 float/iter |
| FFT freq-domain ops | Vectorized | Vectorized | Vectorized | Scalar |

When OptMathKernels is enabled on ARM, the NEON kernels are replaced with further-optimized implementations from that library for fmac, scale, and complex multiply-accumulate operations.

Kernel Dispatch Priority

The kernel dispatcher selects the best available backend at compile time:

  1. OptMathKernels (if ADAPT_USE_OPTMATH=ON and ARM NEON available)
  2. AVX-512 (if ADAPT_USE_AVX512=ON and CPU supports it)
  3. AVX2+FMA (if ADAPT_USE_AVX2=ON and CPU supports it)
  4. ARM NEON (always available on AArch64)
  5. Generic (portable C++ fallback)

When to Use Each Algorithm

| Scenario | Recommended Algorithm |
|---|---|
| Real-time, sample-by-sample | LMS or NLMS |
| Unknown signal scaling | NLMS or BLOCK_NLMS |
| Large filter (>32 taps) | BLOCK_LMS or BLOCK_NLMS |
| High throughput, batched | BLOCK_LMS or BLOCK_NLMS |
| GPU available, very large filter | CUDA variant with BLOCK_* |

Project Structure

AdaptiveFiltering/
├── include/adapt/
│   ├── adaptive_fir.hpp        # Main adaptive filter class
│   ├── span.hpp                # Minimal span utility
│   ├── traits.hpp              # Type traits (is_complex, scalar_type, conj_if_needed, abs2)
│   ├── smooth_fft.hpp          # FFT size selection (2,3,5,7-smooth numbers)
│   ├── fftw_wrap.hpp           # FFTW3 RAII wrapper with wisdom caching
│   ├── kernels/
│   │   ├── kernels.hpp         # Unified kernel dispatcher
│   │   ├── kernel_config.hpp   # CPU feature detection (CPUID / NEON)
│   │   ├── kernel_generic.hpp  # Portable C++ fallback kernels
│   │   ├── kernel_avx2.hpp     # AVX2+FMA optimized kernels
│   │   ├── kernel_avx512.hpp   # AVX-512 optimized kernels
│   │   ├── kernel_neon.hpp     # ARM NEON optimized kernels
│   │   └── kernel_optmath.hpp  # OptMathKernels bridge (enhanced NEON)
│   └── cuda/
│       ├── cuda_fft.hpp        # cuFFT wrapper
│       └── adaptive_fir_cuda.hpp  # GPU-accelerated filter
├── src/
│   ├── adaptive_fir.cpp        # Template instantiation for linkage
│   ├── fftw_wrap.cpp           # FFTW implementation with mutex + wisdom
│   └── cuda/                   # CUDA kernel implementations
├── tests/                      # 22 unit tests (see Testing section)
├── examples/
│   ├── example_system_identification.cpp
│   └── example_noise_canceller.cpp
├── gnuradio_wrappers/          # GNU Radio sync_block wrappers
├── cmake/
│   └── adapt_config.hpp.in     # Generated configuration header
└── CMakeLists.txt

GNU Radio Integration

GNU Radio-compatible sync_block wrappers are included:

  • adaptive_fir_ff - Float input/output
  • adaptive_fir_cc - Complex input/output

Located in gnuradio_wrappers/. These are designed to be dropped into a GNU Radio OOT module and linked against adapt_filters.

Features:

  • Two input ports: signal (x) and reference (d)
  • Selectable output: filtered signal (y) or error (e)
  • Runtime parameter updates via setters
  • Algorithm switching while preserving weights

Thread Safety

  • FFTW plan creation/destruction: Protected by mutex (thread-safe)
  • FFT execution: Thread-safe (same plan can be used from multiple threads with different data)
  • Filter instances: NOT thread-safe (use separate instances per thread)
  • FFTW wisdom: Loaded once at startup, saved on plan creation

Examples

Two example programs are included:

  1. example_system_identification - Identifies an unknown 96-tap system using Block NLMS with FFT acceleration
  2. example_noise_canceller - Complex-valued adaptive noise cancellation with a 48-tap NLMS filter

Build and run:

./build/example_system_identification
./build/example_noise_canceller

Testing

Test Suite Overview

The test suite contains 22 tests organized into categories:

Core Algorithm Tests (5)

| Test | Description |
|---|---|
| test_lms_system_id | LMS convergence on system identification |
| test_nlms_scale_invariance | NLMS robustness to signal scaling |
| test_block_fft_matches_direct | FFT overlap-save matches time-domain |
| test_block_nlms_complex_converges | Complex Block NLMS convergence |
| test_lms_complex | Complex LMS: system ID, channel tracking, output accuracy |

Algorithm Coverage Tests (3)

| Test | Description |
|---|---|
| test_all_algorithms_float | All 4 algorithms × multiple filter lengths (float) with NMSE thresholds |
| test_all_algorithms_complex | All 4 algorithms × multiple filter lengths (complex) with NMSE thresholds |
| test_large_filters | M=256, M=512 filters (FFT path stress test) |

Robustness Tests (4)

| Test | Description |
|---|---|
| test_param_validation | 12 checks: invalid mu, eps, sizes, empty input |
| test_edge_cases | M=1, M=2, single sample, M=32/33 threshold boundary |
| test_numerical_stability | Large/small amplitude, zero input, impulse, alternating signals |
| test_convergence_rate | mu comparison, NLMS vs LMS, monotonic improvement |

Functional Tests (4)

| Test | Description |
|---|---|
| test_weight_management | 10 checks: init, get/set, FFT sync, reset, frozen weights |
| test_streaming_consistency | Chunked vs monolithic processing equivalence |
| test_fft_path_correctness | Frozen convolution accuracy at M=64/100/128, partial blocks |
| test_noise_cancellation | Float ANC, complex ANC, echo cancellation with ERLE measurement |

Infrastructure Tests (6)

| Test | Description |
|---|---|
| test_simd_kernels | All SIMD kernel ops (dot, fmac, mul, conj_mul) |
| test_kernel_consistency | Optimized kernels vs generic reference across 27 sizes |
| test_smooth_fft | FFT size selection: smoothness, minimality, constraints |
| test_fftw_wrap | FFTW roundtrip, Parseval's theorem, DFT accuracy, move semantics |
| test_span | Span utility: constructors, access, const, zero-size |
| test_traits | Type traits: is_complex, scalar_type, conj_if_needed, abs2 |

Running Tests

# Run all tests
ctest --test-dir build

# Run with verbose output
ctest --test-dir build -V

# Run specific test
./build/test_noise_cancellation

Test Results

All 22 tests pass on Raspberry Pi 5 (Cortex-A76, ARM NEON, OptMathKernels enabled):

 1/22 test_lms_system_id ..................   Passed    0.00 sec
 2/22 test_nlms_scale_invariance ..........   Passed    0.00 sec
 3/22 test_block_fft_matches_direct .......   Passed    0.00 sec
 4/22 test_block_nlms_complex_converges ...   Passed    0.02 sec
 5/22 test_simd_kernels ...................   Passed    0.00 sec
 6/22 test_param_validation ...............   Passed    0.00 sec
 7/22 test_edge_cases .....................   Passed    0.01 sec
 8/22 test_all_algorithms_float ...........   Passed    0.06 sec
 9/22 test_all_algorithms_complex .........   Passed    0.10 sec
10/22 test_large_filters ..................   Passed    0.22 sec
11/22 test_weight_management ..............   Passed    0.00 sec
12/22 test_streaming_consistency ..........   Passed    0.00 sec
13/22 test_numerical_stability ............   Passed    0.01 sec
14/22 test_convergence_rate ...............   Passed    0.01 sec
15/22 test_fft_path_correctness ...........   Passed    0.01 sec
16/22 test_noise_cancellation .............   Passed    0.03 sec
17/22 test_smooth_fft .....................   Passed    0.00 sec
18/22 test_fftw_wrap ......................   Passed    0.01 sec
19/22 test_kernel_consistency .............   Passed    0.01 sec
20/22 test_lms_complex ....................   Passed    0.01 sec
21/22 test_span ...........................   Passed    0.00 sec
22/22 test_traits .........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 22
Total Test time (real) =   0.54 sec
Detailed test output:
test_simd_kernels:
  dot_product_f32: PASS
  dot_product_cf32: PASS
  sum_squares_f32: PASS
  sum_norm_cf32: PASS
  fmac_f32: PASS
  mul_cf32: PASS
  conj_mul_cf32: PASS

test_param_validation:
  filter_len=0: PASS
  negative mu: PASS
  eps=0: PASS
  negative eps: PASS
  max_nfft=0: PASS
  set_mu negative: PASS
  set_eps zero: PASS
  set_weights mismatch: PASS
  x/d size mismatch: PASS
  y_out size mismatch: PASS
  mu=0 valid: PASS
  empty input: PASS

test_edge_cases:
  M=1 LMS: PASS
  M=1 complex NLMS: PASS
  M=2 LMS: PASS
  single sample processing: PASS
  M=33 Block NLMS FFT path: PASS
  M=32 Block LMS time domain: PASS

test_all_algorithms_float:
  LMS M=8: PASS (NMSE=0.0000 < 0.15)
  LMS M=16: PASS (NMSE=0.0000 < 0.20)
  NLMS M=8: PASS (NMSE=0.0000 < 0.10)
  NLMS M=16: PASS (NMSE=0.0001 < 0.15)
  NLMS M=32: PASS (NMSE=0.0001 < 0.20)
  BLOCK_LMS M=16: PASS (NMSE=0.0001 < 0.25)
  BLOCK_LMS M=32: PASS (NMSE=0.0001 < 0.30)
  BLOCK_NLMS M=16: PASS (NMSE=0.0000 < 0.20)
  BLOCK_NLMS M=32: PASS (NMSE=0.0006 < 0.25)
  BLOCK_LMS M=64: PASS (NMSE=0.0000 < 0.35)
  BLOCK_NLMS M=64: PASS (NMSE=0.0001 < 0.30)
  BLOCK_NLMS M=128: PASS (NMSE=0.0006 < 0.35)

test_all_algorithms_complex:
  LMS M=8: PASS (NMSE=0.0000 < 0.20)
  LMS M=16: PASS (NMSE=0.0000 < 0.25)
  NLMS M=8: PASS (NMSE=0.0002 < 0.15)
  NLMS M=16: PASS (NMSE=0.0001 < 0.20)
  NLMS M=32: PASS (NMSE=0.0001 < 0.25)
  BLOCK_LMS M=16: PASS (NMSE=0.0001 < 0.30)
  BLOCK_LMS M=64: PASS (NMSE=0.0000 < 0.35)
  BLOCK_NLMS M=16: PASS (NMSE=0.0000 < 0.25)
  BLOCK_NLMS M=64: PASS (NMSE=0.0002 < 0.30)
  BLOCK_NLMS M=96: PASS (NMSE=0.0001 < 0.35)

test_large_filters:
  M=256 float BLOCK_NLMS: PASS (NMSE=0.0006)
  M=256 complex BLOCK_NLMS: PASS (NMSE=0.0001)
  M=512 float BLOCK_NLMS: PASS (NMSE=0.0001)

test_weight_management:
  initial weights zero: PASS
  set/get weights roundtrip: PASS
  set_weights FFT path: PASS
  reset_state clears weights: PASS
  frozen weights (mu=0): PASS
  complex frozen weights: PASS
  complex set_weights FFT roundtrip: PASS
  filter_len accessor: PASS
  algorithm accessor: PASS
  mu/eps accessors: PASS

test_streaming_consistency:
  LMS float chunk=1: PASS
  LMS float chunk=7: PASS
  LMS float chunk=50: PASS
  NLMS float chunk=1: PASS
  NLMS float chunk=13: PASS
  LMS complex chunk=1: PASS
  NLMS complex chunk=11: PASS
  BLOCK_NLMS float chunk=100 vs 500: PASS

test_numerical_stability:
  large amplitude NLMS: PASS
  small amplitude NLMS: PASS
  zero input signal: PASS
  impulse response: PASS
  complex large amplitude: PASS
  alternating amplitude: PASS
  block FFT large amplitude complex: PASS

test_convergence_rate:
  NLMS higher mu converges faster: (mu=0.05: 0.000005, mu=0.5: 0.000129) PASS
  NLMS converges faster than LMS: PASS (LMS: 0.0000, NLMS: 0.0001)
  monotonic improvement: PASS
  more data better convergence: PASS (N=200: 0.7409, N=10000: 0.0000)

test_fft_path_correctness:
  float M=64 FFT frozen convolution: PASS (max_err=0.000000)
  complex M=64 FFT frozen convolution: PASS (max_err=0.000000)
  float M=128 FFT frozen convolution: PASS (max_err=0.000000)
  float M=100 FFT frozen convolution: PASS (max_err=0.000000)
  partial block (N < Nfft): PASS (max_err=0.000000)
  multiple partial blocks: PASS (max_err=0.000000)

test_noise_cancellation:
  float ANC (correlated noise removal): PASS (SNR: -7.9 -> 4.8)
  complex ANC: PASS (steady-state SNR: 6.8 dB)
  echo cancellation scenario: PASS (ERLE: 59.1 dB)

test_kernel_consistency:
  dot_product_f32 vs generic: PASS
  sum_squares_f32 vs generic: PASS
  fmac_f32 vs generic: PASS
  dot_product_cf32 vs generic: PASS
  sum_norm_cf32 vs generic: PASS
  mul_cf32 vs generic: PASS
  conj_mul_cf32 vs generic: PASS
  scale_inplace_cf32 vs generic: PASS
  fmac_cf32 vs generic: PASS
  template dispatchers: PASS
  CPU feature detection: PASS

test_lms_complex:
  complex LMS sysid M=16: PASS (NMSE=0.0000)
  complex NLMS tracking: PASS (NMSE=0.0003 tracking h2)
  complex LMS output accuracy: PASS

test_fftw_wrap:
  forward-inverse roundtrip: PASS
  Parseval's theorem: PASS
  DC signal DFT: PASS
  single frequency DFT: PASS
  various FFT sizes: PASS
  non-power-of-2 size: PASS
  move semantics: PASS

test_smooth_fft:
  basic selection: PASS
  powers of 2: PASS
  non-smooth roundup: PASS
  all results smooth: PASS
  result is minimal: PASS
  max_n constraint: PASS
  large numbers: PASS
  target=0: PASS

test_span:
  default constructor: PASS
  pointer+size constructor: PASS
  element access: PASS
  const span: PASS
  span from vector: PASS
  zero-size span: PASS

test_traits:
  is_complex: PASS
  scalar_type: PASS
  conj_if_needed: PASS
  abs2: PASS

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please ensure:

  1. All tests pass (ctest --test-dir build)
  2. New features include appropriate tests
  3. Code follows existing style conventions

Acknowledgments

  • FFTW3 for high-performance FFT
  • OptMathKernels for optimized ARM NEON kernels
  • NVIDIA for cuFFT and CUDA toolkit
