Description
In PhastFT, for smaller sizes I'm calling dispatch! three times per FFT operation on 512 bytes of data (a 64-element batch of f64), and it degrades performance by 25% (-20% throughput), measured as of commit https://github.com/QuState/PhastFT/tree/e5fcd61f3d540fcef9f8d60173dbfbe777c02e40
Meanwhile RustFFT with its handwritten dispatch does not suffer any penalty at all, and in fact is slightly slower under -C target-cpu=native than it is under its regular dynamic dispatch.
This overhead needs to be removed for code based on fearless_simd to be competitive with handwritten dynamic dispatch.
perf diff and profiling with samply both point to these dispatch! calls as a major source of slowdown: https://github.com/QuState/PhastFT/blob/c7ea3d7aef474e53233834354364fa50bbb0ba6e/src/algorithms/dit.rs#L259-L260
Profile with -C target-cpu=x86-64-v3: https://share.firefox.dev/3LMqjuI
Profile with dynamic dispatch: https://share.firefox.dev/3NTwJZw
I'm not sure what the cause is. I wouldn't expect a handful of perfectly predictable branches to tank performance. Perhaps dispatch! results in suboptimal codegen, or perhaps I'm just pushing the boundaries of dynamic dispatch and need a facility to get a function pointer and store it in a struct for reuse, instead of just reusing a cached Level.
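To make the function-pointer idea concrete, here is a minimal sketch of what I mean. It uses only the standard library, not fearless_simd, so it says nothing about how dispatch! works internally. The names `Plan`, `Kernel`, `scale_scalar` and `scale_avx2` are hypothetical, and the kernel body is a placeholder rather than an FFT stage:

```rust
// Hypothetical kernel signature: one in-place pass over a batch of f64.
// `unsafe` because the AVX2 variant may only run on CPUs that support AVX2.
type Kernel = unsafe fn(&mut [f64]);

// Portable fallback (placeholder body, not a real FFT stage).
fn scale_scalar(data: &mut [f64]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

// Same body compiled with AVX2 enabled, so the compiler may auto-vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(data: &mut [f64]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

// Hypothetical plan object: CPU detection happens once in `new`,
// and the chosen function pointer is reused on every call.
struct Plan {
    kernel: Kernel,
}

impl Plan {
    fn new() -> Self {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                return Plan { kernel: scale_avx2 };
            }
        }
        Plan { kernel: scale_scalar }
    }

    fn run(&self, data: &mut [f64]) {
        // SAFETY: `kernel` is either the portable fallback or a variant
        // whose required CPU feature was detected in `new`.
        unsafe { (self.kernel)(data) }
    }
}
```

With something like this, the hot path is a single indirect call per `plan.run(&mut batch)`, with no feature check or level match on each invocation. Whether that would actually recover the lost 20% here is an assumption I haven't measured.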