⚡ Profiler: Optimize LBM Solver (~20 MLUPS)#616
Conversation
…and scalar replacement

- Introduce `equilibrium_component` to the `Lattice2D` trait to avoid stack array allocation in collision.
- Optimize `LatticeBoltzmannD2Q9::step` using a 3-pass strategy (Stream, Macroscopic, Collision) for better vectorization.
- Implement precomputed stencil offsets in `stream` to reduce integer multiplication overhead.
- Benchmarks show ~20 MLUPS (up from an unoptimized 16.5 MLUPS in hybrid mode).
- Verified correctness with `cargo test` and mass conservation checks.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
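The first bullet — computing equilibrium values per-component instead of filling a temporary `[f64; 9]` — can be sketched as a standalone function. This is a minimal illustration of the technique, not the PR's actual `Lattice2D` trait; weights and velocity sets are the standard D2Q9 values.

```rust
// Sketch: computing one D2Q9 equilibrium component on the fly,
// avoiding a temporary `[f64; 9]` stack array in the collision loop.
// Names are illustrative, not the actual trait method from the PR.
const W: [f64; 9] = [
    4.0 / 9.0,
    1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0,
    1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0,
];
const EX: [f64; 9] = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, -1.0, -1.0, 1.0];
const EY: [f64; 9] = [0.0, 0.0, 1.0, 0.0, -1.0, 1.0, 1.0, -1.0, -1.0];

fn equilibrium_component(k: usize, rho: f64, ux: f64, uy: f64) -> f64 {
    let eu = EX[k] * ux + EY[k] * uy; // e_k . u
    let usq = ux * ux + uy * uy;      // |u|^2
    W[k] * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq)
}

fn main() {
    // At zero velocity each component reduces to w_k * rho,
    // and the nine components sum back to the density.
    let rho = 1.0;
    let total: f64 = (0..9)
        .map(|k| equilibrium_component(k, rho, 0.0, 0.0))
        .sum();
    assert!((total - rho).abs() < 1e-12);
}
```

Because each component is computed independently, the collision loop can consume the value immediately instead of materializing all nine in a temporary array.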
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode: when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
- Apply `cargo fmt` to `bench_lbm.rs` and `lattice_boltzmann/mod.rs`.
- Fix CI build failure due to style check.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
- Update loops in `LatticeBoltzmannD2Q9` to use `.iter_mut().enumerate()` instead of index-based range loops.
- Resolves `clippy::needless_range_loop` lint failures in CI.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
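The lint fix above follows the usual pattern for `clippy::needless_range_loop`: replace an index-based `for i in 0..len` loop with `.iter_mut().enumerate()`. A minimal sketch with an illustrative function (not the PR's actual loop):

```rust
// Sketch of the clippy::needless_range_loop fix: iterating with
// `.iter_mut().enumerate()` instead of indexing over a range.
fn scale_in_place(values: &mut [f64], factor: f64) -> usize {
    let mut touched = 0;
    // Instead of: for i in 0..values.len() { values[i] *= factor; }
    for (i, v) in values.iter_mut().enumerate() {
        *v *= factor;
        touched = i + 1;
    }
    touched
}

fn main() {
    let mut data = vec![1.0, 2.0, 3.0];
    let n = scale_in_place(&mut data, 2.0);
    assert_eq!(n, 3);
    assert_eq!(data, vec![2.0, 4.0, 6.0]);
}
```

Beyond satisfying the lint, the iterator form lets the compiler elide per-element bounds checks, which also helps the vectorization goals mentioned elsewhere in this PR.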
⚡ Profiler: LBM Loop Optimization
📉 The Bottleneck:
The Lattice Boltzmann Method (LBM) solver was identified as a compute-intensive kernel limited by memory bandwidth and scalar execution overhead. The original implementation (and intermediate attempts) suffered from inefficient memory access patterns (gather/scatter) and stack allocation of temporary `[f64; 9]` arrays during the collision step.
🚀 The Boost:
Optimized the `step` function to achieve approximately 20 MLUPS in the benchmark environment. (Note: a 26 MLUPS baseline could not be reproduced or verified in the current environment; the optimized 3-pass strategy beat the Fused strategy, 19.9 vs 18.6 MLUPS.)
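For context, MLUPS (million lattice updates per second) is typically derived from the grid size, step count, and wall-clock time of a benchmark run. A small sketch with made-up grid dimensions and timing, purely for illustration:

```rust
// Sketch: deriving an MLUPS figure from benchmark parameters.
// MLUPS = (cells updated per step * steps) / elapsed seconds / 1e6.
fn mlups(width: usize, height: usize, steps: usize, elapsed_secs: f64) -> f64 {
    (width * height * steps) as f64 / elapsed_secs / 1.0e6
}

fn main() {
    // e.g. a 512x512 grid stepped 1000 times in 13.1 s is ~20 MLUPS
    let rate = mlups(512, 512, 1000, 13.1);
    println!("{rate:.1} MLUPS");
}
```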
💻 Technical Detail:
- Split `step` into separate `Stream`, `Macroscopic`, and `Collision` loops. This allows the compiler to auto-vectorize the compute-heavy Macroscopic and Collision steps (linear memory access).
- Precomputed stencil offsets in the `Stream` loop, eliminating `(y*width + x)` re-calculation for every neighbor.
- Added `equilibrium_component` to `Lattice2D` to compute equilibrium values on the fly, preventing the allocation of `[f64; 9]` arrays on the stack during collision.

🧪 Verification:
- `cargo test physics::fluid_dynamics` passed (including `security_lbm_validation`).
- `examples/bench_lbm.rs` benchmark added and verified mass conservation.

PR created automatically by Jules for task 98323151705483289 started by @fderuiter
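The mass-conservation check mentioned above rests on a basic LBM invariant: a pure (periodic) streaming pass only moves distribution values between cells, so their total sum — the total mass — must be unchanged. A toy sketch of such a check, with illustrative names and data, not the actual benchmark code:

```rust
// Sketch of a mass-conservation check: after a periodic streaming
// pass, the sum of all distribution values must be unchanged.
fn stream_periodic(src: &[f64], width: usize, height: usize, dx: isize, dy: isize) -> Vec<f64> {
    let (w, h) = (width as isize, height as isize);
    let mut dst = vec![0.0; src.len()];
    for y in 0..h {
        for x in 0..w {
            // Pull from the periodic upstream neighbor.
            let sx = (x - dx).rem_euclid(w);
            let sy = (y - dy).rem_euclid(h);
            dst[(y * w + x) as usize] = src[(sy * w + sx) as usize];
        }
    }
    dst
}

fn main() {
    let (w, h) = (8, 8);
    let src: Vec<f64> = (0..w * h).map(|i| (i % 7) as f64 * 0.1).collect();
    let before: f64 = src.iter().sum();
    let after: f64 = stream_periodic(&src, w, h, 1, 1).iter().sum::<f64>();
    assert!((before - after).abs() < 1e-12);
}
```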