CUDA: Optimize GDN PP perf #20449
Closed
ORippler wants to merge 12 commits into ggml-org:gg/gdn-fix-state-transpose from
Conversation
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access

Also add a --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple the fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent a state-layout mismatch between the transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
We can try #20448 first since that's a much simpler change? Need to make sure that it doesn't regress on older hardware or HIP.
Collaborator
Author
Yeah, I was mainly looking for this kind of feedback to determine whether it's worth pushing this one further (coalescing writes + code cleanup), given also that chunked GDN should outperform AR GDN. Will take a look at #20448 first.
Force-pushed from 7ea6ee4 to fe3ef4a
Collaborator
Author
Closed in favor of #20488
This PR optimizes GDN PP perf by minimizing data-access calls via unrolling over n_tokens.
Perf for c86d89a (i.e. no distinction between tail and tail-less calls on the kernel side)
Perf for 33ee71b (separate kernels for the tail and tail-less versions)
Not sure if the changes are worth the complexity. Further gains would come from coalescing writes to attn_data (currently they are not coalesced), or potentially from doing wider loads/stores (though this would require investigating the alignment of the incoming pointers).