CUDA: optimize GDN by hiding global memory loads #20448
Closed
am17an wants to merge 6 commits into ggml-org:gg/gdn-fix-state-transpose from
Conversation
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously:

- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access

Also add a --fused-gdn [on|off|auto] CLI flag (mirroring --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple the fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent a state-layout mismatch between the transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator
Author
Ah, this is actually slower on other cards; need to investigate.
Force-pushed from 5501b39 to deb2943
Member
Bad merge.
Collaborator
Author
5090, ~8% improvement
4090:
3090:
ORippler approved these changes on Mar 13, 2026
Collaborator
ORippler left a comment
Explicitly prefetching k and q improves perf on Blackwell (BW) due to better scoreboarding of LDGs, which are high-latency instructions (500 -> 330 us = 1.5x), while it slightly reduces perf on pre-BW architectures due to worse scoreboarding of register moves, which are low-latency instructions (350 -> 360 us = 2% slowdown). The gain outweighs the loss in my eyes, but if needed one can ifdef via __CUDA_ARCH__. Not sure how it influences AMD perf; @IMbackK may run some tests. Approving from a CUDA perspective.
Collaborator
Force-pushed from 7ea6ee4 to fe3ef4a



Optimize GDN by staging global memory loads across two buffers to hide latency
TODO
on a 5090