ggml : transpose fused GDN state access for coalesced memory reads #20437

Closed

arkavo-com wants to merge 2 commits into ggml-org:master
Conversation
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path.

Transpose the state indexing so threads read contiguously:

- Metal: `s_ptr[is*S_v]` -> `s_ptr[is]` (stride 1 vs S_v)
- CUDA: `curr_state[i*S_v+col]` -> `curr_state[col*S_v+i]` (coalesced)
- CPU: restructured loops for row-wise transposed access

Also add a `--fused-gdn [on|off|auto]` CLI flag (mirrors `--flash-attn`) so users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with `ggml_vec_dot_f32` for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple `fused_gdn_ar` and `fused_gdn_ch` flags in auto-detection: if one path lacks device support, disable both to prevent a state layout mismatch between the transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member

Thanks, I'll continue this in #20443
Summary

Fixes #20436

- Add `--fused-gdn [on|off|auto]` CLI flag (mirrors `--flash-attn`) to control fused GDN independently
- Use `ggml_vec_dot_f32` in the CPU kernel for SIMD-optimized dot products

Root cause
The fused GDN kernel's [S_v, S_v] state matrix was stored row-major but accessed column-wise. A 32-thread SIMD group hit 32 different cache lines but used only 4 bytes of each (~3% utilization). At S_v = 128 (Qwen3.5-9B), 32 heads x a 128x128 state (2 MB/layer x 24 recurrent layers) spills L2 cache, amplifying the penalty.

Fix
Store the state matrix transposed: M[col][row] = S[row][col]. Reading "column col of S" becomes reading "row col of M", which is contiguous. Since S is square and the fused kernel consistently reads and writes in this format, mathematical equivalence is maintained.

Benchmark results
Apple M4 Max, Metal, Qwen3.5 Q4_K_M (prompt: 4 tokens, generation: 16 tokens)
The 9B regression (previously -39% with fused GDN) is fully resolved. Fused GDN is now faster than the unfused path across all model sizes.
Test plan

- `test-backend-ops test -o GATED_DELTA_NET`: 13/13 pass on Metal
- `--fused-gdn off` disables the fused path

Generated with Claude Code