ggml : transpose fused GDN state access for coalesced memory reads by arkavo-com · Pull Request #20437 · ggml-org/llama.cpp

arkavo-com · 2026-03-12T01:29:20Z

Summary

Fixes #20436

Transpose state indexing in the fused GDN kernel across Metal, CUDA, and CPU so threads read contiguously instead of column-strided
Add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) to control fused GDN independently
Use ggml_vec_dot_f32 in CPU kernel for SIMD-optimized dot products
Couple AR/chunked fused flags in auto-detection to prevent state layout mismatch

Root cause

The fused GDN kernel's [S_v, S_v] state matrix was stored row-major but accessed column-wise. A 32-thread SIMD group hit 32 different cache lines but used only 4 bytes each (~3% utilization). At S_v=128 (Qwen3.5-9B), 32 heads x 128x128 state (2MB/layer x 24 recurrent layers) spills L2 cache, amplifying the penalty.

Fix

Store the state matrix transposed: M[col][row] = S[row][col]. Reading "column col of S" becomes "row col of M" = contiguous. Since S is square and the fused kernel consistently reads/writes in this format, mathematical equivalence is maintained.

Benchmark results

Apple M4 Max, Metal, Qwen3.5 Q4_K_M (prompt: 4 tokens, generation: 16 tokens)

Model	Mode	Prompt (t/s)	Generation (t/s)	Prompt Delta	Gen Delta
0.8B	Fused OFF	402.5	170.7	--	--
0.8B	Fused ON (fix)	416.5	233.1	+3.5%	+36.5%
9B	Fused OFF	109.3	53.6	--	--
9B	Fused ON (fix)	140.3	65.5	+28.4%	+22.2%
27B	Fused OFF	32.9	14.7	--	--
27B	Fused ON (fix)	40.3	16.7	+22.5%	+13.6%

The 9B regression (previously -39% with fused GDN) is fully resolved. Fused GDN is now faster than the unfused path across all model sizes.

Test plan

test-backend-ops test -o GATED_DELTA_NET -- 13/13 pass on Metal
Benchmark Qwen3.5-9B to confirm regression is resolved
Benchmark Qwen3.5-0.8B and 27B to confirm no regressions
Verify --fused-gdn off disables fused path
Test on CUDA backend

Generated with Claude Code

…gml-org#20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously: - Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v) - CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced) - CPU: restructured loops for row-wise transposed access Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ed flags - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output) - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ggerganov · 2026-03-12T06:56:15Z

Thanks, I'll continue this in #20443

github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 12, 2026

ggerganov mentioned this pull request Mar 12, 2026

graph : remove redundant GDN state transposes #20443

Merged

1 task

ggerganov closed this Mar 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml : transpose fused GDN state access for coalesced memory reads#20437

ggml : transpose fused GDN state access for coalesced memory reads#20437
arkavo-com wants to merge 2 commits intoggml-org:masterfrom
arkavo-ai:fix-gdn-state-access-pattern

arkavo-com commented Mar 12, 2026 •

edited

Loading

Uh oh!

ggerganov commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

arkavo-com commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Benchmark results

Test plan

Uh oh!

ggerganov commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arkavo-com commented Mar 12, 2026 •

edited

Loading