Skip to content

ggml : transpose fused GDN state access for coalesced memory reads#20437

Closed
arkavo-com wants to merge 2 commits intoggml-org:masterfrom
arkavo-ai:fix-gdn-state-access-pattern
Closed

ggml : transpose fused GDN state access for coalesced memory reads#20437
arkavo-com wants to merge 2 commits intoggml-org:masterfrom
arkavo-ai:fix-gdn-state-access-pattern

Conversation

@arkavo-com
Copy link
Contributor

@arkavo-com arkavo-com commented Mar 12, 2026

Summary

Fixes #20436

  • Transpose state indexing in the fused GDN kernel across Metal, CUDA, and CPU so threads read contiguously instead of column-strided
  • Add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) to control fused GDN independently
  • Use ggml_vec_dot_f32 in CPU kernel for SIMD-optimized dot products
  • Couple AR/chunked fused flags in auto-detection to prevent state layout mismatch

Root cause

The fused GDN kernel's [S_v, S_v] state matrix was stored row-major but accessed column-wise. A 32-thread SIMD group hit 32 different cache lines but used only 4 bytes each (~3% utilization). At S_v=128 (Qwen3.5-9B), 32 heads x 128x128 state (2MB/layer x 24 recurrent layers) spills L2 cache, amplifying the penalty.

Fix

Store the state matrix transposed: M[col][row] = S[row][col]. Reading "column col of S" becomes "row col of M" = contiguous. Since S is square and the fused kernel consistently reads/writes in this format, mathematical equivalence is maintained.

Benchmark results

Apple M4 Max, Metal, Qwen3.5 Q4_K_M (prompt: 4 tokens, generation: 16 tokens)

Model Mode Prompt (t/s) Generation (t/s) Prompt Delta Gen Delta
0.8B Fused OFF 402.5 170.7 -- --
0.8B Fused ON (fix) 416.5 233.1 +3.5% +36.5%
9B Fused OFF 109.3 53.6 -- --
9B Fused ON (fix) 140.3 65.5 +28.4% +22.2%
27B Fused OFF 32.9 14.7 -- --
27B Fused ON (fix) 40.3 16.7 +22.5% +13.6%

The 9B regression (previously -39% with fused GDN) is fully resolved. Fused GDN is now faster than the unfused path across all model sizes.

Test plan

  • test-backend-ops test -o GATED_DELTA_NET -- 13/13 pass on Metal
  • Benchmark Qwen3.5-9B to confirm regression is resolved
  • Benchmark Qwen3.5-0.8B and 27B to confirm no regressions
  • Verify --fused-gdn off disables fused path
  • Test on CUDA backend

Generated with Claude Code

…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 12, 2026
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ggerganov
Copy link
Member

Thanks, I'll continue this in #20443

@ggerganov ggerganov closed this Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apple Metal https://en.wikipedia.org/wiki/Metal_(API) ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fused GDN kernel: column-wise state access causes cache thrashing on Metal/CUDA/CPU

2 participants