
graph : remove redundant GDN state transposes #20443

Open
ggerganov wants to merge 4 commits into master from gg/gdn-fix-state-transpose

Conversation

ggerganov (Member) commented Mar 12, 2026

cont #20437
fix #20436

As correctly noted in #20436, there is no need to transpose the recurrent state in the GDN computation. Simplify the ggml graph for the unfused path and optimize the fused kernels using coalesced reads/writes.

TODOs:

arkavo-com and others added 4 commits March 12, 2026 08:27
…20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ggerganov ggerganov requested a review from CISC as a code owner March 12, 2026 06:53
@github-actions github-actions bot added the model, Nvidia GPU, ggml, and Apple Metal labels Mar 12, 2026
0cc4m (Collaborator) commented Mar 12, 2026

Should this go first, or the Vulkan implementation? It's going to be stuck waiting on the CI forever if we have to keep adapting it before it has a chance to finish. Otherwise I have to merge it without waiting for the CI.

ggerganov (Member, Author)

@0cc4m Yes, proceed with the Vulkan implementation. Sorry for the rapid changes.

ORippler (Collaborator) commented Mar 12, 2026

This currently speeds up TG on NVGPU by coalescing the writes (in TG we process n_token=1 and afterwards write the states back, whereas in PP we process ubatch tokens and write the states back at the end). Uncoalesced reads matter less with big caches, and I thus saw no thrashing occur (though on pre-Ada GPUs this may happen). Further PP improvements will have to revolve around batching data accesses (either process > 1 col per warp and do wider loads, or unroll n_tokens). WIP for the latter is #20449, as I'm not super familiar with the data layouting/alignment that goes into the CUDA op.

ORippler (Collaborator)

Relevant NCU sections for TG

[profiler screenshots]

Relevant NCU sections for PP

[profiler screenshots]




Development

Successfully merging this pull request may close these issues.

Fused GDN kernel: column-wise state access causes cache thrashing on Metal/CUDA/CPU

5 participants