graph : remove redundant GDN state transposes by ggerganov · Pull Request #20443 · ggml-org/llama.cpp

ggerganov · 2026-03-12T06:53:47Z

cont #20437
fix #20436

As correctly noted in #20436, there is no need to transpose the recurrent state in the GDN computation. Simplify the ggml graph for the unfused path + optimize the fused kernels using coalesced read/writes.

TODOs:

Wait for vulkan: add GATED_DELTA_NET op support #20334 to be merged and adapt the current PR

…20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously: - Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v) - CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced) - CPU: restructured loops for row-wise transposed access Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ed flags - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output) - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

0cc4m · 2026-03-12T08:23:14Z

This first or the Vulkan implementation first? It's gonna be stuck waiting for the CI forever, if we have to keep adapting it before it has a chance to finish. Otherwise I have to merge it without waiting for the CI.

ggerganov · 2026-03-12T08:25:19Z

@0cc4m Yes, proceed with the Vulkan implementation. Sorry for the rapid changes.

ORippler · 2026-03-12T10:32:50Z

This currently speeds-up TG on NVGPU due to coalescing the writes (in TG we process n_token=1 and afterwards write the states back, whereas in PP we process ubatch tokens and write states back at the end). Uncoalesced reads are not so important on big caches, and I thus saw no trashing occur (though on pre-Ada GPUs this may happen). Further PP improvements will have to revolve around batching data-accesses (either process > 1 col per warp and do wider loads or unroll n_tokens). WIP for the latter would be #20449, as I'm not super familiar with data-layouting/alignment that comes into the cuda op

ORippler · 2026-03-12T10:49:28Z

Relevant NCU sections for TG

Relevant NCU sections for PP

arkavo-com and others added 4 commits March 12, 2026 08:27

llama : rever fgdn argument changes

1b82571

graph : remove GDN state transposes

7ea6ee4

ggerganov requested a review from CISC as a code owner March 12, 2026 06:53

ggerganov mentioned this pull request Mar 12, 2026

ggml : transpose fused GDN state access for coalesced memory reads #20437

Closed

5 tasks

github-actions bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 12, 2026

CISC approved these changes Mar 12, 2026

View reviewed changes

am17an mentioned this pull request Mar 12, 2026

CUDA: optimize GDN by hiding global memory loads #20448

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

graph : remove redundant GDN state transposes#20443

graph : remove redundant GDN state transposes#20443
ggerganov wants to merge 4 commits intomasterfrom
gg/gdn-fix-state-transpose

ggerganov commented Mar 12, 2026 •

edited by 0cc4m

Loading

Uh oh!

0cc4m commented Mar 12, 2026

Uh oh!

ggerganov commented Mar 12, 2026

Uh oh!

ORippler commented Mar 12, 2026 •

edited

Loading

Uh oh!

ORippler commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

ggerganov commented Mar 12, 2026 • edited by 0cc4m Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented Mar 12, 2026

Uh oh!

ggerganov commented Mar 12, 2026

Uh oh!

ORippler commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ORippler commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ggerganov commented Mar 12, 2026 •

edited by 0cc4m

Loading

ORippler commented Mar 12, 2026 •

edited

Loading