
CUDA: Optimize GDN PP perf #20449

Closed

ORippler wants to merge 12 commits into ggml-org:gg/gdn-fix-state-transpose from ORippler:osimons/gdn-pp-optimizations

Conversation


@ORippler ORippler commented Mar 12, 2026

This PR optimizes GDN PP performance by reducing the number of data accesses through unrolling over n_tokens.

Perf for c86d89a (i.e. no distinction between tail and tail-less calls on the kernel side):

```shell
GGML_CUDA=ON ./scripts/compare-commits.sh master osimons/gdn-pp-optimizations llama-bench -m /mnt/share/gguf/Qwen/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf -m /mnt/share/gguf/bartowski/Qwen_Qwen3.5-0.8B-GGUF/Qwen_Qwen3.5-0.8B-Q8_0.gguf -m /mnt/share/gguf/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf -m /mnt/share/gguf/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 8096 -ub 2048 -ub 2047 -mmp 0 -dio 1 -fa 1
```

| Model | Test | t/s master | t/s osimons/gdn-pp-optimizations | Speedup |
| --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | pp8096 | 7845.40 | 7794.67 | 0.99 |
| qwen35 0.8B Q8_0 | pp8096 | 43630.82 | 50364.84 | 1.15 |
| qwen35moe 35B.A3B Q4_K_M | pp8096 | 8535.75 | 8684.72 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | pp8096 | 4267.76 | 4423.98 | 1.04 |

Perf for 33ee71b (separate kernels for the tail and tail-less versions):

```shell
GGML_CUDA=ON ./scripts/compare-commits.sh master osimons/gdn-pp-optimizations llama-bench -m /mnt/share/gguf/Qwen/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf -m /mnt/share/gguf/bartowski/Qwen_Qwen3.5-0.8B-GGUF/Qwen_Qwen3.5-0.8B-Q8_0.gguf -m /mnt/share/gguf/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf -m /mnt/share/gguf/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 8096 -ub 2048 -ub 2047 -mmp 0 -dio 1 -fa 1
```

| Model | Microbatch size | Test | t/s master | t/s osimons/gdn-pp-optimizations | Speedup |
| --- | --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | 2047 | pp8096 | 7493.63 | 7476.23 | 1.00 |
| kimi-linear 48B.A3B Q4_K_M | 2048 | pp8096 | 7796.75 | 7883.26 | 1.01 |
| qwen35 0.8B Q8_0 | 2047 | pp8096 | 36443.05 | 42682.76 | 1.17 |
| qwen35 0.8B Q8_0 | 2048 | pp8096 | 43576.50 | 50872.04 | 1.17 |
| qwen35moe 35B.A3B Q4_K_M | 2047 | pp8096 | 8131.01 | 8191.54 | 1.01 |
| qwen35moe 35B.A3B Q4_K_M | 2048 | pp8096 | 8537.95 | 8703.95 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | 2047 | pp8096 | 4118.30 | 4288.26 | 1.04 |
| qwen3next 80B.A3B Q4_K_M | 2048 | pp8096 | 4264.98 | 4474.17 | 1.05 |

Not sure if the changes are worth the complexity. Further gains would come from coalescing the writes to attn_data (currently they are not coalesced), or potentially from wider loads/stores (though this would require investigating the alignment of the incoming pointers).

arkavo-com and others added 5 commits March 12, 2026 08:27
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the Nvidia GPU and ggml labels Mar 12, 2026
@github-actions github-actions bot added the testing label Mar 12, 2026
@ORippler ORippler marked this pull request as ready for review March 12, 2026 20:41
@ORippler ORippler requested a review from ggerganov as a code owner March 12, 2026 20:41
@ORippler ORippler requested a review from am17an March 12, 2026 20:42

am17an commented Mar 13, 2026

We can try #20448 first since that's a much simpler change? We'd need to make sure it doesn't regress on older hardware or HIP.


ORippler commented Mar 13, 2026

Yeah, I was mainly looking for this kind of feedback to determine whether it's worth pushing this one further (coalescing writes + code cleanup), given also that chunked GDN should outperform AR GDN. Will take a look at #20448 first.

@ORippler ORippler marked this pull request as draft March 13, 2026 09:37
@ggerganov ggerganov force-pushed the gg/gdn-fix-state-transpose branch from 7ea6ee4 to fe3ef4a March 13, 2026 16:13
@ORippler ORippler closed this Mar 13, 2026
@ORippler ORippler deleted the osimons/gdn-pp-optimizations branch March 13, 2026 16:41
@ORippler

Closed in favor of #20488
