CUDA: optimize GDN by hiding global memory loads #20448
Closed
am17an wants to merge 6 commits into ggml-org:gg/gdn-fix-state-transpose from
Conversation
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously:

- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access

Also add a --fused-gdn [on|off|auto] CLI flag (mirroring --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple the fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent a state-layout mismatch between the transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator
Author
Ah, this is actually slower on other cards; need to investigate.
Force-pushed from 5501b39 to deb2943
Member
Bad merge.
Collaborator
Author
5090, ~8% improvement
4090:
3090:
ORippler approved these changes on Mar 13, 2026
Collaborator
ORippler left a comment
Explicitly prefetching k and q improves perf on Blackwell (BW) due to better scoreboarding of LDGs, which are high-latency instructions (500 -> 330 us = 1.5x), while it slightly reduces perf on pre-BW architectures due to worse scoreboarding of register moves, which are low-latency instructions (350 -> 360 us = 2% slowdown). The gain outweighs the loss in my eyes, but if needed one can ifdef via __CUDA_ARCH__. Not sure how it influences AMD perf; @IMbackK may run some tests. Approving from a CUDA perspective.
Collaborator
Force-pushed from 7ea6ee4 to fe3ef4a



Optimize GDN by staging global memory loads across two buffers to hide latency
TODO
on a 5090