
CUDA: optimize GDN by hiding global memory loads#20448

Closed
am17an wants to merge 6 commits into ggml-org:gg/gdn-fix-state-transpose from am17an:cuda_gdn_load

Conversation

am17an (Collaborator) commented Mar 12, 2026

Optimize GDN by staging global memory loads across two buffers to hide latency
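The staging pattern can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual kernel: the kernel name, `n_iters`, `stride`, and the compute step are all hypothetical, and in practice the compiler may already perform some of this scheduling.

```cuda
// Sketch: double-buffer global loads in registers so the LDG latency of
// iteration i+1 overlaps the arithmetic of iteration i.
__global__ void gdn_like(const float * __restrict__ x, float * y,
                         int n_iters, int stride) {
    const int tid = threadIdx.x;

    float cur = x[tid];   // stage iteration 0 up front
    float acc = 0.0f;

    for (int i = 0; i < n_iters; ++i) {
        float nxt = 0.0f;
        if (i + 1 < n_iters) {
            nxt = x[(i + 1)*stride + tid];  // issue next load early
        }
        acc += cur*cur;   // compute overlaps the in-flight load
        cur  = nxt;       // rotate buffers
    }
    y[tid] = acc;
}
```

The point of the rotation is that the load of `nxt` is issued before `cur` is consumed, so the memory scoreboard stall lands after useful work instead of before it.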

TODO

On a 5090:

| Model | Test | t/s c3e3f9e | t/s cuda_gdn_load | Speedup |
| --- | --- | ---: | ---: | ---: |
| kimi-linear 48B.A3B Q4_K_M | pp512 | 6442.71 | 6373.16 | 0.99 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 | 9439.95 | 9669.07 | 1.02 |
| kimi-linear 48B.A3B Q4_K_M | pp4096 | 9362.95 | 9626.56 | 1.03 |
| kimi-linear 48B.A3B Q4_K_M | tg128 | 201.23 | 200.95 | 1.00 |
| qwen35moe 35B.A3B Q4_K_S | pp512 | 7063.34 | 6909.85 | 0.98 |
| qwen35moe 35B.A3B Q4_K_S | pp2048 | 8895.10 | 9526.67 | 1.07 |
| qwen35moe 35B.A3B Q4_K_S | pp4096 | 9039.57 | 9671.51 | 1.07 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 192.03 | 191.79 | 1.00 |

arkavo-com and others added 4 commits March 12, 2026 08:27
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access
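A hedged sketch of what the transposed indexing buys on CUDA. This is a simplified stand-in, not the fused kernel itself (the real kernel's thread-to-data mapping differs); it only illustrates why making the thread index the fastest-varying dimension of a row-major [S_v, S_v] matrix matters.

```cuda
// Coalescing illustration: launch with blockDim.x == S_v (e.g. 128).
__global__ void copy_state(const float * __restrict__ state,
                           float * __restrict__ out, int S_v) {
    const int row = blockIdx.x;
    const int col = threadIdx.x;

    // Coalesced: the 32 threads of a warp read 32 consecutive floats of
    // one row, so the warp's reads collapse into a few wide transactions.
    out[row*S_v + col] = state[row*S_v + col];

    // The pre-fix pattern was equivalent to state[col*S_v + row]:
    // consecutive threads then touch addresses 4*S_v bytes apart, so
    // most of every memory transaction is fetched and thrown away.
}
```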

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats
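The vec_dot substitution might look roughly like this. This is a sketch, assuming ggml's `ggml_vec_dot_f32(n, s, bs, x, bx, y, by, nrc)` signature; the surrounding loop shape and names are hypothetical, not the actual CPU kernel.

```cuda
// Host-side sketch: replace a scalar inner product over the state row
// with ggml's SIMD-dispatched dot product.
static void attn_out_elem(const float * k_row, const float * state_row,
                          float * out, int S_v) {
    // before (scalar):
    //   float sum = 0.0f;
    //   for (int i = 0; i < S_v; ++i) sum += k_row[i]*state_row[i];

    // after: single call, vectorized per-arch inside ggml
    float sum = 0.0f;
    ggml_vec_dot_f32(S_v, &sum, 0, k_row, 0, state_row, 0, 1);
    *out = sum;
}
```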

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added labels Nvidia GPU, ggml — Mar 12, 2026
am17an (Collaborator, Author) commented Mar 12, 2026

Ah this is actually slower on other cards, need to investigate

@am17an am17an force-pushed the cuda_gdn_load branch 2 times, most recently from 5501b39 to deb2943 Compare March 12, 2026 13:31
CISC (Member) commented Mar 12, 2026

Bad merge.

github-actions bot added labels documentation, model, Vulkan, examples, python, server, Apple Metal — Mar 12, 2026
am17an removed labels documentation, model, Vulkan, examples, python, server, ggml, Apple Metal — Mar 12, 2026
@am17an am17an removed the request for review from CISC March 12, 2026 13:37
@am17an am17an changed the base branch from master to gg/gdn-fix-state-transpose March 12, 2026 13:49
am17an (Collaborator, Author) commented Mar 12, 2026

On a 5090, ~8% improvement:

| Model | Test | t/s gg/gdn-fix-state-transpose | t/s cuda_gdn_load | Speedup |
| --- | --- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q4_K_S | pp512 | 7074.59 | 7079.95 | 1.00 |
| qwen35moe 35B.A3B Q4_K_S | pp2048 | 8904.86 | 9594.99 | 1.08 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 197.02 | 197.43 | 1.00 |

4090:

| Model | Test | t/s gg/gdn-fix-state-transpose | t/s cuda_gdn_load | Speedup |
| --- | --- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q4_K_S | pp512 | 6501.74 | 6470.37 | 1.00 |
| qwen35moe 35B.A3B Q4_K_S | pp2048 | 8484.94 | 8430.49 | 0.99 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 183.11 | 183.21 | 1.00 |

3090:

| Model | Test | t/s gg/gdn-fix-state-transpose | t/s cuda_gdn_load | Speedup |
| --- | --- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q4_K_S | pp512 | 3066.08 | 3033.36 | 0.99 |
| qwen35moe 35B.A3B Q4_K_S | pp2048 | 4183.36 | 4074.75 | 0.97 |
| qwen35moe 35B.A3B Q4_K_S | tg128 | 142.97 | 141.11 | 0.99 |

github-actions bot added label ggml — Mar 12, 2026
ORippler (Collaborator) left a review comment

Explicitly prefetching k and q improves perf on BW (Blackwell) due to better scoreboarding of LDGs, which are high-latency instructions (500 → 330 us = 1.5x), while it slightly reduces perf on pre-BW due to worse scoreboarding of register movement, which consists of low-latency instructions (350 → 360 us = ~3% slowdown). The gain outweighs the loss in my eyes, but if needed one can ifdef via __CUDA_ARCH__. Not sure how it influences AMD perf; @IMbackK may run some tests. Approving from a CUDA perspective.

ORippler (Collaborator) commented Mar 13, 2026

BW: [profiler screenshots]

Ada: [profiler screenshot]

If motivated, one could try manual hoisting of pointer arithmetic (pull loop invariants out of the loop over n_tokens for the incoming ptrs) to see if it improves/recovers behavior on pre-BW GPUs
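The suggested hoisting could look something like this sketch (illustrative names throughout; whether it recovers pre-BW perf depends on what the compiler already hoists and on register pressure):

```cuda
// Sketch: pull per-sequence pointer arithmetic out of the n_tokens loop
// so only one pointer add per iteration remains in the hot path.
__global__ void scan(const float * __restrict__ q,
                     const float * __restrict__ k,
                     float * y, int n_tokens, int head_dim) {
    const int tid = threadIdx.x;

    // hoisted loop invariants: bases depend only on block/thread ids
    const float * q_base = q + (size_t)blockIdx.x * n_tokens * head_dim + tid;
    const float * k_base = k + (size_t)blockIdx.x * n_tokens * head_dim + tid;

    float acc = 0.0f;
    for (int t = 0; t < n_tokens; ++t) {
        // inside the loop: a single strided offset per pointer, no
        // per-iteration re-derivation of the full address expression
        acc += q_base[t*head_dim] * k_base[t*head_dim];
    }
    y[blockIdx.x*blockDim.x + tid] = acc;
}
```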

@ggerganov ggerganov force-pushed the gg/gdn-fix-state-transpose branch from 7ea6ee4 to fe3ef4a Compare March 13, 2026 16:13
github-actions bot added labels model, Apple Metal — Mar 13, 2026
@ggerganov ggerganov deleted the branch ggml-org:gg/gdn-fix-state-transpose March 13, 2026 20:12
@ggerganov ggerganov closed this Mar 13, 2026
