
CUDA: Optimize GDN PP perf #20449

Closed

ORippler wants to merge 12 commits into ggml-org:gg/gdn-fix-state-transpose from ORippler:osimons/gdn-pp-optimizations

Conversation


@ORippler ORippler commented Mar 12, 2026

This PR optimizes GDN PP performance by reducing the number of data accesses through unrolling over n_tokens.

Perf for c86d89a (i.e. no distinction between tail and tail-less calls on the kernel side):

```shell
GGML_CUDA=ON ./scripts/compare-commits.sh master osimons/gdn-pp-optimizations llama-bench -m /mnt/share/gguf/Qwen/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf -m /mnt/share/gguf/bartowski/Qwen_Qwen3.5-0.8B-GGUF/Qwen_Qwen3.5-0.8B-Q8_0.gguf -m /mnt/share/gguf/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf -m /mnt/share/gguf/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 8096 -ub 2048 -ub 2047 -mmp 0 -dio 1 -fa 1
```

| Model | Test | t/s master | t/s osimons/gdn-pp-optimizations | Speedup |
| --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | pp8096 | 7845.40 | 7794.67 | 0.99 |
| qwen35 0.8B Q8_0 | pp8096 | 43630.82 | 50364.84 | 1.15 |
| qwen35moe 35B.A3B Q4_K_M | pp8096 | 8535.75 | 8684.72 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | pp8096 | 4267.76 | 4423.98 | 1.04 |

Perf for 33ee71b (separate kernels for the tail and tail-less versions):

```shell
GGML_CUDA=ON ./scripts/compare-commits.sh master osimons/gdn-pp-optimizations llama-bench -m /mnt/share/gguf/Qwen/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf -m /mnt/share/gguf/bartowski/Qwen_Qwen3.5-0.8B-GGUF/Qwen_Qwen3.5-0.8B-Q8_0.gguf -m /mnt/share/gguf/bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf -m /mnt/share/gguf/unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -p 8096 -ub 2048 -ub 2047 -mmp 0 -dio 1 -fa 1
```

| Model | Microbatch size | Test | t/s master | t/s osimons/gdn-pp-optimizations | Speedup |
| --- | --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | 2047 | pp8096 | 7493.63 | 7476.23 | 1.00 |
| kimi-linear 48B.A3B Q4_K_M | 2048 | pp8096 | 7796.75 | 7883.26 | 1.01 |
| qwen35 0.8B Q8_0 | 2047 | pp8096 | 36443.05 | 42682.76 | 1.17 |
| qwen35 0.8B Q8_0 | 2048 | pp8096 | 43576.50 | 50872.04 | 1.17 |
| qwen35moe 35B.A3B Q4_K_M | 2047 | pp8096 | 8131.01 | 8191.54 | 1.01 |
| qwen35moe 35B.A3B Q4_K_M | 2048 | pp8096 | 8537.95 | 8703.95 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | 2047 | pp8096 | 4118.30 | 4288.26 | 1.04 |
| qwen3next 80B.A3B Q4_K_M | 2048 | pp8096 | 4264.98 | 4474.17 | 1.05 |

Not sure if the changes are worth the complexity. Further gains would come from coalescing the writes to attn_data (currently they are not coalesced), or potentially from wider loads/stores (though this would require investigating the alignment of the incoming pointers).

arkavo-com and others added 5 commits March 12, 2026 08:27
…gml-org#20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the Nvidia GPU and ggml labels Mar 12, 2026
@github-actions github-actions bot added the testing label Mar 12, 2026
@ORippler ORippler marked this pull request as ready for review March 12, 2026 20:41
@ORippler ORippler requested a review from ggerganov as a code owner March 12, 2026 20:41
@ORippler ORippler requested a review from am17an March 12, 2026 20:42

am17an commented Mar 13, 2026

We can try #20448 first since that's a much simpler change? We'd need to make sure it doesn't regress on older hardware or HIP.


ORippler commented Mar 13, 2026

Yeah, I was mainly looking for this kind of feedback to determine whether it's worth pushing this one further (coalescing writes + code cleanup), given also that chunked GDN should outperform AR GDN. Will take a look at #20448 first.

@ORippler ORippler marked this pull request as draft March 13, 2026 09:37
@ggerganov ggerganov force-pushed the gg/gdn-fix-state-transpose branch from 7ea6ee4 to fe3ef4a March 13, 2026 16:13
@ORippler ORippler closed this Mar 13, 2026
@ORippler ORippler deleted the osimons/gdn-pp-optimizations branch March 13, 2026 16:41
@ORippler

Closed in favor of #20488
