vulkan: chunked parallel kernel for GATED_DELTA_NET #20377

Draft
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:vulkan-gdn-chunked

Conversation

@ProgenyAlpha
Contributor

Follow-up to #20334. Adds the chunked parallel kernel infrastructure for Vulkan GATED_DELTA_NET, split out per @0cc4m's review feedback.

Depends on #20334 and #20340.

Three new compute shaders implementing the chunked algorithm:

  • gated_delta_net_chunk_intra.comp — intra-chunk parallel computation
  • gated_delta_net_chunk_inter.comp — inter-chunk state propagation
  • gated_delta_net_chunk_output.comp — output reconstruction

Includes the rq1neq1 broadcast fix to match #20340's interleaved Q/K layout (head_id % neq1 instead of head_id / rq1).

Chunked dispatch is currently disabled (GDN_CHUNK_THRESHOLD = UINT32_MAX) — the autoregressive path handles all token counts. Enabling it will need cooperative matrix support for the output kernel to be competitive.

16/16 backend-ops tests passing (includes chunked-specific test configs with n_seq_tokens=64/128).

890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):

| Metric | Base (#20334) | With #20340 + chunked infra | Change |
| --- | --- | --- | --- |
| PP-512 | 165.31 t/s | 215.46 t/s | +30.3% |
| TG-128 | 21.16 t/s | 21.68 t/s | +2.5% |

The PP improvement comes from #20340's chunked op path feeding our autoregressive shader more efficiently. The Vulkan chunked dispatch itself isn't active yet — that's the next optimization pass.

@github-actions github-actions bot added labels on Mar 11, 2026: model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@lemmi

lemmi commented Mar 11, 2026

Benchmarks, Strix Halo:

master (e1a3999):

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 501.96 ± 4.54 |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 38.28 ± 0.02 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 701.72 ± 3.59 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 44.07 ± 0.05 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 195.42 ± 1.47 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 18.46 ± 0.03 |

PR (795f15c):

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 558.94 ± 6.59 |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 46.39 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 839.17 ± 4.62 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 53.54 ± 0.05 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 221.53 ± 2.29 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 21.39 ± 0.02 |

That's 10-20% better PP performance, depending on the model.

@ProgenyAlpha
Contributor Author

ProgenyAlpha commented Mar 11, 2026

@lemmi Great numbers, thanks for testing.

Updated the PR to actually enable the chunked Vulkan dispatch — it's now gated on shader core count (> 16 CUs) instead of being disabled. On my 890M (16 CUs) the 3-dispatch overhead makes chunked slower than autoregressive, so it stays off there. On your 8060S (32 CUs) it should activate automatically for n_tokens > 64 with d128 non-KDA configs.

I can't validate the chunked dispatch path myself since I only have the integrated 890M. If you get a chance to test the latest push, that would tell us whether the chunked shaders actually help PP on discrete hardware or if they need more work (coopmat for the output kernel is the next step if so).

@lemmi

lemmi commented Mar 11, 2026

Small clarification: the 8060S is the iGPU on Strix Halo (aka Ryzen AI MAX+ 395), and it has 40 CUs.

Performance tanked with the latest patch:
Before:

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 564.42 ± 6.14 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 848.64 ± 1.26 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 229.15 ± 1.30 |

After:

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 366.17 ± 2.17 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 490.98 ± 1.61 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 165.62 ± 2.21 |

Three-dispatch chunked pipeline for prompt processing acceleration:
intra-chunk WY decomposition, inter-chunk state propagation, output
combination. Currently disabled (threshold=UINT32_MAX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ProgenyAlpha
Contributor Author

@0cc4m Rebased on master. The chunked kernels work, but the scalar output kernel is too slow without coopmat, so the threshold is disabled for now. I already have a coopmat output kernel in the works. Do you want me to add it here, keep this PR as infrastructure and open a separate PR for the coopmat kernel, or stop here?

@jeffbolznv
Collaborator

Do I understand correctly that to see a gain you need to merge this PR with another?

What exact command line are you using where you see a 30% gain? I only see GDN taking about 5% of the time running llama-bench -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -fa 1 -p 512 -n 128 --prio 1 -r 10. Can you share a GGML_VK_PERF_LOGGER log?

@ProgenyAlpha ProgenyAlpha marked this pull request as draft March 13, 2026 19:33
@ProgenyAlpha
Contributor Author

@jeffbolznv Hey! As noted in the PR description, the 30% PP gain comes from #20340's chunked op path on the graph side feeding my GDN Vulkan autoregressive shader (#20334) more efficiently, not from the Vulkan chunked shaders in this PR. Both #20334 and #20340 are already merged into master, so that improvement is already live.

The Vulkan chunked dispatch in this PR is actually disabled (GDN_CHUNK_THRESHOLD = UINT32_MAX), so the new shaders aren't running yet. They're infrastructure for the next step: a coopmat output kernel to make chunked competitive with autoregressive (on my hardware I'm already within 5%).

So with this PR as-is, you'd see near-identical performance to master, since the chunked path doesn't activate. I was waiting to hear how @0cc4m, or anyone else in a position to give feedback, would like to handle this.

If preferred, I can close this PR until I've done more thorough validation and testing, and reopen it then.

@jeffbolznv
Collaborator

I mostly want to understand what kind of use case/benchmark you're accelerating on so I can see how much theoretical upside there is.
