vulkan: chunked parallel kernel for GATED_DELTA_NET #20377

Draft
ProgenyAlpha wants to merge 1 commit into ggml-org:master from ProgenyAlpha:vulkan-gdn-chunked

Conversation

@ProgenyAlpha
Contributor

Follow-up to #20334. Adds the chunked parallel kernel infrastructure for Vulkan GATED_DELTA_NET, split out per @0cc4m's review feedback.

Depends on #20334 and #20340.

Three new compute shaders implementing the chunked algorithm:

  • gated_delta_net_chunk_intra.comp — intra-chunk parallel computation
  • gated_delta_net_chunk_inter.comp — inter-chunk state propagation
  • gated_delta_net_chunk_output.comp — output reconstruction

Includes the rq1neq1 broadcast fix to match #20340's interleaved Q/K layout (head_id % neq1 instead of head_id / rq1).

Chunked dispatch is currently disabled (GDN_CHUNK_THRESHOLD = UINT32_MAX) — the autoregressive path handles all token counts. Enabling it will need cooperative matrix support for the output kernel to be competitive.

16/16 backend-ops tests passing (includes chunked-specific test configs with n_seq_tokens=64/128).

890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):

| Metric | Base (#20334) | With #20340 + chunked infra | Change |
| --- | --- | --- | --- |
| PP-512 | 165.31 t/s | 215.46 t/s | +30.3% |
| TG-128 | 21.16 t/s | 21.68 t/s | +2.5% |

The PP improvement comes from #20340's chunked op path feeding our autoregressive shader more efficiently. The Vulkan chunked dispatch itself isn't active yet — that's the next optimization pass.

@github-actions github-actions bot added labels on Mar 11, 2026: model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@lemmi

lemmi commented Mar 11, 2026

Benchmarks, Strix Halo:

master (e1a3999):

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 501.96 ± 4.54 |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 38.28 ± 0.02 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 701.72 ± 3.59 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 44.07 ± 0.05 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 195.42 ± 1.47 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 18.46 ± 0.03 |

PR (795f15c):

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 558.94 ± 6.59 |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 46.39 ± 0.04 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 839.17 ± 4.62 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 53.54 ± 0.05 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 221.53 ± 2.29 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | tg128 | 21.39 ± 0.02 |

That's 10-20% better PP performance, depending on the model.

@ProgenyAlpha
Contributor Author

ProgenyAlpha commented Mar 11, 2026

@lemmi Great numbers, thanks for testing.

Updated the PR to actually enable the chunked Vulkan dispatch — it's now gated on shader core count (> 16 CUs) instead of being disabled. On my 890M (16 CUs) the 3-dispatch overhead makes chunked slower than autoregressive, so it stays off there. On your 8060S (32 CUs) it should activate automatically for n_tokens > 64 with d128 non-KDA configs.

I can't validate the chunked dispatch path myself since I only have the integrated 890M. If you get a chance to test the latest push, that would tell us whether the chunked shaders actually help PP on discrete hardware or if they need more work (coopmat for the output kernel is the next step if so).

@lemmi

lemmi commented Mar 11, 2026

Small clarification: the 8060S is the iGPU on Strix Halo (aka Ryzen AI MAX+ 395), and it has 40 CUs.

Performance tanked with the latest patch:
Before:

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 564.42 ± 6.14 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 848.64 ± 1.26 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 229.15 ± 1.30 |

After:

| model | size | params | backend | ngl | fa | mmap | dio | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q4_K - Medium | 46.20 GiB | 79.67 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 366.17 ± 2.17 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 490.98 ± 1.61 |
| qwen35moe 122B.A10B Q5_K - Medium | 85.60 GiB | 122.11 B | Vulkan | 99 | 1 | 0 | 1 | pp2048 | 165.62 ± 2.21 |

Three-dispatch chunked pipeline for prompt processing acceleration:
intra-chunk WY decomposition, inter-chunk state propagation, output
combination. Currently disabled (threshold=UINT32_MAX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ProgenyAlpha
Contributor Author

@0cc4m Rebased on master. The chunked kernels work, but the scalar output kernel is too slow without coopmat, so the threshold is disabled for now. I already have a coopmat output kernel in the works. Do you want me to add it here, keep this PR as infrastructure and open a separate PR for the coopmat kernel, or stop here?

@jeffbolznv
Collaborator

Do I understand correctly that to see a gain you need to merge this PR with another?

What exact command line are you using where you see a 30% gain? I only see GDN taking about 5% of the time running llama-bench -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -fa 1 -p 512 -n 128 --prio 1 -r 10. Can you share a GGML_VK_PERF_LOGGER log?

@ProgenyAlpha ProgenyAlpha marked this pull request as draft March 13, 2026 19:33
@ProgenyAlpha
Contributor Author

@jeffbolznv Hey! As noted in the PR description, the 30% PP gain comes from #20340's chunked op path on the graph side feeding my GDN Vulkan autoregressive shader (#20334) more efficiently, not from the Vulkan chunked shaders in this PR. Both #20334 and #20340 are already merged into master, so that improvement is already live.

The Vulkan chunked dispatch in this PR is actually disabled (GDN_CHUNK_THRESHOLD = UINT32_MAX), so the new shaders aren't running yet. They're infrastructure for the next step: a coopmat output kernel to make chunked competitive with autoregressive (on my hardware I'm already within 5%).

So with this PR as-is, you'd see near-identical performance to master, since the chunked path doesn't activate. I was waiting to hear how @0cc4m, or anyone else in a position to give feedback, would like to handle this.

If preferred, I can close this PR until I've done more thorough validation and testing, and reopen it then.

@jeffbolznv
Collaborator

I mostly want to understand what kind of use case/benchmark you're accelerating on so I can see how much theoretical upside there is.
