sycl: add GGML_OP_GATED_DELTA_NET fused kernel by taowen-paraflow · Pull Request #20571 · ggml-org/llama.cpp

taowen-paraflow · 2026-03-15T01:55:45Z

Summary

Port the Gated Delta Net (GDN) recurrence kernel from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend
Enables Qwen3.5 and other delta-net architecture models to run efficiently on Intel GPUs via oneAPI
Previously, the SYCL backend had no GDN support, causing fallback to CPU for these operations

Implementation

New files:

ggml/src/ggml-sycl/gdn.cpp — fused kernel implementation
ggml/src/ggml-sycl/gdn.hpp — header

Modified files:

ggml/src/ggml-sycl/backend.hpp — add include
ggml/src/ggml-sycl/ggml-sycl.cpp — add dispatch case and supports_op entry

Kernel features:

Both GDA (scalar gate) and KDA (key-dependent / vector gate) variants
Head sizes 32, 64, 128 via compile-time templates
GQA/MQA support through stride-based tensor access (matching Vulkan push constants layout)
sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
One workgroup per (head, seq), S_V threads per workgroup, state held in registers

Benchmark

Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M, -ngl 99:

Metric	Before (no GDN kernel)	After (this PR)	Change
Decode (tok/s)	22.0	54.0	+145%
Prompt (tok/s)	8.6	23.1	+169%

The decode improvement comes from GDN layers now running as a fused kernel on GPU instead of falling back to per-op CPU execution.

Test plan

Builds successfully with oneAPI 2025.3 + Ninja on Windows
End-to-end inference with Qwen3.5-0.8B produces coherent output
test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)
CI with SYCL build

🤖 Generated with Claude Code

Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other delta-net models to run on Intel GPUs via oneAPI. Kernel features: - Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes - Head sizes 32, 64, 128 via compile-time templates - GQA/MQA support through stride-based tensor access - Float4 vectorized inner loops matching the GLA kernel pattern - One workgroup per (head, seq) with S_V threads; state held in registers Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M: - Before (GDN fallback to CPU): 22.0 tok/s decode - After (GDN fused on GPU): 54.0 tok/s decode (+145%) - Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wangqiushi-liu · 2026-03-15T04:05:57Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

savvadesogle · 2026-03-15T08:24:57Z

Hello
Thank you ❤️

Windows 11
Intel Xeon 2699v3
1-2x Intel Arc A770 (DG2) Asrock 190W
Driver 8509

Models:
TheBloke\Llama-2-7B-GGUF\llama-2-7b.Q4_0.gguf
lmstudio-community\Qwen3.5-0.8B-GGUF\Qwen3.5-0.8B-Q4_K_M.gguf
lmstudio-community\Qwen3.5-0.8B-GGUF\Qwen3.5-0.8B-Q8_0.gguf
with/without -DGGML_SYCL_F16=ON

1x A770

Qwen3.5-0.8B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |        213.88 + 0.65 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         36.27 + 0.15 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |      4057.25 + 97.48 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         60.69 + 1.36 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |      3984.61 + 17.69 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         47.11 + 0.27 |

Qwen3.5-2B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |        226.77 + 0.35 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         41.68 + 0.19 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |       3271.26 + 8.73 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         44.33 + 0.39 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |       2621.94 + 9.23 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         53.45 + 0.50 |

Qwen3.5-4B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |         82.74 + 0.41 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         23.56 + 0.09 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |      1868.92 + 24.84 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         32.63 + 0.06 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1220.79 + 7.89 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         32.12 + 0.34 |

Qwen3.5-4B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |         83.68 + 0.16 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         18.23 + 0.03 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1831.84 + 4.51 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         19.08 + 0.03 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1184.61 + 6.43 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         22.27 + 0.07 |

Qwen3.5-9B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |         80.19 + 0.17 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         18.82 + 0.12 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |      1346.38 + 13.90 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         24.18 + 0.04 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |        753.07 + 2.68 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         24.14 + 0.04 |

Qwen3.5-9B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |         81.08 + 0.08 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         12.92 + 0.06 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |       1320.59 + 4.45 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         12.80 + 0.01 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |        719.20 + 4.42 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         14.23 + 1.04 |

2x A770

Qwen3.5-27B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |         26.92 + 0.17 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          6.78 + 0.01 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |        459.88 + 1.99 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          7.40 + 0.02 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |        221.75 + 0.47 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          7.38 + 0.03 |

Qwen3.5-27B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |         20.35 + 0.68 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          2.51 + 0.00 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |        449.83 + 0.72 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          4.03 + 0.01 |

sycl-gated-delta-net F32
PP performance🤷‍♂️:The model is distributed between two cards, but one GPU is at 100%, the other at 0%.
llama-bench -m T:\models\lmstudio-community\Qwen3.5-27B-GGUF\Qwen3.5-27B-Q8_0.gguf -ngl 100 -fa 0

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |         36.38 + 0.79 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          4.51 + 0.01 |

Qwen3.5-35B-A3B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |         44.62 + 1.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          7.66 + 0.06 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |        213.99 + 4.80 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          8.40 + 0.07 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |        222.62 + 5.51 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          8.72 + 0.02 |

Vulkan Qwen3.5-35B-A3B-Q8_0

Sorry. I haven't tested the Q8 quant with 3x gpus. 🤷‍♂️

But I have numbers for Vulkan)
#19918 (comment)

taowen-paraflow · 2026-03-15T08:39:12Z

duplicated with #20455

NeoZhangJianyu · 2026-03-15T09:43:17Z

@taowen-paraflow
It is unfortunate that we did the same work.
If your code could provide better performance, we could use your solution to cover the existed code. :)

Maybe we could create an issue of planned work firstly.
Then others will know it and avoid duplicated work in the future.

Thank you!

savvadesogle · 2026-03-15T12:15:15Z

@NeoZhangJianyu , @taowen-paraflow

I tested it and noticed a slight difference for TG.

GPU A770 (1-3x)
2699 v3 xeon
Windows 11
driver 8509

https://docs.google.com/spreadsheets/d/1zlxnxylvwhTWgMvJ50ysmMDNTXnucbc-/

NeoZhangJianyu · 2026-03-15T13:59:36Z

@taowen-paraflow
Could you check the diff between you PR and current master?
It's great if you could merge the Q8 & Q4 benefit to master. :)

Thank you!

github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Mar 15, 2026

loci-dev mentioned this pull request Mar 15, 2026

UPSTREAM PR #20571: sycl: add GGML_OP_GATED_DELTA_NET fused kernel auroralabs-loci/llama.cpp#1256

Open

4 tasks

taowen-paraflow closed this Mar 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sycl: add GGML_OP_GATED_DELTA_NET fused kernel#20571

sycl: add GGML_OP_GATED_DELTA_NET fused kernel#20571
taowen-paraflow wants to merge 1 commit intoggml-org:masterfrom
taowen-paraflow:sycl-gated-delta-net

taowen-paraflow commented Mar 15, 2026

Uh oh!

wangqiushi-liu commented Mar 15, 2026

Uh oh!

savvadesogle commented Mar 15, 2026 •

edited

Loading

Uh oh!

taowen-paraflow commented Mar 15, 2026

Uh oh!

NeoZhangJianyu commented Mar 15, 2026

Uh oh!

savvadesogle commented Mar 15, 2026

Uh oh!

NeoZhangJianyu commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

taowen-paraflow commented Mar 15, 2026

Summary

Implementation

Benchmark

Test plan

Uh oh!

wangqiushi-liu commented Mar 15, 2026

Code review

Uh oh!

savvadesogle commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1x A770

Qwen3.5-0.8B-Q4_K_M

Qwen3.5-2B-Q4_K_M

Qwen3.5-4B-Q4_K_M

Qwen3.5-4B-Q8_0

Qwen3.5-9B-Q4_K_M

Qwen3.5-9B-Q8_0

2x A770

Qwen3.5-27B-Q4_K_M

Qwen3.5-27B-Q8_0

Qwen3.5-35B-A3B-Q4_K_M

Vulkan Qwen3.5-35B-A3B-Q8_0

Uh oh!

taowen-paraflow commented Mar 15, 2026

Uh oh!

NeoZhangJianyu commented Mar 15, 2026

Uh oh!

savvadesogle commented Mar 15, 2026

Uh oh!

NeoZhangJianyu commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

savvadesogle commented Mar 15, 2026 •

edited

Loading