Skip to content

sycl: add GGML_OP_GATED_DELTA_NET fused kernel#20571

Closed
taowen-paraflow wants to merge 1 commit intoggml-org:masterfrom
taowen-paraflow:sycl-gated-delta-net
Closed

sycl: add GGML_OP_GATED_DELTA_NET fused kernel#20571
taowen-paraflow wants to merge 1 commit intoggml-org:masterfrom
taowen-paraflow:sycl-gated-delta-net

Conversation

@taowen-paraflow
Copy link

Summary

  • Port the Gated Delta Net (GDN) recurrence kernel from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend
  • Enables Qwen3.5 and other delta-net architecture models to run efficiently on Intel GPUs via oneAPI
  • Previously, the SYCL backend had no GDN support, causing fallback to CPU for these operations

Implementation

New files:

  • ggml/src/ggml-sycl/gdn.cpp — fused kernel implementation
  • ggml/src/ggml-sycl/gdn.hpp — header

Modified files:

  • ggml/src/ggml-sycl/backend.hpp — add include
  • ggml/src/ggml-sycl/ggml-sycl.cpp — add dispatch case and supports_op entry

Kernel features:

  • Both GDA (scalar gate) and KDA (key-dependent / vector gate) variants
  • Head sizes 32, 64, 128 via compile-time templates
  • GQA/MQA support through stride-based tensor access (matching Vulkan push constants layout)
  • sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
  • One workgroup per (head, seq), S_V threads per workgroup, state held in registers

Benchmark

Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M, -ngl 99:

Metric Before (no GDN kernel) After (this PR) Change
Decode (tok/s) 22.0 54.0 +145%
Prompt (tok/s) 8.6 23.1 +169%

The decode improvement comes from GDN layers now running as a fused kernel on GPU instead of falling back to per-op CPU execution.

Test plan

  • Builds successfully with oneAPI 2025.3 + Ninja on Windows
  • End-to-end inference with Qwen3.5-0.8B produces coherent output
  • test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)
  • CI with SYCL build

🤖 Generated with Claude Code

Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader
(gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other
delta-net models to run on Intel GPUs via oneAPI.

Kernel features:
- Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes
- Head sizes 32, 64, 128 via compile-time templates
- GQA/MQA support through stride-based tensor access
- Float4 vectorized inner loops matching the GLA kernel pattern
- One workgroup per (head, seq) with S_V threads; state held in registers

Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M:
- Before (GDN fallback to CPU): 22.0 tok/s decode
- After  (GDN fused on GPU):    54.0 tok/s decode  (+145%)
- Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Mar 15, 2026
@wangqiushi-liu
Copy link


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@savvadesogle
Copy link

savvadesogle commented Mar 15, 2026

Hello
Thank you ❤️

Windows 11
Intel Xeon 2699v3
1-2x Intel Arc A770 (DG2) Asrock 190W
Driver 8509

Models:
TheBloke\Llama-2-7B-GGUF\llama-2-7b.Q4_0.gguf
lmstudio-community\Qwen3.5-0.8B-GGUF\Qwen3.5-0.8B-Q4_K_M.gguf
lmstudio-community\Qwen3.5-0.8B-GGUF\Qwen3.5-0.8B-Q8_0.gguf
with/without -DGGML_SYCL_F16=ON
изображение

1x A770

Qwen3.5-0.8B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |        213.88 + 0.65 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         36.27 + 0.15 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |      4057.25 + 97.48 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         60.69 + 1.36 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           pp512 |      3984.61 + 17.69 |
| qwen35 0.8B Q4_K - Medium      | 492.61 MiB |   752.39 M | SYCL       | 100 |           tg128 |         47.11 + 0.27 |

Qwen3.5-2B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |        226.77 + 0.35 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         41.68 + 0.19 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |       3271.26 + 8.73 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         44.33 + 0.39 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           pp512 |       2621.94 + 9.23 |
| qwen35 2B Q4_K - Medium        |   1.17 GiB |     1.88 B | SYCL       | 100 |           tg128 |         53.45 + 0.50 |

Qwen3.5-4B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |         82.74 + 0.41 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         23.56 + 0.09 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |      1868.92 + 24.84 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         32.63 + 0.06 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1220.79 + 7.89 |
| qwen35 4B Q4_K - Medium        |   2.51 GiB |     4.21 B | SYCL       | 100 |           tg128 |         32.12 + 0.34 |

Qwen3.5-4B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |         83.68 + 0.16 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         18.23 + 0.03 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1831.84 + 4.51 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         19.08 + 0.03 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           pp512 |       1184.61 + 6.43 |
| qwen35 4B Q8_0                 |   4.16 GiB |     4.21 B | SYCL       | 100 |           tg128 |         22.27 + 0.07 |

Qwen3.5-9B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |         80.19 + 0.17 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         18.82 + 0.12 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |      1346.38 + 13.90 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         24.18 + 0.04 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           pp512 |        753.07 + 2.68 |
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |           tg128 |         24.14 + 0.04 |

Qwen3.5-9B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |         81.08 + 0.08 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         12.92 + 0.06 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |       1320.59 + 4.45 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         12.80 + 0.01 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           pp512 |        719.20 + 4.42 |
| qwen35 9B Q8_0                 |   8.86 GiB |     8.95 B | SYCL       | 100 |           tg128 |         14.23 + 1.04 |

2x A770

Qwen3.5-27B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |         26.92 + 0.17 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          6.78 + 0.01 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |        459.88 + 1.99 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          7.40 + 0.02 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           pp512 |        221.75 + 0.47 |
| qwen35 27B Q4_K - Medium       |  15.39 GiB |    26.90 B | SYCL       | 100 |           tg128 |          7.38 + 0.03 |

Qwen3.5-27B-Q8_0

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |         20.35 + 0.68 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          2.51 + 0.00 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |        449.83 + 0.72 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          4.03 + 0.01 |

sycl-gated-delta-net F32
PP performance🤷‍♂️:The model is distributed between two cards, but one GPU is at 100%, the other at 0%.
llama-bench -m T:\models\lmstudio-community\Qwen3.5-27B-GGUF\Qwen3.5-27B-Q8_0.gguf -ngl 100 -fa 0

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           pp512 |         36.38 + 0.79 |
| qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | SYCL       | 100 |           tg128 |          4.51 + 0.01 |

Qwen3.5-35B-A3B-Q4_K_M

b8339 (mainline)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |         44.62 + 1.72 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          7.66 + 0.06 |

sycl-gated-delta-net F16

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |        213.99 + 4.80 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          8.40 + 0.07 |

sycl-gated-delta-net F32

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           pp512 |        222.62 + 5.51 |
| qwen35moe 35B.A3B Q4_K - Medium |  19.71 GiB |    34.66 B | SYCL       | 100 |           tg128 |          8.72 + 0.02 |

Vulkan Qwen3.5-35B-A3B-Q8_0

Sorry. I haven't tested the Q8 quant with 3x gpus. 🤷‍♂️
изображение

But I have numbers for Vulkan)
#19918 (comment)
изображение
изображение

@taowen-paraflow
Copy link
Author

duplicated with #20455

@NeoZhangJianyu
Copy link
Contributor

@taowen-paraflow
It is unfortunate that we did the same work.
If your code could provide better performance, we could use your solution to cover the existed code. :)

Maybe we could create an issue of planned work firstly.
Then others will know it and avoid duplicated work in the future.

Thank you!

@savvadesogle
Copy link

@NeoZhangJianyu , @taowen-paraflow

I tested it and noticed a slight difference for TG.

GPU A770 (1-3x)
2699 v3 xeon
Windows 11
driver 8509

изображение

https://docs.google.com/spreadsheets/d/1zlxnxylvwhTWgMvJ50ysmMDNTXnucbc-/

@NeoZhangJianyu
Copy link
Contributor

@taowen-paraflow
Could you check the diff between you PR and current master?
It's great if you could merge the Q8 & Q4 benefit to master. :)

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants