sycl: add GGML_OP_GATED_DELTA_NET fused kernel #20571
taowen-paraflow wants to merge 1 commit into ggml-org:master
Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend, enabling Qwen3.5 and other delta-net models to run on Intel GPUs via oneAPI.

Kernel features:
- Supports both GDA (scalar gate) and KDA (vector gate / key-dependent) modes
- Head sizes 32, 64, 128 via compile-time templates
- GQA/MQA support through stride-based tensor access
- Float4 vectorized inner loops matching the GLA kernel pattern
- One workgroup per (head, seq) with S_V threads; state held in registers

Tested on Intel Arc 140V (Lunar Lake) with Qwen3.5-0.8B-Q4_K_M:
- Before (GDN fallback to CPU): 22.0 tok/s decode
- After (GDN fused on GPU): 54.0 tok/s decode (+145%)
- Prompt eval: 23.1 tok/s (vs Vulkan 2.0 tok/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
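For readers unfamiliar with the recurrence being ported, here is a minimal scalar sketch of one token step of the gated delta rule as it is commonly written. It is not the kernel from this PR. The function name `gated_delta_step`, the row-major state layout, the log-space gate, and the omission of any output scaling are all assumptions made for illustration. Each iteration of the outer loop touches only one column of the state, which is the property that lets a kernel give one thread per value index its own column.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference for ONE token of the gated delta rule (illustrative only).
// S is a d_k x d_v state stored row-major: S[i * d_v + j], i = key index, j = value index.
// gate holds log-space decay (assumption): one value for a scalar gate (GDA-style),
// or d_k values for a per-key-index gate (KDA-style).
void gated_delta_step(std::vector<float> & S,
                      const std::vector<float> & q,
                      const std::vector<float> & k,
                      const std::vector<float> & v,
                      const std::vector<float> & gate,
                      float beta,
                      std::vector<float> & out) {
    const std::size_t d_k = k.size();
    const std::size_t d_v = v.size();
    const bool vector_gate = gate.size() > 1;  // more than one entry => per-key decay
    out.assign(d_v, 0.0f);

    // Each j is independent of the others: it reads and writes only column j of S.
    for (std::size_t j = 0; j < d_v; ++j) {
        float kv = 0.0f;
        for (std::size_t i = 0; i < d_k; ++i) {
            float & s = S[i * d_v + j];
            s *= std::exp(gate[vector_gate ? i : 0]);  // 1) decay the old state
            kv += s * k[i];                            // 2) value the state predicts for key k
        }
        const float delta = (v[j] - kv) * beta;        // 3) correction toward the new value
        float o = 0.0f;
        for (std::size_t i = 0; i < d_k; ++i) {
            float & s = S[i * d_v + j];
            s += k[i] * delta;                         // 4) rank-1 state update
            o += s * q[i];                             // 5) read out with the query
        }
        out[j] = o;
    }
}
```

With a zero initial state, `beta = 1` and a unit-norm key, one step stores `v` under `k`, so querying with the same `k` returns `v`. Repeating the step with a decaying gate first shrinks the stored value and then corrects it back to `v`.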
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.
Hello

Windows 11

Models:

1x A770

Qwen3.5-0.8B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-2B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-4B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-4B-Q8_0
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-9B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-9B-Q8_0
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

2x A770

Qwen3.5-27B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-27B-Q8_0
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32

Qwen3.5-35B-A3B-Q4_K_M
- b8339 (mainline)
- sycl-gated-delta-net F16
- sycl-gated-delta-net F32
- Vulkan

Qwen3.5-35B-A3B-Q8_0

Sorry. I haven't tested the Q8 quant with 3x GPUs. 🤷♂️ But I have numbers for Vulkan.
Duplicate of #20455.
@taowen-paraflow Maybe we could first create an issue for the planned work. Thank you!
@NeoZhangJianyu, @taowen-paraflow I tested it and noticed a slight difference in TG (token generation). GPU: A770 (1-3x)
https://docs.google.com/spreadsheets/d/1zlxnxylvwhTWgMvJ50ysmMDNTXnucbc-/
@taowen-paraflow Thank you!





Summary
- Port the Gated Delta Net (GDN) recurrence from the Vulkan compute shader (gated_delta_net.comp) to the SYCL backend

Implementation
New files:
- ggml/src/ggml-sycl/gdn.cpp: fused kernel implementation
- ggml/src/ggml-sycl/gdn.hpp: header

Modified files:
- ggml/src/ggml-sycl/backend.hpp: add include
- ggml/src/ggml-sycl/ggml-sycl.cpp: add dispatch case and supports_op entry

Kernel features:
- sycl::float4 vectorized inner loops (same pattern as existing gla.cpp)
- S_V threads per workgroup, state held in registers

Benchmark
Tested on Intel Arc 140V (Lunar Lake iGPU) with Qwen3.5-0.8B-Q4_K_M and -ngl 99.

The decode improvement comes from the GDN layers now running as a fused kernel on the GPU instead of falling back to per-op CPU execution.
Test plan
- test-backend-ops passes GATED_DELTA_NET tests (test cases already exist in upstream)

🤖 Generated with Claude Code