vulkan: f16 mixed-precision state for GATED_DELTA_NET #20376
Open
ProgenyAlpha wants to merge 4 commits into ggml-org:master from
Conversation
Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for the KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store state in float16_t registers to halve register pressure and bandwidth. Accumulation stays in float32 for accuracy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Follow-up to #20334. Splits out the f16 mixed-precision state optimization into its own PR per @0cc4m's feedback.
Stores the 128-element state array in float16_t, keeps all arithmetic in float32. No precision loss (13/13 backend-ops tests passing). Lower register pressure gives a measurable PP boost.

Depends on #20334.
890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):
The f16 pipeline auto-selects when the device supports shaderFloat16, and falls back to f32 otherwise.