
vulkan: f16 mixed-precision state for GATED_DELTA_NET #20376

Open

ProgenyAlpha wants to merge 4 commits into ggml-org:master from ProgenyAlpha:vulkan-gdn-f16

Conversation

@ProgenyAlpha

Follow-up to #20334. Splits out the f16 mixed-precision state optimization into its own PR per @0cc4m's feedback.

Stores the 128-element state array in float16_t, keeps all arithmetic in float32. No precision loss (13/13 backend-ops tests passing). Lower register pressure gives a measurable PP boost.

Depends on #20334
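The storage/compute split described above can be illustrated numerically. This is a hedged numpy sketch, not the shader code: the state is *stored* in float16 but every arithmetic step widens to float32 first, which is why the backend-ops tests see no precision loss. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of the mixed-precision pattern: the 128-element
# per-head state is stored in float16, while all arithmetic happens in
# float32 after widening. Constants below are made up for illustration.
HEAD = 128
state_f16 = np.zeros(HEAD, dtype=np.float16)   # low-precision storage
k = rng.standard_normal(HEAD).astype(np.float32)
v = np.float32(0.5)
decay = np.float32(0.99)

for _ in range(4):                             # a few recurrence steps
    s = state_f16.astype(np.float32)           # widen for arithmetic
    s = decay * s + v * k                      # all math in float32
    state_f16 = s.astype(np.float16)           # narrow only on store

# Reference computation kept entirely in float32
ref = np.zeros(HEAD, dtype=np.float32)
for _ in range(4):
    ref = decay * ref + v * k

max_err = np.max(np.abs(state_f16.astype(np.float32) - ref))
print(max_err < 1e-2)
```

Only the store/load crosses the f16 boundary, so rounding error accumulates per step at float16 resolution rather than compounding through the arithmetic itself.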

890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):

| Metric | Without f16 | With f16 | Change |
|--------|-------------|----------|--------|
| PP-512 | 165.31 t/s | 174.54 t/s | +5.6% |
| TG-128 | 21.16 t/s | 21.48 t/s | +1.5% |

The f16 pipeline is selected automatically when the device supports shaderFloat16, and falls back to f32 otherwise.

ProgenyAlpha and others added 4 commits March 10, 2026 04:00
Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
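The recurrence this commit implements can be sketched numerically. This is a hedged, tiny-scale numpy illustration of a gated delta-rule state update; the exact gating and normalization in the shader may differ, and the head size is shrunk for readability (the shader supports 32/64/128).

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4  # tiny head size for illustration only

# Hedged sketch of a gated delta-net style recurrence: decay the state
# matrix by a gate, then apply a rank-1 delta-rule correction per token.
S = np.zeros((D, D), dtype=np.float32)      # per-head state matrix
for _ in range(3):                          # a few tokens
    q = rng.standard_normal(D).astype(np.float32)
    k = rng.standard_normal(D).astype(np.float32)
    k /= np.linalg.norm(k)                  # normalized key
    v = rng.standard_normal(D).astype(np.float32)
    g = np.float32(0.95)                    # scalar gate (KDA would use a vector gate)
    beta = np.float32(0.5)                  # update strength

    S *= g                                  # gated decay of the state
    pred = S @ k                            # current prediction for key k
    S += beta * np.outer(v - pred, k)       # rank-1 delta-rule update
    o = S @ q                               # per-token output

print(o.shape)  # (4,)
```

In the shader the decay and rank-1 update are fused into vec4 operations over the state rows; the KDA vector-gate path replaces the scalar `g` with a per-channel gate.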
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor pipelines into a [3][2] array, add A_TYPE/D_TYPE/FLOAT_TYPE shader macros, move scale into push constants, fix supports_op, and restructure dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store state in float16_t registers to halve register pressure and
bandwidth. Accumulation stays in float32 for accuracy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Mar 11, 2026.