llama : enable chunked fused GDN path #20340

Merged: ggerganov merged 9 commits into master from gg/llama-allow-gdn-ch on Mar 11, 2026
Conversation

@ggerganov (Member) commented Mar 10, 2026

cont #19504

Backends can now implement the chunked version of the fused GDN operator.
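As a minimal sketch of the decision this enables (hypothetical names; ggml's real mechanism is the backend reporting op support, so this only illustrates the shape of the choice, not actual llama.cpp code):

```c
#include <assert.h>

// Sketch only: how a graph-build path might gate on backend support for the
// chunked variant of the fused GDN op. The enum and helper are assumptions.
typedef enum {
    GDN_MODE_AUTOREGRESSIVE, // per-token recurrence
    GDN_MODE_CHUNKED         // fused chunked kernel
} gdn_mode;

static gdn_mode pick_gdn_mode(int backend_supports_chunked, int n_tokens) {
    // single-token decode always takes the autoregressive step; batched
    // prompt processing uses the chunked kernel only where implemented
    if (n_tokens > 1 && backend_supports_chunked) {
        return GDN_MODE_CHUNKED;
    }
    return GDN_MODE_AUTOREGRESSIVE;
}
```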

Implementations:

@ggerganov ggerganov requested a review from CISC as a code owner March 10, 2026 08:52
@ggerganov ggerganov requested a review from am17an March 10, 2026 08:53
@github-actions bot added labels model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Mar 10, 2026
@am17an (Collaborator) left a comment

BTW I tried the chunked kernel; it's just about equal in performance to master. It seems like cuBLAS is hard to beat, even with the sequential loop over chunks.

@ggerganov (Member, author)

Hm, I actually disabled the chunked kernel in the CUDA backend only because I thought it was not implemented. If it is ready, then we should enable it; at the very least, the ggml graph will become constant.

Btw, on DGX Spark it seems to perform better than the unfused path:

GGML_CUDA=ON ./scripts/compare-commits.sh master gg/llama-allow-gdn-ch llama-bench -m ~/models/qwen3-next-q4_0.gguf -m ~/models/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q4_K_M.gguf -m ~/models/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -fa 1 -t 1 -dio 1 -p 512,2048 -n 32 -ub 2048 -r 3
| Model | Test | t/s master | t/s gg/llama-allow-gdn-ch | Speedup |
| --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | pp512 | 849.98 | 1517.26 | 1.79 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 | 983.61 | 2072.70 | 2.11 |
| qwen35 27B Q4_K_M | pp512 | 641.32 | 728.68 | 1.14 |
| qwen35 27B Q4_K_M | pp2048 | 592.71 | 681.24 | 1.15 |
| qwen3next 80B.A3B Q4_0 | pp512 | 1182.44 | 1259.99 | 1.07 |
| qwen3next 80B.A3B Q4_0 | pp2048 | 1378.34 | 1515.44 | 1.10 |

If you can confirm correctness of the implementation, I think it is fine to enable it.

@am17an (Collaborator) commented Mar 10, 2026

I think what you're enabling currently is just the autoregressive kernel, right?

@ggerganov (Member, author)

Ok I see. It's still the autoregressive kernel iterating over all tokens.

@am17an (Collaborator) commented Mar 10, 2026

Strange that it's faster for Kimi-Linear by so much. I think it makes sense to enable it when kda is true? I see it's faster on 5090 as well. cc: @ymcki

@ggerganov (Member, author)

Is the current branch much slower on 5090 with non-KDA models?

@am17an (Collaborator) commented Mar 10, 2026

Yes, it's much slower for Qwen3.5.

@ggerganov ggerganov force-pushed the gg/llama-allow-gdn-ch branch from 444eeed to 39b6f5a Compare March 10, 2026 10:20
@ggerganov (Member, author)

Ok, enabled it only for KDA for now.

Also changed the broadcast pattern to interleaved, since this is what Qwen3.5 uses and it helps avoid explicit repeats of the Q and K tensors. Added TODOs to make the broadcast configurable, which will allow avoiding the repeats for Qwen3 Next in a similar way.
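For reference, tiled and interleaved broadcast differ only in how a head index maps onto the smaller set of shared heads; in the kernels this later shows up as head_id / rq1 (tiled) versus head_id % neq1 (interleaved). A toy sketch under assumed names (not llama.cpp identifiers):

```c
#include <assert.h>

// Toy illustration of the two broadcast conventions; function names are
// assumptions for this sketch. With 4 "wide" heads sharing 2 "narrow" heads:
//   tiled:       wide heads {0,1} -> narrow 0, {2,3} -> narrow 1
//   interleaved: wide heads {0,2} -> narrow 0, {1,3} -> narrow 1

static int shared_head_tiled(int head_id, int n_wide, int n_narrow) {
    // head_id / r, with r = n_wide / n_narrow
    return head_id / (n_wide / n_narrow);
}

static int shared_head_interleaved(int head_id, int n_narrow) {
    return head_id % n_narrow;
}
```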

After a few tests and making sure this branch works correctly, we can merge.

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
ggerganov and others added 2 commits March 10, 2026 12:28
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
@ymcki (Contributor) commented Mar 11, 2026

> Strange that it's faster for Kimi-Linear by so much. I think it makes sense to enable it when kda is true? I see it's faster on 5090 as well. cc: @ymcki

Originally, pwilkin implemented a backend-agnostic chunked implementation for pp and a recurrent implementation for inference. Recurrent is a special case of chunking in which chunk_size == tokens_len.

Later, pwilkin found that for inference, doing tokens one by one in autoregressive mode is faster than the recurrent form, so the recurrent form was replaced by the autoregressive form.
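The relationship can be seen on a toy scalar recurrence (an illustration only; the real GDN state is a matrix per head): a chunked scan with chunk_size == n collapses into the single recurrent pass, and chunk_size == 1 is the autoregressive loop.

```c
#include <assert.h>

// Toy scalar recurrence: h[t] = a[t]*h[t-1] + b[t].

// Autoregressive: one step per token.
static float scan_ar(const float *a, const float *b, int n, float h0) {
    float h = h0;
    for (int t = 0; t < n; ++t) {
        h = a[t] * h + b[t];
    }
    return h;
}

// Chunked: process chunk_size tokens per outer step, carrying the state
// across chunk boundaries. chunk_size == n is one "recurrent" pass over the
// whole sequence; chunk_size == 1 is the autoregressive loop.
static float scan_chunked(const float *a, const float *b, int n,
                          int chunk_size, float h0) {
    float h = h0;
    for (int c0 = 0; c0 < n; c0 += chunk_size) {
        const int c1 = c0 + chunk_size < n ? c0 + chunk_size : n;
        for (int t = c0; t < c1; ++t) {
            h = a[t] * h + b[t]; // a real chunked kernel batches this as matmuls
        }
    }
    return h;
}
```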

Originally, cacaview implemented the recurrent form on CPU and CUDA:

cacaview's recurrent CPU impl
// ggml_compute_forward_kda_scan
// KDA (Kimi Delta Attention) recurrence:
//   h[t] = exp(g[t]) * h[t-1] + k[t]^T * (beta[t] * (v[t] - h[t-1] @ k[t]))
//   o[t] = q[t]^T @ h[t]

static void ggml_compute_forward_kda_scan_f32(
        const ggml_compute_params * params,
        ggml_tensor * dst) {
    const ggml_tensor * src0 = dst->src[0]; // h    {head_dim, head_dim, n_head, n_seqs+}
    const ggml_tensor * src1 = dst->src[1]; // q    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src2 = dst->src[2]; // k    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src3 = dst->src[3]; // v    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src4 = dst->src[4]; // g    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src5 = dst->src[5]; // beta {n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src6 = dst->src[6]; // ids  {n_seqs}

    const int ith = params->ith;
    const int nth = params->nth;

    const int64_t head_dim     = src0->ne[0];
    const int64_t n_head       = src1->ne[1];
    const int64_t n_seq_tokens = src1->ne[2];
    const int64_t n_seqs       = src1->ne[3];

    // Output offset for hidden state
    const int64_t y_off = ggml_nelements(src1) * sizeof(float);

    GGML_ASSERT(src0->nb[0] == sizeof(float));
    GGML_ASSERT(src1->nb[0] == sizeof(float));
    GGML_ASSERT(src2->nb[0] == sizeof(float));
    GGML_ASSERT(src3->nb[0] == sizeof(float));
    GGML_ASSERT(src4->nb[0] == sizeof(float));
    GGML_ASSERT(src5->nb[0] == sizeof(float));
    GGML_ASSERT(src6->nb[0] == sizeof(int32_t));

    // Parallelize over heads
    const int dh = (n_head + nth - 1) / nth;
    const int ih0 = dh * ith;
    const int ih1 = MIN(ih0 + dh, (int)n_head);

    const int32_t * ids = (const int32_t *) src6->data;

    // Temporary buffer for h @ k computation
    float * hk_buf = (float *) malloc(head_dim * sizeof(float));

    static int debug_count = 0;
    bool do_debug = false; // (ith == 0 && debug_count++ < 20);
    
    for (int i3 = 0; i3 < n_seqs; ++i3) {
        // Get initial hidden state for this sequence
        const float * h0 = (const float *) ((const char *) src0->data + ids[i3] * src0->nb[3]);
        // Output hidden state location
        float * h_out = (float *) ((char *) dst->data + i3 * src0->nb[3] + y_off);

        for (int ih = ih0; ih < ih1; ++ih) {
            // Per-head hidden state: [head_dim, head_dim]
            // Copy initial state to output (will be updated in place)
            const float * h_in = h0 + ih * head_dim * head_dim;
            float * h = h_out + ih * head_dim * head_dim;
            
            // Copy initial state, but check for invalid values and clear if needed
            bool need_clear = false;
            for (int i = 0; i < head_dim * head_dim && !need_clear; ++i) {
                if (!isfinite(h_in[i]) || fabsf(h_in[i]) > 1e6f) {
                    need_clear = true;
                }
            }
            for (int i = 0; i < head_dim * head_dim; ++i) {
                h[i] = need_clear ? 0.0f : h_in[i];
            }

            for (int it = 0; it < n_seq_tokens; ++it) {
                const float * q_raw = (const float *) ((const char *) src1->data + 
                    it * src1->nb[2] + i3 * src1->nb[3]) + ih * head_dim;
                const float * k_raw = (const float *) ((const char *) src2->data + 
                    it * src2->nb[2] + i3 * src2->nb[3]) + ih * head_dim;
                const float * v = (const float *) ((const char *) src3->data + 
                    it * src3->nb[2] + i3 * src3->nb[3]) + ih * head_dim;
                const float * g = (const float *) ((const char *) src4->data + 
                    it * src4->nb[2] + i3 * src4->nb[3]) + ih * head_dim;
                const float beta = ((const float *) ((const char *) src5->data + 
                    it * src5->nb[1] + i3 * src5->nb[2]))[ih];
                
                float * y = (float *) dst->data + 
                    it * n_head * head_dim + i3 * n_seq_tokens * n_head * head_dim + ih * head_dim;

                // L2 normalize q and k (critical for KDA stability)
                float q_norm = 0.0f, k_norm = 0.0f;
                for (int i = 0; i < head_dim; ++i) {
                    q_norm += q_raw[i] * q_raw[i];
                    k_norm += k_raw[i] * k_raw[i];
                }
                q_norm = sqrtf(q_norm + 1e-6f);
                k_norm = sqrtf(k_norm + 1e-6f);
                
                // Debug output
                if (do_debug && ih == 0 && it == 0 && i3 == 0) {
                    fprintf(stderr, "DEBUG KDA: q_raw[0]=%f, k_raw[0]=%f, v[0]=%f, g[0]=%f, beta=%f\n",
                            q_raw[0], k_raw[0], v[0], g[0], beta);
                    fprintf(stderr, "DEBUG KDA: q_norm=%f, k_norm=%f, exp(g[0])=%f, scale=%f\n",
                            q_norm, k_norm, expf(g[0]), 1.0f / sqrtf((float)head_dim));
                }
                
                // Normalized q and k with scale = 1/sqrt(head_dim)
                // Note: scale is applied only to q after L2 normalization
                const float scale = 1.0f / sqrtf((float)head_dim);
                float q[128], k[128];  // assume head_dim <= 128
                for (int i = 0; i < head_dim; ++i) {
                    // L2 normalize then scale q
                    q[i] = (q_raw[i] / q_norm) * scale;
                    // L2 normalize k (no scale)
                    k[i] = k_raw[i] / k_norm;
                }

                // KDA recurrence: h[t] = exp(g[t]) * h[t-1] + k[t]^T * (beta[t] * (v[t] - h[t-1] @ k[t]))
                // Note: Apply decay first, then compute retrieval and update

                // Step 1: Apply decay to h first: h = h * exp(g)
                // Clamp g to [-80, 80] to avoid numerical overflow
                for (int i = 0; i < head_dim; ++i) {
                    const float g_clamped = fminf(fmaxf(g[i], -80.0f), 80.0f);
                    const float exp_gi = expf(g_clamped);
                    for (int j = 0; j < head_dim; ++j) {
                        h[i * head_dim + j] *= exp_gi;
                    }
                }

                // Step 2: Compute h^T @ k -> hk_buf [head_dim]
                // hk_buf[j] = sum_i (h[i,j] * k[i]) which is column j of h dotted with k
                for (int j = 0; j < head_dim; ++j) {
                    float sum = 0.0f;
                    for (int i = 0; i < head_dim; ++i) {
                        sum += h[i * head_dim + j] * k[i];
                    }
                    hk_buf[j] = sum;
                }

                // Step 3: Compute delta = beta * (v - hk) and update h
                // h = h + outer(k, delta) where outer(k,delta)[i,j] = k[i] * delta[j]
                for (int i = 0; i < head_dim; ++i) {
                    for (int j = 0; j < head_dim; ++j) {
                        const float delta_j = beta * (v[j] - hk_buf[j]);
                        h[i * head_dim + j] += k[i] * delta_j;
                    }
                }

                // Step 4: Compute output y = h^T @ q -> [head_dim]
                // vLLM: b_o = tl.sum(b_h * b_q[:, None], 0) means o[j] = sum_i(h[i,j] * q[i])
                for (int j = 0; j < head_dim; ++j) {
                    float sum = 0.0f;
                    for (int i = 0; i < head_dim; ++i) {
                        sum += h[i * head_dim + j] * q[i];
                    }
                    y[j] = sum;
                }
                
                // Debug output
                if (do_debug && ih == 0 && it == 0 && i3 == 0) {
                    // Find max abs value in h for stability check
                    float h_max = 0.0f;
                    for (int i = 0; i < head_dim * head_dim; i++) {
                        if (fabsf(h[i]) > h_max) h_max = fabsf(h[i]);
                    }
                    fprintf(stderr, "DEBUG KDA: y[0]=%.6f, h_max=%.6f, exp(g[0])=%.6f\n",
                            y[0], h_max, expf(g[0]));
                }
            }
        }
    }

    free(hk_buf);
}

As you can see, aside from some hard-coded numbers, it is a cleaner implementation than the reshape and solve_tri used in pwilkin's and my backend-agnostic chunking implementations. So if your implementation is along these lines, then it is not surprising that it is much faster.

But this is a recurrent implementation; I'm not sure whether the chunked version of it can be faster or not.

@am17an (Collaborator) commented Mar 11, 2026

Yes, this is roughly what the current recurrent version in master looks like. We need to figure out the boundary between the chunked and autoregressive versions; clearly it's not 1, and it is also device-dependent.
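A hedged sketch of that boundary (the struct, names, and threshold value are illustrative assumptions, not proposed llama.cpp code): the crossover would become a per-device constant consulted at dispatch time.

```c
#include <assert.h>

// Sketch only: a per-device crossover for chunked vs. autoregressive GDN.
typedef struct {
    int gdn_chunk_threshold; // smallest n_tokens where chunked wins on this device
} device_ctx;

static int use_chunked_gdn(const device_ctx *dev, int n_tokens) {
    // Below the crossover the autoregressive kernel is faster; at or above
    // it, the chunked kernel amortizes its extra setup work.
    return n_tokens >= dev->gdn_chunk_threshold;
}
```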

ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 11, 2026
Change rq1 (tiled: head_id / rq1) to neq1 (interleaved: head_id % neq1)
to match the broadcast semantics from PR ggml-org#20340.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CISC (Member) commented Mar 11, 2026

> Happy to discuss it or not, up to you but this tone is not productive. If you don't want to collaborate, just say so and I'll refrain.

Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.

@ggerganov (Member, author)

Is it just me, or did the GitHub runners become very slow lately? I.e., it takes very long for jobs to get picked up.

@lhez (Collaborator) commented Mar 11, 2026

> Is it just me or did the Github runners become very slow lately? I.e. it takes very long to pick up the jobs.

I see the same. For me, it now takes about a day to finish all CI jobs.

@CISC (Member) commented Mar 11, 2026

> Is it just me or did the Github runners become very slow lately? I.e. it takes very long to pick up the jobs.

It is not just you; I've been regularly slaying queues once required tests have finished (or the PR has merged), just to ensure everything doesn't completely grind to a halt...

@ggerganov (Member, author)

Yeah, the trend is clear. I think we have to move as much as possible from the CI to self-hosted runners very soon.

Queue time for the last 6 months (charts attached in the original comment)

@ProgenyAlpha (Contributor) commented Mar 11, 2026

> Happy to discuss it or not, up to you but this tone is not productive. If you don't want to collaborate, just say so and I'll refrain.
>
> Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.

What exactly about my comment was tiresome?

I shouldn't have to explain this, but I have dyslexia. I use AI the same way someone might use spellcheck: to format and catch errors. That is the extent of it. Spellcheck works for short discussions like this, not technical posts. It's 100% my words, my concepts, and my posts, and I have strict guardrails in place to keep it that way.

Perhaps a CODEOFCONDUCT.md needs to be made, as am17an's response was unreasonable, and the time spent on this exchange and the sudden commentary on multiple PRs has wasted far more time than briefly reading my formatted technical post ever would have.

If this is how outside contributors are treated for using accessibility tools, I have no choice but to stop contributing to this project once I've completed my PRs.

This is what my post looks like without formatting tools:

Hey am17an ggerganov I noticed we're both hitting the same issue the crossover point where chunked beats autoregressive varies wildly depending on GPU head count state size and KDA vs non-KDA hardcoding a threshold or disabling chunked entirely for certain configs feels fragile what do you think about a lightweight runtime calibration approach in the backend dispatch layer the idea is on first use or device init run a quick microbenchmark fire both the AR and chunked kernels on a small synthetic input for each S_V KDA config the model needs find the crossover n_tokens where chunked starts winning cache the result per device so it only runs once at dispatch time just check n_tokens >= threshold something like this in the backend e.g. ggml-vulkan.cpp uint32_t gdn_chunk_threshold[3][2] size_idx kda in dispatch if n_tokens >= ctx->device->gdn_chunk_threshold[size_idx][kda] dispatch_chunked ctx subctx dst else dispatch_ar ctx subctx dst each backend Vulkan CUDA Metal would calibrate independently since the crossover is completely different per backend and hardware shaders stay untouched its purely a dispatch decision im happy to prototype this for Vulkan in my chunked PR #20377 if you think its worth pursuing just wanted to float the idea before putting in the work since it could also help on the CUDA side where KDA favors AR even at 512 tokens

If you really think this would be better than what I originally wrote... then I owe you ALL an apology.

@CISC (Member) commented Mar 11, 2026

> Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.
>
> What exactly about my comment was tiresome?

I'll reshuffle the words: it is extremely tiresome for everyone every time the policy is not followed.

@ggerganov ggerganov merged commit d28961d into master Mar 11, 2026
7 of 75 checks passed
@ggerganov ggerganov deleted the gg/llama-allow-gdn-ch branch March 11, 2026 20:47
@CISC (Member) commented Mar 11, 2026

> If you really think this would be better than what I originally wrote... then I owe you ALL an apology.

It's about perception: one does not feel valued in a conversation if it seems artificially one-sided. Nothing against your wording or your need for tools; TBH I find your original text just fine, just insert a few newlines. Have more faith in your skills. :)

@sultanqasim commented Mar 11, 2026

@ProgenyAlpha the issue is that when one sees a clearly AI-written, or at least AI-formatted, comment, it's hard to tell how much effort a human put into it, and whether or not it's worth human time and attention to read and respond to it. LLMs can quickly and cheaply produce millions of long, detailed, coherent-sounding comments that may or may not be bullshit. With the amount of LLM-generated content getting produced these days, including PRs and comments on repos, it becomes tiresome to read and understand everything that LLMs produce. A policy of requiring all comments to be written by a human makes it easier to decide whether it's worth another human spending their own time to read and respond.

Your writing is understandable as is, just use some punctuation, newlines, and perhaps backticks for inline code snippets.

@ProgenyAlpha (Contributor)

@sultanqasim Thank you for your feedback. I have no issue with the policy if someone is blatantly violating it; I'm not violating it. The spam concern is valid in the abstract but doesn't apply here. I did not submit a long, detailed, LLM-generated post, and all of the things you say my original post needed are exactly what was done to it.

I have open PRs, I'm a new, responsive, and friendly active contributor, and my comment was a direct technical proposal to two people about a problem they both named; it was personal and friendly, not AI.

What was NOT personal and friendly was am17an's response, which was far more egregious than what everyone seems to be trying to protect against.

@CISC Ironically, one does not feel valued for being unfairly nitpicked.

@am17an (Collaborator) commented Mar 12, 2026

> I'm not violating it

Yes you are. Please read https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Seems like you already acknowledged this in #20334 (comment)

You are also violating another policy

> If you are a new contributor, limit your open PRs to 1.

Sorry for being rude; I understand your intentions might be good. However, these policies exist so that maintainers do not get overloaded and to ensure that human-driven communication happens.

@ProgenyAlpha (Contributor) commented Mar 12, 2026

> Yes you are. Please read https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Except I'm not. I wrote everything. Nothing was written by AI. Nowhere in CONTRIBUTING.md does it prohibit using AI to format a technical post before posting it.

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).
>
> Seems like you already acknowledged this in #20334 (comment)

Another missed detail. This comment on my PR was posted immediately following your rude comment here and I gave it a thumbs up out of respect for the POLITE way it was brought up and to keep things isolated to this PR.

> You are also violating another policy
>
> If you are a new contributor, limit your open PRs to 1.

This policy A) was written 3 days ago, after I started contributing weeks ago, and B) the additional PRs were explicitly made after I was directed to do so by @ggerganov and @0cc4m.

> Sorry for being rude, I understand your intentions might be good. However these policies exist for maintainers to not get overloaded and ensure human-driven communication happens.

I genuinely think your reaction, tone, and delivery have nothing to do with a desire to maintain the integrity of anything. Every communication from me has been human-driven, and to act like I'm responsible for overloading the team, after the drama you've created with your callous reply, is a bit silly.

I have dyslexia. Using accessibility tools is not the same as AI authorship, and conflating the two is something I'd hope a project of this caliber would understand. I hope we can move past this and focus on the work.

@am17an (Collaborator) commented Mar 12, 2026

> Except I'm not. I wrote everything. Nothing was written by AI. Nowhere in CONTRIBUTING.md does it prohibit using AI to format a technical post before posting it.

Perhaps it is not clear to you or your AI, but formatting a technical post counts as being written by AI. At this point I think you're being disingenuous, and I will not engage with you anymore. Good luck.

@ggml-org ggml-org locked as too heated and limited conversation to collaborators Mar 12, 2026
@ggml-org ggml-org unlocked this conversation Mar 12, 2026
@ProgenyAlpha (Contributor) commented Mar 12, 2026

> Perhaps it is not clear to you or your AI, but formatting a technical post counts as being written by AI. At this point I think you're being disingenuous and I will not engage with you anymore. Good luck.

Hard disagree, and your ignoring my other talking points and refusing to have a dialogue just reinforces that your sole focus here had nothing to do with maintaining human-to-human communication. I disclosed my disability to you, and you returned with a dismissive and dehumanizing "it's not clear to you and your AI" comment.

Anyone else find it ironic the thread was locked by an AI bot for being too heated? Lol, too funny.

ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* llama : enable chunked fused GDN path

* models : avoid Q and K repeats when using fused GDA

* cont : fix comment

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix the fix

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix

* metal : add GDN kernel (ggml-org#20361)

* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(ggml-org#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
  tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cont : unslop + optimize

* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* CUDA: AR gated delta net improvements (ggml-org#20391)

* Add FastDiv to gated_delta_net_cuda

* Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

* Remove unneded include in gated_delta_net.cu

* Improve comments

* Apply code-formating

* Make sharding HIP-compatible

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA

* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

* Rename variables

* Enable GDN also for prefill, move TODO for chunked_GDN

* Actually remove the TODO from 2068908

* Get warp size at runtime

warp_size is not known at compile time in hip host code.

* Don't expose ggml_cuda_get_physical_warp_size on host

---------

Co-authored-by: uvos <devnull@uvos.xyz>

* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
Adapt to the interleaved broadcast convention from ggml-org#20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
0cc4m pushed a commit that referenced this pull request Mar 12, 2026
* vulkan: add GATED_DELTA_NET op support

Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: optimize GATED_DELTA_NET shader (Phase 1)

- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: address review feedback for GATED_DELTA_NET

Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: add explicit FLOAT_TYPE casts for buffer loads

Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: fix Q/K broadcast for interleaved head layout

Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ZeroV0LT pushed a commit to ZeroV0LT/llama.cpp that referenced this pull request Mar 12, 2026
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by ggml-org#20340 (d28961d).
Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
ZeroV0LT pushed a commit to ZeroV0LT/llama.cpp that referenced this pull request Mar 12, 2026
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by ggml-org#20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
ggerganov pushed a commit that referenced this pull request Mar 13, 2026
…0468)

* llama : fix pooling assertion crash in chunked GDN detection path

The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.

* server : add mean pooling tests to embedding test suite

Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.

---------

Co-authored-by: Domenico Crupi <domenico@zerovolt.it>

Labels

Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)), ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)
