llama : enable chunked fused GDN path #20340

Merged: ggerganov merged 9 commits into master from gg/llama-allow-gdn-ch on Mar 11, 2026
Conversation

@ggerganov (Member) commented Mar 10, 2026

cont #19504

Backends can now implement the chunked version of the fused GDN operator.
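As a minimal sketch of the decision this enables (hypothetical names; ggml's real mechanism is the backend reporting op support, so this only illustrates the shape of the choice, not actual llama.cpp code):

```c
#include <assert.h>

// Sketch only: how a graph-build path might gate on backend support for the
// chunked variant of the fused GDN op. The enum and helper are assumptions.
typedef enum {
    GDN_MODE_AUTOREGRESSIVE, // per-token recurrence
    GDN_MODE_CHUNKED         // fused chunked kernel
} gdn_mode;

static gdn_mode pick_gdn_mode(int backend_supports_chunked, int n_tokens) {
    // single-token decode always takes the autoregressive step; batched
    // prompt processing uses the chunked kernel only where implemented
    if (n_tokens > 1 && backend_supports_chunked) {
        return GDN_MODE_CHUNKED;
    }
    return GDN_MODE_AUTOREGRESSIVE;
}
```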

Implementations:

@ggerganov ggerganov requested a review from CISC as a code owner March 10, 2026 08:52
@ggerganov ggerganov requested a review from am17an March 10, 2026 08:53
@github-actions bot added labels model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Mar 10, 2026
@am17an (Collaborator) left a comment

BTW I tried the chunked kernel; it's just about equal in performance to master. It seems like cuBLAS is hard to beat, even with the sequential loop over chunks.

@ggerganov (Member, author)

Hm, I actually disabled the chunked kernel in the CUDA backend only because I thought it was not implemented. If it is ready, then we should enable it; at the very least, the ggml graph will become constant.

Btw, on DGX Spark it seems to perform better than the unfused path:

GGML_CUDA=ON ./scripts/compare-commits.sh master gg/llama-allow-gdn-ch llama-bench -m ~/models/qwen3-next-q4_0.gguf -m ~/models/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q4_K_M.gguf -m ~/models/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -fa 1 -t 1 -dio 1 -p 512,2048 -n 32 -ub 2048 -r 3
| Model | Test | t/s master | t/s gg/llama-allow-gdn-ch | Speedup |
| --- | --- | --- | --- | --- |
| kimi-linear 48B.A3B Q4_K_M | pp512 | 849.98 | 1517.26 | 1.79 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 | 983.61 | 2072.70 | 2.11 |
| qwen35 27B Q4_K_M | pp512 | 641.32 | 728.68 | 1.14 |
| qwen35 27B Q4_K_M | pp2048 | 592.71 | 681.24 | 1.15 |
| qwen3next 80B.A3B Q4_0 | pp512 | 1182.44 | 1259.99 | 1.07 |
| qwen3next 80B.A3B Q4_0 | pp2048 | 1378.34 | 1515.44 | 1.10 |

If you can confirm correctness of the implementation, I think it is fine to enable it.

@am17an (Collaborator) commented Mar 10, 2026

I think what you're enabling currently is just the autoregressive kernel, right?

@ggerganov (Member, author)

Ok I see. It's still the autoregressive kernel iterating over all tokens.

@am17an (Collaborator) commented Mar 10, 2026

Strange that it's faster for Kimi-Linear by so much. I think it makes sense to enable it when kda is true? I see it's faster on 5090 as well. cc: @ymcki

@ggerganov (Member, author)

Is the current branch much slower on 5090 with non-KDA models?

@am17an (Collaborator) commented Mar 10, 2026

Yes, it's much slower for Qwen3.5.

@ggerganov ggerganov force-pushed the gg/llama-allow-gdn-ch branch from 444eeed to 39b6f5a Compare March 10, 2026 10:20
@ggerganov (Member, author)

Ok, enabled it only for KDA for now.

Also changed the broadcast pattern to interleaved, since this is what Qwen3.5 uses and it helps avoid explicit repeats of the Q and K tensors. Added TODOs to make the broadcast configurable, which will allow avoiding the repeats for Qwen3 Next in a similar way.
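For reference, tiled and interleaved broadcast differ only in how a head index maps onto the smaller set of shared heads; in the kernels this later shows up as head_id / rq1 (tiled) versus head_id % neq1 (interleaved). A toy sketch under assumed names (not llama.cpp identifiers):

```c
#include <assert.h>

// Toy illustration of the two broadcast conventions; function names are
// assumptions for this sketch. With 4 "wide" heads sharing 2 "narrow" heads:
//   tiled:       wide heads {0,1} -> narrow 0, {2,3} -> narrow 1
//   interleaved: wide heads {0,2} -> narrow 0, {1,3} -> narrow 1

static int shared_head_tiled(int head_id, int n_wide, int n_narrow) {
    // head_id / r, with r = n_wide / n_narrow
    return head_id / (n_wide / n_narrow);
}

static int shared_head_interleaved(int head_id, int n_narrow) {
    return head_id % n_narrow;
}
```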

After a few tests and making sure this branch works correctly, we can merge.

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
ggerganov and others added 2 commits March 10, 2026 12:28
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
@ymcki (Contributor) commented Mar 11, 2026

> Strange that it's faster for Kimi-Linear by so much. I think it makes sense to enable it when kda is true? I see it's faster on 5090 as well. cc: @ymcki

Originally, pwilkin implemented a backend-agnostic chunked implementation for pp and a recurrent implementation for inference. Recurrent is a special case of chunking in which chunk_size == tokens_len.

Later, pwilkin found that for inference, doing tokens one by one in autoregressive mode is faster than the recurrent form, so the recurrent form was replaced by the autoregressive form.
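The relationship can be seen on a toy scalar recurrence (an illustration only; the real GDN state is a matrix per head): a chunked scan with chunk_size == n collapses into the single recurrent pass, and chunk_size == 1 is the autoregressive loop.

```c
#include <assert.h>

// Toy scalar recurrence: h[t] = a[t]*h[t-1] + b[t].

// Autoregressive: one step per token.
static float scan_ar(const float *a, const float *b, int n, float h0) {
    float h = h0;
    for (int t = 0; t < n; ++t) {
        h = a[t] * h + b[t];
    }
    return h;
}

// Chunked: process chunk_size tokens per outer step, carrying the state
// across chunk boundaries. chunk_size == n is one "recurrent" pass over the
// whole sequence; chunk_size == 1 is the autoregressive loop.
static float scan_chunked(const float *a, const float *b, int n,
                          int chunk_size, float h0) {
    float h = h0;
    for (int c0 = 0; c0 < n; c0 += chunk_size) {
        const int c1 = c0 + chunk_size < n ? c0 + chunk_size : n;
        for (int t = c0; t < c1; ++t) {
            h = a[t] * h + b[t]; // a real chunked kernel batches this as matmuls
        }
    }
    return h;
}
```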

Originally, cacaview implemented the recurrent form on CPU and CUDA:

cacaview's recurrent CPU impl
// ggml_compute_forward_kda_scan
// KDA (Kimi Delta Attention) recurrence:
//   h[t] = exp(g[t]) * h[t-1] + k[t]^T * (beta[t] * (v[t] - h[t-1] @ k[t]))
//   o[t] = q[t]^T @ h[t]

static void ggml_compute_forward_kda_scan_f32(
        const ggml_compute_params * params,
        ggml_tensor * dst) {
    const ggml_tensor * src0 = dst->src[0]; // h    {head_dim, head_dim, n_head, n_seqs+}
    const ggml_tensor * src1 = dst->src[1]; // q    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src2 = dst->src[2]; // k    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src3 = dst->src[3]; // v    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src4 = dst->src[4]; // g    {head_dim, n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src5 = dst->src[5]; // beta {n_head, n_seq_tokens, n_seqs}
    const ggml_tensor * src6 = dst->src[6]; // ids  {n_seqs}

    const int ith = params->ith;
    const int nth = params->nth;

    const int64_t head_dim     = src0->ne[0];
    const int64_t n_head       = src1->ne[1];
    const int64_t n_seq_tokens = src1->ne[2];
    const int64_t n_seqs       = src1->ne[3];

    // Output offset for hidden state
    const int64_t y_off = ggml_nelements(src1) * sizeof(float);

    GGML_ASSERT(src0->nb[0] == sizeof(float));
    GGML_ASSERT(src1->nb[0] == sizeof(float));
    GGML_ASSERT(src2->nb[0] == sizeof(float));
    GGML_ASSERT(src3->nb[0] == sizeof(float));
    GGML_ASSERT(src4->nb[0] == sizeof(float));
    GGML_ASSERT(src5->nb[0] == sizeof(float));
    GGML_ASSERT(src6->nb[0] == sizeof(int32_t));

    // Parallelize over heads
    const int dh = (n_head + nth - 1) / nth;
    const int ih0 = dh * ith;
    const int ih1 = MIN(ih0 + dh, (int)n_head);

    const int32_t * ids = (const int32_t *) src6->data;

    // Temporary buffer for h @ k computation
    float * hk_buf = (float *) malloc(head_dim * sizeof(float));

    static int debug_count = 0;
    bool do_debug = false; // (ith == 0 && debug_count++ < 20);
    
    for (int i3 = 0; i3 < n_seqs; ++i3) {
        // Get initial hidden state for this sequence
        const float * h0 = (const float *) ((const char *) src0->data + ids[i3] * src0->nb[3]);
        // Output hidden state location
        float * h_out = (float *) ((char *) dst->data + i3 * src0->nb[3] + y_off);

        for (int ih = ih0; ih < ih1; ++ih) {
            // Per-head hidden state: [head_dim, head_dim]
            // Copy initial state to output (will be updated in place)
            const float * h_in = h0 + ih * head_dim * head_dim;
            float * h = h_out + ih * head_dim * head_dim;
            
            // Copy initial state, but check for invalid values and clear if needed
            bool need_clear = false;
            for (int i = 0; i < head_dim * head_dim && !need_clear; ++i) {
                if (!isfinite(h_in[i]) || fabsf(h_in[i]) > 1e6f) {
                    need_clear = true;
                }
            }
            for (int i = 0; i < head_dim * head_dim; ++i) {
                h[i] = need_clear ? 0.0f : h_in[i];
            }

            for (int it = 0; it < n_seq_tokens; ++it) {
                const float * q_raw = (const float *) ((const char *) src1->data + 
                    it * src1->nb[2] + i3 * src1->nb[3]) + ih * head_dim;
                const float * k_raw = (const float *) ((const char *) src2->data + 
                    it * src2->nb[2] + i3 * src2->nb[3]) + ih * head_dim;
                const float * v = (const float *) ((const char *) src3->data + 
                    it * src3->nb[2] + i3 * src3->nb[3]) + ih * head_dim;
                const float * g = (const float *) ((const char *) src4->data + 
                    it * src4->nb[2] + i3 * src4->nb[3]) + ih * head_dim;
                const float beta = ((const float *) ((const char *) src5->data + 
                    it * src5->nb[1] + i3 * src5->nb[2]))[ih];
                
                float * y = (float *) dst->data + 
                    it * n_head * head_dim + i3 * n_seq_tokens * n_head * head_dim + ih * head_dim;

                // L2 normalize q and k (critical for KDA stability)
                float q_norm = 0.0f, k_norm = 0.0f;
                for (int i = 0; i < head_dim; ++i) {
                    q_norm += q_raw[i] * q_raw[i];
                    k_norm += k_raw[i] * k_raw[i];
                }
                q_norm = sqrtf(q_norm + 1e-6f);
                k_norm = sqrtf(k_norm + 1e-6f);
                
                // Debug output
                if (do_debug && ih == 0 && it == 0 && i3 == 0) {
                    fprintf(stderr, "DEBUG KDA: q_raw[0]=%f, k_raw[0]=%f, v[0]=%f, g[0]=%f, beta=%f\n",
                            q_raw[0], k_raw[0], v[0], g[0], beta);
                    fprintf(stderr, "DEBUG KDA: q_norm=%f, k_norm=%f, exp(g[0])=%f, scale=%f\n",
                            q_norm, k_norm, expf(g[0]), 1.0f / sqrtf((float)head_dim));
                }
                
                // Normalized q and k with scale = 1/sqrt(head_dim)
                // Note: scale is applied only to q after L2 normalization
                const float scale = 1.0f / sqrtf((float)head_dim);
                float q[128], k[128];  // assume head_dim <= 128
                for (int i = 0; i < head_dim; ++i) {
                    // L2 normalize then scale q
                    q[i] = (q_raw[i] / q_norm) * scale;
                    // L2 normalize k (no scale)
                    k[i] = k_raw[i] / k_norm;
                }

                // KDA recurrence: h[t] = exp(g[t]) * h[t-1] + k[t]^T * (beta[t] * (v[t] - h[t-1] @ k[t]))
                // Note: Apply decay first, then compute retrieval and update

                // Step 1: Apply decay to h first: h = h * exp(g)
                // Clamp g to [-80, 80] to avoid numerical overflow
                for (int i = 0; i < head_dim; ++i) {
                    const float g_clamped = fminf(fmaxf(g[i], -80.0f), 80.0f);
                    const float exp_gi = expf(g_clamped);
                    for (int j = 0; j < head_dim; ++j) {
                        h[i * head_dim + j] *= exp_gi;
                    }
                }

                // Step 2: Compute h^T @ k -> hk_buf [head_dim]
                // hk_buf[j] = sum_i (h[i,j] * k[i]) which is column j of h dotted with k
                for (int j = 0; j < head_dim; ++j) {
                    float sum = 0.0f;
                    for (int i = 0; i < head_dim; ++i) {
                        sum += h[i * head_dim + j] * k[i];
                    }
                    hk_buf[j] = sum;
                }

                // Step 3: Compute delta = beta * (v - hk) and update h
                // h = h + outer(k, delta) where outer(k,delta)[i,j] = k[i] * delta[j]
                for (int i = 0; i < head_dim; ++i) {
                    for (int j = 0; j < head_dim; ++j) {
                        const float delta_j = beta * (v[j] - hk_buf[j]);
                        h[i * head_dim + j] += k[i] * delta_j;
                    }
                }

                // Step 4: Compute output y = h^T @ q -> [head_dim]
                // vLLM: b_o = tl.sum(b_h * b_q[:, None], 0) means o[j] = sum_i(h[i,j] * q[i])
                for (int j = 0; j < head_dim; ++j) {
                    float sum = 0.0f;
                    for (int i = 0; i < head_dim; ++i) {
                        sum += h[i * head_dim + j] * q[i];
                    }
                    y[j] = sum;
                }
                
                // Debug output
                if (do_debug && ih == 0 && it == 0 && i3 == 0) {
                    // Find max abs value in h for stability check
                    float h_max = 0.0f;
                    for (int i = 0; i < head_dim * head_dim; i++) {
                        if (fabsf(h[i]) > h_max) h_max = fabsf(h[i]);
                    }
                    fprintf(stderr, "DEBUG KDA: y[0]=%.6f, h_max=%.6f, exp(g[0])=%.6f\n",
                            y[0], h_max, expf(g[0]));
                }
            }
        }
    }

    free(hk_buf);
}

As you can see, aside from some hard-coded numbers, it is a cleaner implementation than the reshape and solve_tri used in pwilkin's and my backend-agnostic chunking implementations. So if your implementation is along these lines, then it is not surprising that it is much faster.

But this is a recurrent implementation; I'm not sure whether the chunked version of it can be faster or not.

@am17an (Collaborator) commented Mar 11, 2026

Yes, this is roughly what the current recurrent version in master looks like. We need to figure out the boundary between the chunked and autoregressive versions; clearly it's not 1, and it is also device-dependent.
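A hedged sketch of that boundary (the struct, names, and threshold value are illustrative assumptions, not proposed llama.cpp code): the crossover would become a per-device constant consulted at dispatch time.

```c
#include <assert.h>

// Sketch only: a per-device crossover for chunked vs. autoregressive GDN.
typedef struct {
    int gdn_chunk_threshold; // smallest n_tokens where chunked wins on this device
} device_ctx;

static int use_chunked_gdn(const device_ctx *dev, int n_tokens) {
    // Below the crossover the autoregressive kernel is faster; at or above
    // it, the chunked kernel amortizes its extra setup work.
    return n_tokens >= dev->gdn_chunk_threshold;
}
```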

ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 11, 2026
Change rq1 (tiled: head_id / rq1) to neq1 (interleaved: head_id % neq1)
to match the broadcast semantics from PR ggml-org#20340.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CISC (Member) commented Mar 11, 2026

> Happy to discuss it or not, up to you but this tone is not productive. If you don't want to collaborate, just say so and I'll refrain.

Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.

@ggerganov (Member, author)

Is it just me, or did the GitHub runners become very slow lately? I.e., it takes very long for jobs to get picked up.

@lhez (Collaborator) commented Mar 11, 2026

> Is it just me or did the Github runners become very slow lately? I.e. it takes very long to pick up the jobs.

I see the same. For me, it now takes about a day to finish all CI jobs.

@CISC (Member) commented Mar 11, 2026

> Is it just me or did the Github runners become very slow lately? I.e. it takes very long to pick up the jobs.

It is not just you; I've been regularly slaying queues once required tests have finished (or the PR has merged), just to ensure everything doesn't completely grind to a halt...

@ggerganov (Member, author)

Yeah, the trend is clear. I think we have to move as much as possible from the CI to self-hosted runners very soon.

Queue time for the last 6 months (charts attached in the original comment)

@ProgenyAlpha (Contributor) commented Mar 11, 2026

> Happy to discuss it or not, up to you but this tone is not productive. If you don't want to collaborate, just say so and I'll refrain.
>
> Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.

What exactly about my comment was tiresome?

I shouldn't have to explain this, but I have dyslexia. I use AI the same way someone might use spellcheck: to format and catch errors. That is the extent of it. Spellcheck works for short discussions like this, not technical posts. It's 100% my words, my concepts, and my posts, and I have strict guardrails in place to keep it that way.

Perhaps a CODEOFCONDUCT.md needs to be made, as am17an's response was unreasonable, and the time spent on this exchange and the sudden commentary on multiple PRs has wasted far more time than briefly reading my formatted technical post ever would have.

If this is how outside contributors are treated for using accessibility tools, I have no choice but to stop contributing to this project once I've completed my PRs.

This is what my post looks like without formatting tools:

Hey am17an ggerganov I noticed we're both hitting the same issue the crossover point where chunked beats autoregressive varies wildly depending on GPU head count state size and KDA vs non-KDA hardcoding a threshold or disabling chunked entirely for certain configs feels fragile what do you think about a lightweight runtime calibration approach in the backend dispatch layer the idea is on first use or device init run a quick microbenchmark fire both the AR and chunked kernels on a small synthetic input for each S_V KDA config the model needs find the crossover n_tokens where chunked starts winning cache the result per device so it only runs once at dispatch time just check n_tokens >= threshold something like this in the backend e.g. ggml-vulkan.cpp uint32_t gdn_chunk_threshold[3][2] size_idx kda in dispatch if n_tokens >= ctx->device->gdn_chunk_threshold[size_idx][kda] dispatch_chunked ctx subctx dst else dispatch_ar ctx subctx dst each backend Vulkan CUDA Metal would calibrate independently since the crossover is completely different per backend and hardware shaders stay untouched its purely a dispatch decision im happy to prototype this for Vulkan in my chunked PR #20377 if you think its worth pursuing just wanted to float the idea before putting in the work since it could also help on the CUDA side where KDA favors AR even at 512 tokens

If you really think this would be better than what I originally wrote... then I owe you ALL an apology.

@CISC (Member) commented Mar 11, 2026

> Try to follow our policy please, it is there for a reason, it is extremely tiresome for everyone every time it is not.
>
> What exactly about my comment was tiresome?

I'll reshuffle the words: it is extremely tiresome for everyone every time the policy is not followed.

@ggerganov ggerganov merged commit d28961d into master Mar 11, 2026
7 of 75 checks passed
@ggerganov ggerganov deleted the gg/llama-allow-gdn-ch branch March 11, 2026 20:47
@CISC (Member) commented Mar 11, 2026

> If you really think this would be better than what I originally wrote... then I owe you ALL an apology.

It's about perception: one does not feel valued in a conversation if it seems artificially one-sided. Nothing against your wording or your need for tools; TBH I find your original text just fine, just insert a few newlines. Have more faith in your skills. :)

@sultanqasim commented Mar 11, 2026

@ProgenyAlpha the issue is that when one sees a clearly AI-written, or at least AI-formatted, comment, it's hard to tell how much effort a human put into it, and whether or not it's worth human time and attention to read and respond to it. LLMs can quickly and cheaply produce millions of long, detailed, coherent-sounding comments that may or may not be bullshit. With the amount of LLM-generated content getting produced these days, including PRs and comments on repos, it becomes tiresome to read and understand everything that LLMs produce. A policy of requiring all comments to be written by a human makes it easier to decide whether it's worth another human spending their own time to read and respond.

Your writing is understandable as is, just use some punctuation, newlines, and perhaps backticks for inline code snippets.

@ProgenyAlpha (Contributor)

@sultanqasim Thank you for your feedback. I have no issue with the policy if someone is blatantly violating it; I'm not violating it. The spam concern is valid in the abstract but doesn't apply here. I did not submit a long, detailed, LLM-generated post, and all of the things you say my original post needed are exactly what was done to it.

I have open PRs, I'm a new, responsive, and friendly active contributor, and my comment was a direct technical proposal to two people about a problem they both named; it was personal and friendly, not AI.

What was NOT personal and friendly was am17an's response, which was far more egregious than what everyone seems to be trying to protect against.

@CISC Ironically, one does not feel valued for being unfairly nitpicked.

@am17an (Collaborator) commented Mar 12, 2026

> I'm not violating it

Yes you are. Please read https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Seems like you already acknowledged this in #20334 (comment)

You are also violating another policy

> If you are a new contributor, limit your open PRs to 1.

Sorry for being rude; I understand your intentions might be good. However, these policies exist so that maintainers do not get overloaded and to ensure that human-driven communication happens.

@ProgenyAlpha (Contributor) commented Mar 12, 2026

> Yes you are. Please read https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

Except I'm not. I wrote everything. Nothing was written by AI. Nowhere in CONTRIBUTING.md does it prohibit using AI to format a technical post before posting it.

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).
>
> Seems like you already acknowledged this in #20334 (comment)

Another missed detail. This comment on my PR was posted immediately following your rude comment here and I gave it a thumbs up out of respect for the POLITE way it was brought up and to keep things isolated to this PR.

> You are also violating another policy
>
> If you are a new contributor, limit your open PRs to 1.

This policy A) was written 3 days ago, after I started contributing weeks ago, and B) the additional PRs were explicitly made after I was directed to do so by @ggerganov and @0cc4m.

> Sorry for being rude, I understand your intentions might be good. However these policies exist for maintainers to not get overloaded and ensure human-driven communication happens.

I genuinely think your reaction, tone, and delivery have nothing to do with a desire to maintain the integrity of anything. Every communication from me has been human-driven, and to act like I'm responsible for overloading the team, after the drama you've created with your callous reply, is a bit silly.

I have dyslexia. Using accessibility tools is not the same as AI authorship, and conflating the two is something I'd hope a project of this caliber would understand. I hope we can move past this and focus on the work.

@am17an (Collaborator) commented Mar 12, 2026

> Except I'm not. I wrote everything. Nothing was written by AI. Nowhere in CONTRIBUTING.md does it prohibit using AI to format a technical post before posting it.

Perhaps it is not clear to you or your AI, but formatting a technical post counts as being written by AI. At this point I think you're being disingenuous, and I will not engage with you anymore. Good luck.

@ggml-org ggml-org locked as too heated and limited conversation to collaborators Mar 12, 2026
@ggml-org ggml-org unlocked this conversation Mar 12, 2026
@ProgenyAlpha (Contributor) commented Mar 12, 2026

> Perhaps it is not clear to you or your AI, but formatting a technical post counts as being written by AI. At this point I think you're being disingenuous and I will not engage with you anymore. Good luck.

Hard disagree, and your ignoring my other talking points and refusing to have a dialogue just reinforces that your sole focus here had nothing to do with maintaining human-to-human communication. I disclosed my disability to you, and you returned with a dismissive and dehumanizing "it's not clear to you and your AI" comment.

Anyone else find it ironic the thread was locked by an AI bot for being too heated? Lol, too funny.

ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* llama : enable chunked fused GDN path

* models : avoid Q and K repeats when using fused GDA

* cont : fix comment

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix the fix

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix

* metal : add GDN kernel (ggml-org#20361)

* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(ggml-org#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
  tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cont : unslop + optimize

* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* CUDA: AR gated delta net improvements (ggml-org#20391)

* Add FastDiv to gated_delta_net_cuda

* Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

* Remove unneded include in gated_delta_net.cu

* Improve comments

* Apply code-formating

* Make sharding HIP-compatible

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA

* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

* Rename variables

* Enable GDN also for prefill, move TODO for chunked_GDN

* Actually remove the TODO from 2068908

* Get warp size at runtime

warp_size is not known at compile time in hip host code.

* Don't expose ggml_cuda_get_physical_warp_size on host

---------

Co-authored-by: uvos <devnull@uvos.xyz>

* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
Adapt to the interleaved broadcast convention from ggml-org#20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
0cc4m pushed a commit that referenced this pull request Mar 12, 2026
* vulkan: add GATED_DELTA_NET op support

Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: optimize GATED_DELTA_NET shader (Phase 1)

- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: address review feedback for GATED_DELTA_NET

Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: add explicit FLOAT_TYPE casts for buffer loads

Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: fix Q/K broadcast for interleaved head layout

Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ZeroV0LT pushed a commit to ZeroV0LT/llama.cpp that referenced this pull request Mar 12, 2026
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by ggml-org#20340 (d28961d).
Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
ZeroV0LT pushed a commit to ZeroV0LT/llama.cpp that referenced this pull request Mar 12, 2026
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by ggml-org#20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
ggerganov pushed a commit that referenced this pull request Mar 13, 2026
…0468)

* llama : fix pooling assertion crash in chunked GDN detection path

The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.

* server : add mean pooling tests to embedding test suite

Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.

---------

Co-authored-by: Domenico Crupi <domenico@zerovolt.it>

Labels

Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)), ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), testing (Everything test related)
