
common/parser: add proper reasoning tag prefill reading#20424

Open
pwilkin wants to merge 5 commits into ggml-org:master from pwilkin:reasoning-prefill

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 11, 2026

This changes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template's behavior can change as a result (e.g. inserting <think> into the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now just detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER is also deleted, since it's just the special case where the opening marker is empty). However, the parser checks the assistant prefill for those markers and appends them to the input for the grammar and the parser, so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the <think> was added by the template or generated by the model.
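The actual change is C++ in common/parser; as a rough illustration of the normalization idea (hypothetical helper name, not the PR's code), the prefill markers are simply prepended to the model output so the parser sees one consistent stream:

```python
def normalize_for_parsing(prefill: str, output: str,
                          start_tag: str = "<think>",
                          end_tag: str = "</think>") -> str:
    """Prepend any reasoning markers found in the assistant prefill to the
    model output, so downstream parsing doesn't care whether the template
    or the model emitted the tags."""
    prefix = ""
    if start_tag in prefill:
        prefix += start_tag
        if end_tag in prefill:
            # Template prefilled an empty reasoning block (no-think mode).
            prefix += end_tag
    return prefix + output

# Template prefilled "<think>": the parser input becomes the tag plus the
# model's continuation, exactly as if the model had generated the tag itself.
normalize_for_parsing("<|im_start|>assistant\n<think>",
                      "some reasoning</think>the answer")
```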

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Fixes #20356
Fixes #20325
Fixes #20265

This also clears the ground for disabling grammar triggers inside reasoning loops in a subsequent PR, which would resolve #20260

@github-actions github-actions bot added the documentation, testing, examples, and server labels Mar 11, 2026
@aldehir
Collaborator

aldehir commented Mar 11, 2026

Dumb question, why not find the start of the assistant message and prepend that?

I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route I would look into prepending the start of the entire assistant message. This will also open the door for parsing output from requests with an assistant prefill.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Yeah, that would be the logical conclusion, but for now it's easier for me just to extract the reasoning markers since finding the actual start of the assistant message is nontrivial.

@aldehir
Collaborator

aldehir commented Mar 11, 2026

Qwen3.5 uses <think>\n\n</think>\n\n:

{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}

however, this PR extracts:

      "reasoning_prefill": "<think></think>\n\n",

It probably doesn't matter for this model, but it is technically not adhering to the template.

@aldehir
Collaborator

aldehir commented Mar 11, 2026

    {
      "id": 248045,
      "piece": "<|im_start|>"
    },
    {
      "id": 74455,
      "piece": "assistant"
    },
    {
      "id": 198,
      "piece": "\n"
    },
    {
      "id": 248068,
      "piece": "<think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    },
    {
      "id": 248069,
      "piece": "</think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    }

Maybe set reasoning_prefill from the start of the opening tag to the end of the prompt?
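A sketch of that suggestion (hypothetical function, just illustrating the slicing): take everything from the last occurrence of the opening tag to the end of the rendered prompt, so the extracted prefill preserves the template's exact whitespace.

```python
def reasoning_prefill_from_prompt(prompt: str, open_tag: str = "<think>") -> str:
    """Return the suffix of the rendered prompt starting at the opening
    reasoning tag, or an empty string if the tag isn't present."""
    idx = prompt.rfind(open_tag)
    return prompt[idx:] if idx != -1 else ""

# For the Qwen3.5 no-think prompt above, this keeps the "\n\n" pieces intact.
reasoning_prefill_from_prompt("<|im_start|>assistant\n<think>\n\n</think>\n\n")
```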

@aldehir
Collaborator

aldehir commented Mar 11, 2026

finding the actual start of the assistant message is nontrivial.

Run the template once with add_generation_prompt = false, capture the size, then run it again with true and extract the string content that spans the delta. I think that would work in most cases.

@pwilkin
Contributor Author

pwilkin commented Mar 12, 2026

That usually works, yeah 😀 I can try that and see what the results are (this is what calculate_diff_split from the analyzer does, BTW). I'm just worried about some weird edge cases.

@bsdice

bsdice commented Mar 14, 2026

Nice patch! With the model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF, these patches fix the webui getting confused on /think and failing to correctly split the reasoning and generation parts. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1.

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 3bfb08f to 4083259 on March 14, 2026 14:49
@pwilkin
Contributor Author

pwilkin commented Mar 14, 2026

@aldehir changed the prefill extraction behavior to the differential one you mentioned.

std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
Collaborator

Think we should name this generation_prompt? It lines up with the add_generation_prompt flag.
