
common/parser: add proper reasoning tag prefill reading#20424

Open
pwilkin wants to merge 5 commits into ggml-org:master from pwilkin:reasoning-prefill

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 11, 2026

This changes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template's behavior can change as a result (e.g. inserting <think> into the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now just detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER is also deleted, since it's just the special case where the opening marker is empty). However, the parser checks the assistant prefill for those markers and appends them to the input for the grammar and the parser, so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the <think> was added by the template or generated by the model.
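The actual change is C++ in common/parser; as a rough illustration of the normalization idea (hypothetical helper name, not the PR's code), the prefill markers are simply prepended to the model output so the parser sees one consistent stream:

```python
def normalize_for_parsing(prefill: str, output: str,
                          start_tag: str = "<think>",
                          end_tag: str = "</think>") -> str:
    """Prepend any reasoning markers found in the assistant prefill to the
    model output, so downstream parsing doesn't care whether the template
    or the model emitted the tags."""
    prefix = ""
    if start_tag in prefill:
        prefix += start_tag
        if end_tag in prefill:
            # Template prefilled an empty reasoning block (no-think mode).
            prefix += end_tag
    return prefix + output

# Template prefilled "<think>": the parser input becomes the tag plus the
# model's continuation, exactly as if the model had generated the tag itself.
normalize_for_parsing("<|im_start|>assistant\n<think>",
                      "some reasoning</think>the answer")
```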

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Fixes #20356
Fixes #20325
Fixes #20265

This also clears the ground for disabling grammar triggers inside reasoning loops in a subsequent PR, which would resolve #20260

@github-actions github-actions bot added the documentation, testing, examples, and server labels Mar 11, 2026
@aldehir
Collaborator

aldehir commented Mar 11, 2026

Dumb question, why not find the start of the assistant message and prepend that?

I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route I would look into prepending the start of the entire assistant message. This will also open the door for parsing output from requests with an assistant prefill.

@pwilkin
Contributor Author

pwilkin commented Mar 11, 2026

Yeah, that would be the logical conclusion, but for now it's easier for me just to extract the reasoning markers since finding the actual start of the assistant message is nontrivial.

@aldehir
Collaborator

aldehir commented Mar 11, 2026

Qwen3.5 uses <think>\n\n</think>\n\n:

{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}

however, this PR extracts:

      "reasoning_prefill": "<think></think>\n\n",

It probably doesn't matter for this model, but it is technically not adhering to the template.

@aldehir
Collaborator

aldehir commented Mar 11, 2026

    {
      "id": 248045,
      "piece": "<|im_start|>"
    },
    {
      "id": 74455,
      "piece": "assistant"
    },
    {
      "id": 198,
      "piece": "\n"
    },
    {
      "id": 248068,
      "piece": "<think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    },
    {
      "id": 248069,
      "piece": "</think>"
    },
    {
      "id": 271,
      "piece": "\n\n"
    }

Maybe set reasoning_prefill from the start of the opening tag to the end of the prompt?
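A sketch of that suggestion (hypothetical function, just illustrating the slicing): take everything from the last occurrence of the opening tag to the end of the rendered prompt, so the extracted prefill preserves the template's exact whitespace.

```python
def reasoning_prefill_from_prompt(prompt: str, open_tag: str = "<think>") -> str:
    """Return the suffix of the rendered prompt starting at the opening
    reasoning tag, or an empty string if the tag isn't present."""
    idx = prompt.rfind(open_tag)
    return prompt[idx:] if idx != -1 else ""

# For the Qwen3.5 no-think prompt above, this keeps the "\n\n" pieces intact.
reasoning_prefill_from_prompt("<|im_start|>assistant\n<think>\n\n</think>\n\n")
```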

@aldehir
Collaborator

aldehir commented Mar 11, 2026

finding the actual start of the assistant message is nontrivial.

Run the template once with add_generation_prompt = false, capture the size, then run it again with true and extract the string content that spans the delta. I think that would work in most cases.

@pwilkin
Contributor Author

pwilkin commented Mar 12, 2026

That usually works, yeah 😀 I can try that and see what the results are (this is what calculate_diff_split from the analyzer does, BTW). I'm just worried about some weird edge cases.

@bsdice

bsdice commented Mar 14, 2026

Nice patch! With the model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF, these patches fix the webui getting confused on /think and failing to correctly split the reasoning and generation parts. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1.

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 3bfb08f to 4083259 on March 14, 2026 14:49
@pwilkin
Contributor Author

pwilkin commented Mar 14, 2026

@aldehir changed the prefill extraction behavior to the differential one you mentioned.

std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
Collaborator

Think we should name this generation_prompt? It lines up with the add_generation_prompt flag.
