Reasoning representations and chat template specifications in major LLMs

This document is a technical reference on how Reasoning is represented in the chat templates of major LLMs (large language models). It is based on publicly available chat_template.jinja files, tokenizer_config.json files, encoder implementations, model cards, and official API documentation.

Definition: In this document, "Reasoning representation" refers not to the model's invisible internal reasoning process but to structures that appear explicitly in the template, tokenizer, or prompt. Concrete examples include tags such as <think>...</think> and [THINK]...[/THINK], and the analysis channel of the Harmony format.

Overview of the target models

As of April 5, 2026, the major model families for which the Reasoning specification can be identified and traced from public templates are the following nine.

gpt-oss-120b
LLM-jp-4-thinking
Gemma 4
DeepSeek-V3.2
Qwen3.5
Kimi K2.5
Phi-4-reasoning
GLM-5
Mistral 3 family (using Ministral-3-14B-Reasoning-2512 as the representative public template)

Note: For Llama 4 and OLMo 2, the existence of ordinary Instruct/Chat templates has been confirmed. However, since neither a Reasoning-specific public template specification nor an explicit parameter to control Reasoning Effort could be confirmed, they are excluded from the main comparison in this document and treated in an appendix.

1. Architectural premise: separating the API Surface and the Prompt Surface

To accurately evaluate and implement each model's production environment, this document classifies the input/output interface into the following two layers.

1.1 API Surface

This is the input payload that the caller sends to the serving system. A typical structure is as follows.

{
  "messages": [
    {"role": "user", "content": "Look up and summarize tomorrow's weather in Tokyo"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Returns the weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"}
          },
          "required": ["city"]
        }
      }
    }
  ]
}

1.2 Prompt Surface

This is the final form of the API Surface data after it has been serialized into the string and special-token sequence fed to the model. The structure of this Prompt Surface differs greatly from model to model. Representative patterns are as follows.

<think>...</think> family: GLM, Kimi, Qwen, DeepSeek, Phi
<|think|> or thought channel family: Gemma 4
[THINK]...[/THINK] family: Ministral Reasoning
Harmony channels (analysis / commentary / final) family: gpt-oss, LLM-jp-4-thinking

This document focuses primarily on the specification of this "Prompt Surface," and where necessary also notes the specification of API-layer control parameters (such as reasoning_effort or enable_thinking).

2. Standard input payload for evaluation

To compare, across models, the differences in the serialization process produced by each template, this document uses the following abstract input as a standard baseline.

2.1 Message input

[
  {"role": "user", "content": "Look up and summarize tomorrow's weather in Tokyo"}
]

2.2 Available tool definition

[
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Takes a city name and returns weather information",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string"}
        },
        "required": ["city"]
      }
    }
  }
]

Note: The code blocks shown in the following sections are outlines (condensed versions that extract the important tokens and structure) showing how this standard payload is processed and represented by each model's template.

3. Per-model specification summary

Below is an overview list of the public artifacts, explicit Reasoning representations, control parameters, and tool-call specifications for each major model family.

gpt-oss-120b
- Public artifact: openai/gpt-oss-120b/chat_template.jinja
- Reasoning representation: analysis channel
- Control parameter: reasoning_effort=low|medium|high
- Tool-call representation: Harmony call syntax
LLM-jp-4-thinking
- Public artifact: llm-jp/llm-jp-4-8b-thinking/chat_template.jinja
- Reasoning representation: Harmony analysis channel
- Control parameter: reasoning_effort=low|medium|high
- Tool-call representation: Harmony call syntax
Gemma 4
- Public artifact: google/gemma-4-E2B-it/chat_template.jinja
- Reasoning representation: <|think|> / thought channel
- Control parameter: enable_thinking=True|False
- Tool-call representation: <|tool>, <|tool_call>, <|tool_response>
DeepSeek-V3.2
- Public artifact: deepseek-ai/DeepSeek-V3.2/encoding/encoding_dsv32.py
- Reasoning representation: <think>...</think>
- Control parameter: thinking_mode="thinking"|"chat", drop_thinking
- Tool-call representation: DSML syntax (<｜DSML｜function_calls>)
Qwen3.5
- Public artifact: Qwen/Qwen3.5-35B-A3B/tokenizer_config.json
- Reasoning representation: <think>...</think>
- Control parameter: enable_thinking=False
- Tool-call representation: <tool_call><function=...><parameter=...>
Kimi K2.5
- Public artifact: moonshotai/Kimi-K2.5/chat_template.jinja
- Reasoning representation: <think>...</think>
- Control parameter: thinking on/off (default is on)
- Tool-call representation: tool_declare, <|tool_calls_section_begin|> family
Phi-4-reasoning
- Public artifact: microsoft/Phi-4-reasoning/tokenizer_config.json
- Reasoning representation: <think>...</think> (fixed system instruction)
- Control parameter: none applicable
- Tool-call representation: no tool-call syntax
GLM-5
- Public artifact: zai-org/GLM-5-FP8/chat_template.jinja
- Reasoning representation: <think>...</think>
- Control parameter: enable_thinking / thinking.type=disabled / clear_thinking
- Tool-call representation: <tools>...</tools>, <tool_call>...
Mistral 3 family
- Public artifact: mistralai/Ministral-3-14B-Reasoning-2512/chat_template.jinja
- Reasoning representation: [THINK]...[/THINK]
- Control parameter: the public template is fixed; the API side supports reasoning_effort
- Tool-call representation: [AVAILABLE_TOOLS], [TOOL_CALLS], [TOOL_RESULTS]

4. gpt-oss-120b specification details

4.1 Reference artifacts

Hugging Face: openai/gpt-oss-120b/chat_template.jinja
GitHub: openai/harmony README

The main design characteristic of this model is that it does not use the <think> tag for reasoning blocks; instead it adopts the analysis channel based on the Harmony protocol.

4.2 API specification

In addition to standard chat messages and tool definitions, it supports the following keyword arguments (kwargs) when applying the template.

builtin_tools (e.g., ["browser", "python"])
model_identity
reasoning_effort ("low", "medium", or "high")

Example implementation:

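from transformers import AutoTokenizer

# messages and tools are the standard payload defined in section 2.
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
tokenizer.apply_chat_template(
    messages,
    tools=tools,
    builtin_tools=["browser"],
    reasoning_effort="high",
)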

4.3 Prompt structure

The basic structure of the serialized prompt based on the Harmony format is as follows.

<|start|>system<|message|>
You are ChatGPT...
Knowledge cutoff: 2024-06
Current date: 2026-04-05
Reasoning: high
# Valid channels: analysis, commentary, final.
...
<|end|>
<|start|>developer<|message|>
# Tools
## functions
namespace functions {
  type get_weather = (_: { city: string }) => any;
}
<|end|>
<|start|>user<|message|>
Look up and summarize tomorrow's weather in Tokyo
<|end|>

Reasoning and tool calls take the following structure.

<|start|>assistant<|channel|>analysis<|message|>
Need a weather lookup.
<|end|>
<|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>
{"city":"Tokyo"}
<|call|>

The prompt structure for tool-execution results is as follows.

<|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>
"{\"weather\":\"sunny\"}"
<|end|>

The final response to the user is emitted as follows.

<|start|>assistant<|channel|>final<|message|>
Tomorrow Tokyo will be sunny. ...
<|end|>

4.4 Reasoning representation specification

This model does not expand the thinking process inside text tags; instead it splits processing into logical channels.

Reasoning: uses the analysis channel of the assistant message.
Tool integration: uses the commentary channel.
Final answer: uses the final channel.
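
The channel routing described here can be illustrated with a small parser. The following is only a minimal sketch, assuming that the completion text has the shape of the condensed outlines in section 4.3; the names HarmonyMessage and parse_harmony are hypothetical and are not part of the openai-harmony library, which should be preferred in production.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Header = text between <|start|> and <|message|>; the body runs until one of
# the terminators that appear in this document's outlines (<|end|>, <|call|>).
_MESSAGE_RE = re.compile(
    r"<\|start\|>(?P<header>.*?)<\|message\|>(?P<body>.*?)"
    r"(?P<stop><\|end\|>|<\|call\|>)",
    re.DOTALL,
)


@dataclass
class HarmonyMessage:  # hypothetical container, for illustration only
    role: str
    channel: Optional[str]
    recipient: Optional[str]
    content: str
    is_call: bool


def parse_harmony(text: str) -> List[HarmonyMessage]:
    """Split serialized Harmony text into messages with channel metadata."""
    messages = []
    for match in _MESSAGE_RE.finditer(text):
        role_part, _, channel_part = match.group("header").partition("<|channel|>")
        role_tokens = role_part.split()
        role = role_tokens[0] if role_tokens else ""
        recipient = None
        for token in role_tokens[1:]:
            if token.startswith("to="):
                recipient = token[len("to="):]
        channel_tokens = channel_part.split()
        channel = channel_tokens[0] if channel_tokens else None
        messages.append(
            HarmonyMessage(
                role=role,
                channel=channel,
                recipient=recipient,
                content=match.group("body").strip(),
                is_call=match.group("stop") == "<|call|>",
            )
        )
    return messages


sample = (
    "<|start|>assistant<|channel|>analysis<|message|>\n"
    "Need a weather lookup.\n<|end|>\n"
    "<|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>\n"
    '{"city":"Tokyo"}\n<|call|>'
)
msgs = parse_harmony(sample)
```

With this sketch, a client can route msgs by channel: analysis entries go to a reasoning log, entries with is_call set go to the tool executor, and final entries go to the user.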

4.5 Reasoning effort and mode specification

This model supports explicit Effort control as a template argument. The specified value of "low", "medium", or "high" is inserted directly into the prompt text as a Reasoning: ... directive inside the synthesized system message.

4.6 Tool-call specification

A strict Harmony grammar is applied to tool definitions and calls.

Tool definition: defined inside the developer message in TypeScript namespace form.
Call instruction: uses the form assistant to=functions.NAME.
Channel setting: specifies commentary json.
Arguments: constructed as a JSON payload.
Terminating tag: uses <|call|>.
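
As an illustration of the grammar elements listed above, the following minimal sketch renders one tool call as a string. The helper name render_harmony_call is hypothetical, and the whitespace around the JSON payload is an assumption: the outline in section 4.3 is condensed and does not fix the exact byte layout, so the real chat_template.jinja remains the source of truth.

```python
import json
from typing import Any, Dict


def render_harmony_call(name: str, arguments: Dict[str, Any]) -> str:
    """Render an assistant tool call using the elements named in section 4.6."""
    # Compact JSON without spaces, matching the outline's {"city":"Tokyo"} form.
    payload = json.dumps(arguments, separators=(",", ":"), ensure_ascii=False)
    return (
        f"<|start|>assistant to=functions.{name}"  # call instruction
        "<|channel|>commentary json"               # channel setting
        f"<|message|>{payload}"                    # JSON arguments
        "<|call|>"                                 # terminating tag
    )


rendered = render_harmony_call("get_weather", {"city": "Tokyo"})
```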

4.7 Management of reasoning history

For performance and context-length optimization, the template logic incorporates processing that drops past Chain of Thought (CoT) during inference. It is not designed to retain all analysis history as raw data.

4.8 Implementation notes

Integrating this model requires a parser and data model dedicated to the Harmony protocol. A generic <think> parser cannot be applied.
Tool Calling and Reasoning must not be mixed within the same assistant text block; they must be routed using the appropriate channels and destination attributes.

5. LLM-jp-4-thinking specification details

5.1 Reference artifacts

The representative public artifacts for this model are as follows.

Hugging Face: llm-jp/llm-jp-4-8b-thinking/chat_template.jinja
The corresponding model card

As an architectural characteristic, LLM-jp-4-thinking adopts a design that strongly aims for compatibility with the OpenAI Harmony specification.

5.2 API specification

When applying the chat template, it supports the following additional parameters (kwargs).

builtin_tools
model_identity
reasoning_effort="low"|"medium"|"high"

The example implementation published in the model card recommends specifying the parameter as in tokenizer.apply_chat_template(..., reasoning_effort="medium").

5.3 Prompt structure

It is highly similar to gpt-oss and is generally converted into the following Harmony-format structure.

<|start|>system<|message|>
You are LLM-jp-4...
Knowledge cutoff: 2025-12
Current date: 2026-04-05
Reasoning: medium
# Valid channels: analysis, commentary, final.
<|end|>
<|start|>developer<|message|>
# Tools
## functions
namespace functions {
  type get_weather = (_: { city: string }) => any;
}
<|end|>
<|start|>user<|message|>
Look up and summarize tomorrow's weather in Tokyo
<|end|>

Reasoning and tool-call requests are as follows.

<|start|>assistant<|channel|>analysis<|message|>
First get the weather
<|end|>
<|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>
{"city":"Tokyo"}
<|call|>

The final answer output is as follows.

<|start|>assistant<|channel|>final<|message|>
Tomorrow Tokyo will be sunny. ...
<|end|>

5.4 Reasoning representation specification

The reasoning process is represented not by <think> tags but as the analysis channel.
By running the tokenizer.parse_response(response) method, the reasoning part (thinking) and the final answer (content) can be separated.
The model card explicitly states that a dedicated tokenizer must be used.

5.5 Reasoning effort and mode specification

Like gpt-oss, LLM-jp-4-thinking is an architecture in which a multi-level Reasoning Effort can be specified as a parameter of the prompt template.

Allowed values: reasoning_effort="low"|"medium"|"high"
At template evaluation time, the corresponding parameter is embedded into the Reasoning: ... line inside the synthesized system message.

5.6 Tool-call specification

Tool calls conform to the Harmony-format grammar.

Definitions using TypeScript-style namespace functions
Call syntax: assistant to=functions.NAME
Channel used: commentary json
Terminating token: <|call|>
Response format: functions.NAME to=assistant

5.7 Handling of reasoning history

As explicitly stated in the template's comment block, the design discards past CoT (Chain of Thought) history during inference. This matches the approach of gpt-oss.

5.8 Implementation notes

Although the design follows the approach of the openai-harmony library, the tokenizer implementation is not completely identical. For safety, it is recommended to use the official tokenizer and parser provided by LLM-jp.
Public examples show that, depending on the configuration, running the provided Reasoning parser requires enabling the trust_remote_code=True option.

6. Gemma 4 specification details

6.1 Reference artifacts

This section references the following as representative artifacts of the Gemma 4 family.

google/gemma-4-E2B-it/chat_template.jinja
Official documentation: Thinking mode in Gemma

6.2 API specification

Enabling the reasoning feature is controlled using an argument of the apply_chat_template method.

tokenizer.apply_chat_template(messages, tools=tools, enable_thinking=True)

6.3 Prompt structure

When the reasoning feature is enabled in Gemma 4, it adopts an architecture in which a system turn is dynamically inserted. The serialized structure is as follows.

<|turn>system
<|think|><turn|>
<|tool>
declaration:get_weather{description:..., parameters:{city:string}}
<tool|>
<|turn>user
Look up and summarize tomorrow's weather in Tokyo<turn|>
<|turn>model

Note that the official documentation for the 26B / 31B models also describes a configuration in which a thought channel is opened in the model turn when reasoning is disabled.

6.4 Reasoning representation specification

The way reasoning is represented in Gemma 4 differs by model parameter size as follows.

E2B / E4B models: when reasoning is enabled, <|think|> is inserted at the beginning.
26B / 31B models: a representation form is documented that opens a thought channel after the model turn.
The provided template includes a strip_thinking macro that removes the reasoning part when reconstructing history.

These are based on a design philosophy of channel and turn control rather than plain-text-based tag insertion.

6.5 Reasoning effort and mode specification

This model's reasoning feature is controlled by the following boolean parameter.

enable_thinking=True|False

No multi-level Effort specification has been confirmed.

6.6 Tool-call specification

Tool calls use the following dedicated tag syntax.

Tool definition: <|tool> ... <tool|>
Tool call: <|tool_call>call:FUNC{arg:...}<tool_call|>
Tool-execution result: <|tool_response>response:FUNC{...}<tool_response|>

An example serialization for a call and its result is as follows.

<|tool_call>call:get_weather{city:<|"|>Tokyo<|"|>}<tool_call|>
<|tool_response>response:get_weather{weather:<|"|>sunny<|"|>}<tool_response|>
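
A client that consumes this syntax needs to recover the function name and arguments. The following is a minimal sketch that handles only flat string arguments wrapped in the <|"|> delimiter, exactly as in the example above; the function name parse_gemma_tool_calls is hypothetical, and nested or non-string arguments are deliberately out of scope because this document does not show their serialization.

```python
import re
from typing import Dict, List, Tuple

_CALL_RE = re.compile(
    r"<\|tool_call>call:(?P<name>[\w.-]+)\{(?P<args>.*?)\}<tool_call\|>",
    re.DOTALL,
)
# key:<|"|>value<|"|> pairs; only string-valued arguments are recognized.
_STRING_ARG_RE = re.compile(r'(?P<key>\w+):<\|"\|>(?P<value>.*?)<\|"\|>', re.DOTALL)


def parse_gemma_tool_calls(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (function name, string arguments) for each tool-call block."""
    calls = []
    for match in _CALL_RE.finditer(text):
        arguments = {
            arg.group("key"): arg.group("value")
            for arg in _STRING_ARG_RE.finditer(match.group("args"))
        }
        calls.append((match.group("name"), arguments))
    return calls


gemma_sample = '<|tool_call>call:get_weather{city:<|"|>Tokyo<|"|>}<tool_call|>'
gemma_calls = parse_gemma_tool_calls(gemma_sample)
```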

6.7 Implementation notes

Unlike earlier Gemma specifications, Gemma 4's public template has been updated to generate system turns for reasoning and tool calling.
When implementing, it is strongly recommended to reference the actual chat_template.jinja file as the source of truth for the specification.

7. DeepSeek-V3.2 specification details

7.1 Reference artifacts

The prompt-construction logic of this model is defined not by a Jinja template but by the following Python encoder implementation.

deepseek-ai/DeepSeek-V3.2/encoding/encoding_dsv32.py

7.2 API specification

By specifying the following parameters at encoding time, the reasoning behavior and history retention are controlled.

encode_messages(
    messages,
    tools=tools,
    thinking_mode="thinking",
    drop_thinking=True,
    add_default_bos_token=True,
)

thinking_mode: specify "thinking" or "chat".
drop_thinking: specifies whether to discard past reasoning history to save context size.

7.3 Prompt structure

The serialized structure of the input data is as follows.

<｜begin▁of▁sentence｜>
## Tools
...
<｜User｜>Look up and summarize tomorrow's weather in Tokyo<｜Assistant｜><think>

The following DSML syntax is used for tool calls.

<｜DSML｜function_calls>
  <｜DSML｜invoke name="get_weather">
    <｜DSML｜parameter name="city" string="true">Tokyo</｜DSML｜parameter>
  </｜DSML｜invoke>
</｜DSML｜function_calls>

Tool-execution results are returned in the following block structure.

<function_results>
  <result>{"weather":"sunny"}</result>
</function_results>
<think>

7.4 Reasoning representation specification

The reasoning process is stored in a <think>...</think> block.
As delimiters between user and assistant, <｜User｜> and <｜Assistant｜> are used respectively.
When the assistant's output completes, the reasoning content, answer content, tool calls, and the EOS token are concatenated.

7.5 Reasoning effort and mode specification

No multi-level Effort specification is supported; the following two modes are specified.

thinking_mode="thinking": enables reasoning mode.
thinking_mode="chat": enables a standard conversation mode with reasoning suppressed.
drop_thinking=True|False: controls whether to retain or discard past reasoning history.

7.6 Tool-call specification

DeepSeek-V3.2 uses its own DSML (DeepSeek Markup Language) syntax to call tools.

The function-call block is defined by <｜DSML｜function_calls>.
Each invoke is given a name attribute specifying the target tool.
Parameters are defined with <｜DSML｜parameter ...>, and for the string type, string="true" is explicitly specified.
List-type or object-type data is embedded in JSON format.
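
The DSML rules above can be sketched as a small regex-based parser. This is an illustration under stated assumptions, not the logic of encoding_dsv32.py: the function name parse_dsml_calls is hypothetical, and the sketch assumes that non-string parameters carry string="false" with a JSON-encoded body, which this document does not confirm.

```python
import json
import re
from typing import Any, Dict, List

D = "｜DSML｜"  # the marker uses full-width vertical bars, not ASCII "|"

_INVOKE_RE = re.compile(
    rf'<{D}invoke name="(?P<name>[^"]+)">(?P<body>.*?)</{D}invoke>',
    re.DOTALL,
)
_PARAM_RE = re.compile(
    rf'<{D}parameter name="(?P<key>[^"]+)" string="(?P<is_str>true|false)">'
    rf"(?P<value>.*?)</{D}parameter>",
    re.DOTALL,
)


def parse_dsml_calls(text: str) -> List[Dict[str, Any]]:
    """Extract tool calls from a DSML function_calls block."""
    calls = []
    for invoke in _INVOKE_RE.finditer(text):
        arguments: Dict[str, Any] = {}
        for param in _PARAM_RE.finditer(invoke.group("body")):
            raw = param.group("value")
            # string="true": keep the raw text. Otherwise (assumption) decode JSON.
            is_string = param.group("is_str") == "true"
            arguments[param.group("key")] = raw if is_string else json.loads(raw)
        calls.append({"name": invoke.group("name"), "arguments": arguments})
    return calls


dsml_sample = (
    "<｜DSML｜function_calls>\n"
    '  <｜DSML｜invoke name="get_weather">\n'
    '    <｜DSML｜parameter name="city" string="true">Tokyo</｜DSML｜parameter>\n'
    "  </｜DSML｜invoke>\n"
    "</｜DSML｜function_calls>"
)
dsml_calls = parse_dsml_calls(dsml_sample)
```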

7.7 Implementation notes

When confirming the prompt-construction specification, you must reference encoding_dsv32.py as the source of truth, not a Jinja template.
The client side requires a parser implementation dedicated to DeepSeek-V3.2.

8. Qwen3.5 specification details

8.1 Reference artifacts

This section references the following as representative artifacts of the Qwen3.5 family.

Qwen/Qwen3.5-35B-A3B/tokenizer_config.json
The model card and quickstart documentation for Qwen/Qwen3.5-9B

8.2 API specification

In Qwen3.5, the reasoning feature (Thinking) is enabled by default. To disable the reasoning feature, specify the following in the API request.

extra_body={"chat_template_kwargs": {"enable_thinking": False}}

8.3 Prompt structure

The structure of the serialized data fed to the model is as follows.

<|im_start|>system
# Tools
You may call one or more functions...
<tools>
[{"name":"get_weather", ...}]
</tools><|im_end|>
<|im_start|>user
Look up and summarize tomorrow's weather in Tokyo<|im_end|>
<|im_start|>assistant
<think>

At tool-call time, an XML-like proprietary block such as the following is emitted.

<tool_call>
<function=get_weather>
<parameter=city>
Tokyo
</parameter>
</function>
</tool_call>

Tool-execution results are incorporated into the prompt not as a tool role but as a <tool_response> block wrapped inside a user turn.

<|im_start|>user
<tool_response>
{"weather":"sunny"}
</tool_response><|im_end|>

8.4 Reasoning representation specification

The reasoning process is stored in a <think>...</think> block. When the reasoning feature is enabled, the assistant's generated output begins with the <think> tag, and after reasoning completes, the final answer is emitted.

8.5 Reasoning effort and mode specification

The only public specification confirmable at this point is toggling the reasoning feature on/off.

Default: reasoning feature enabled
Disable setting: enable_thinking=False

8.6 Tool-call specification

Tool calls use Qwen's own syntax.

The schema is defined in the <tools>...</tools> block inside the system prompt.
The assistant emits a <tool_call> block.
Tool-execution results are returned in a <tool_response> inside a user turn.
The reasoning block must be placed before the tool call; placing it after the call is not allowed.
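
To illustrate the call syntax, the following minimal sketch extracts the function name and parameters from a <tool_call> block of the shape shown in section 8.3. The function name parse_qwen_tool_calls is hypothetical, and parameter values are kept as plain strings; any type coercion against the tool schema is left out as an assumption-free simplification.

```python
import re
from typing import Dict, List

_FUNCTION_RE = re.compile(
    r"<tool_call>\s*<function=(?P<name>[^>\s]+)>(?P<body>.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# One optional newline is trimmed on each side of the parameter value.
_PARAMETER_RE = re.compile(
    r"<parameter=(?P<key>[^>\s]+)>\n?(?P<value>.*?)\n?</parameter>",
    re.DOTALL,
)


def parse_qwen_tool_calls(text: str) -> List[Dict[str, object]]:
    """Return one {"name", "arguments"} dict per <tool_call> block."""
    calls = []
    for function in _FUNCTION_RE.finditer(text):
        arguments: Dict[str, str] = {
            param.group("key"): param.group("value")
            for param in _PARAMETER_RE.finditer(function.group("body"))
        }
        calls.append({"name": function.group("name"), "arguments": arguments})
    return calls


qwen_sample = (
    "<tool_call>\n<function=get_weather>\n<parameter=city>\nTokyo\n"
    "</parameter>\n</function>\n</tool_call>"
)
qwen_calls = parse_qwen_tool_calls(qwen_sample)
```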

8.7 Implementation notes

The template is designed to automatically compress reasoning history that precedes the last user request.
In a self-hosting environment, you must incorporate Qwen-specific reasoning parsers and tool-call parsers.
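
The history compression mentioned above can be approximated on the client side. The sketch below interprets "compress" as removing <think>...</think> blocks from assistant messages that precede the last user message; that interpretation and the function name strip_old_reasoning are assumptions, and the actual template logic may differ in detail.

```python
import re
from typing import Dict, List

_THINK_BLOCK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_old_reasoning(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop <think> blocks from assistant turns before the last user turn."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    pruned = []
    for index, message in enumerate(messages):
        if message["role"] == "assistant" and index < last_user:
            # Copy the message so the caller's history is left untouched.
            message = {**message, "content": _THINK_BLOCK_RE.sub("", message["content"])}
        pruned.append(message)
    return pruned


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "<think>greet</think>\n\nHello!"},
    {"role": "user", "content": "Weather?"},
    {"role": "assistant", "content": "<think>need tool</think>\n\nChecking."},
]
pruned_history = strip_old_reasoning(history)
```

Reasoning that follows the last user message is kept, so an in-progress tool-calling turn still sees its own reasoning.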

9. Kimi K2.5 specification details

9.1 Reference artifacts

Hugging Face: moonshotai/Kimi-K2.5/chat_template.jinja
Official documentation: Kimi API Platform (thinking model guide)

9.2 API specification

Kimi K2.5 has Thinking mode enabled by default. To disable it, make the following specification via the extra_body parameter of the OpenAI-compatible API.

extra_body={"thinking": {"type": "disabled"}}

When Thinking mode is enabled, the design assumes that message.reasoning_content and message.content are obtained separately from the response payload.

9.3 Prompt structure

The abstract input based on "2. Standard input payload for evaluation" of this document is serialized into the following form.

<|im_system|>tool_declare<|im_middle|>
[{"name":"get_weather", ...}]
<|im_end|>
<|im_user|>user<|im_middle|>
Look up and summarize tomorrow's weather in Tokyo
<|im_end|>
<|im_assistant|>assistant<|im_middle|>
<think>

Tool calls are emitted as the following dedicated block inside the assistant turn.

<think>Get the weather first</think>
<|tool_calls_section_begin|>
<|tool_call_begin|>get_weather
<|tool_call_argument_begin|>{"city":"Tokyo"}
<|tool_call_end|>
<|tool_calls_section_end|>

Tool-execution results are assigned to the tool role, but in the prompt they are formatted as text with the following label.

## Return of call_1
{"weather":"sunny"}

9.4 Reasoning representation specification

The assistant's reasoning process is stored inside a <think>...</think> block.
When Thinking mode is enabled, generation begins with <think>; when it is disabled, an empty <think></think> block is placed at the beginning instead.
The Kimi template incorporates compression logic that intentionally discards the assistant's Reasoning from past turns and retains only the most recent Reasoning.

9.5 Reasoning effort and mode specification

The only control parameter confirmable in the public specification is toggling Thinking mode on/off.

Default state: Thinking mode enabled (on)
Disable specification: thinking.type="disabled"
A graduated Effort specification (low/medium/high, etc.) is not supported by the current public template or the official API documentation.

9.6 Tool-call specification

Kimi K2.5 adopts a distinctive tool-definition and call representation.

Tool definition: declared inside a system turn using tool_declare.
Tool call: uses a dedicated block starting from <|tool_calls_section_begin|>.
Call arguments: constructed by a sequence of special tokens: tool_call_begin, tool_call_argument_begin, and tool_call_end.
Tool-execution result: handled as the tool role at the API layer, but converted into plain text of the form Return of <id> inside the prompt.
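
The special-token sequence described above can be parsed with a short sketch like the following. It assumes that the tokens are written as <|tool_call_begin|>, <|tool_call_argument_begin|>, and <|tool_call_end|>, and that the text after the begin token is the call identifier (shown as get_weather in the condensed outline of section 9.3). The function name parse_kimi_tool_calls is hypothetical.

```python
import json
import re
from typing import Any, Dict, List

_SECTION_RE = re.compile(
    r"<\|tool_calls_section_begin\|>(.*?)<\|tool_calls_section_end\|>",
    re.DOTALL,
)
_CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(?P<name>.*?)"
    r"<\|tool_call_argument_begin\|>(?P<args>.*?)"
    r"<\|tool_call_end\|>",
    re.DOTALL,
)


def parse_kimi_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract calls from every tool-calls section in the assistant output."""
    calls = []
    for section in _SECTION_RE.finditer(text):
        for call in _CALL_RE.finditer(section.group(1)):
            calls.append(
                {
                    "name": call.group("name").strip(),
                    # json.loads tolerates the surrounding whitespace/newlines.
                    "arguments": json.loads(call.group("args")),
                }
            )
    return calls


kimi_sample = (
    "<think>Get the weather first</think>\n"
    "<|tool_calls_section_begin|>\n"
    "<|tool_call_begin|>get_weather\n"
    '<|tool_call_argument_begin|>{"city":"Tokyo"}\n'
    "<|tool_call_end|>\n"
    "<|tool_calls_section_end|>"
)
kimi_calls = parse_kimi_tool_calls(kimi_sample)
```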

9.7 Implementation notes

When using Thinking mode, it is recommended to include the complete reasoning_content of past turns when resending the history.
Because the template published on Hugging Face itself compresses some of the Reasoning data in the history, take care not to conflate history management on the API-client side with the compression policy at the template layer.

10. Phi-4-reasoning specification details

10.1 Reference artifacts

Hugging Face: microsoft/Phi-4-reasoning/tokenizer_config.json
Model card

This model adopts an architecture that always injects a fixed system prompt into a template conforming to the ChatML format.

10.2 API specification

The input interface is the standard ChatML-format messages. Importantly, a dedicated fixed system instruction is automatically inserted at the beginning when the tokenizer template is applied.

10.3 Prompt structure

The serialized structure of the prompt is as follows.

<|im_start|>system<|im_sep|>
[Fixed reasoning instruction. Separates Thought and Solution,
prompting to answer in the form <think>{Thought}</think>{Solution}]
<|im_end|>
<|im_start|>user<|im_sep|>
Look up and summarize tomorrow's weather in Tokyo
<|im_end|>
<|im_start|>assistant<|im_sep|>

10.4 Reasoning representation specification

The Reasoning content is written inside a <think>...</think> block.
The output format is not specified by the user through a system message in the API request; it is enforced by the system prompt that the tokenizer template inserts automatically.
<think> and </think> are registered as dedicated tokens in the tokenizer vocabulary.

10.5 Reasoning effort and mode specification

This model does not support multi-level Effort control parameters (low, medium, high, etc.) or an enable/disable switch. It always exhibits fixed behavior that outputs the "Thought + Solution" form.

10.6 Tool-call specification

By the specification of the official template and model card, no native Tool Calling grammar specialized for Phi-4-reasoning is defined.

Recommended configuration for system integration:

Implement the Tool Calling logic in an external agent layer.
Position the model itself as a reasoning-only component that generates <think>...</think>.

10.7 Implementation notes

Modifying or removing the leading fixed system prompt inserted by the template risks breaking the model's expected output format.
When building a parser in a self-hosting environment, the Reasoning parser component of the DeepSeek-R1 family can be repurposed.

11. GLM-5 specification details

11.1 Reference artifacts

Hugging Face: zai-org/GLM-5-FP8/chat_template.jinja
Official documentation: GLM-5 overview / Thinking Mode / Function Calling

11.2 API specification

It supports OpenAI-compatible API requests. In the request payload, you can specify messages, tools, tool_choice, tool_calls, and reasoning_content.

To disable Thinking mode, specify the following parameter.

{"thinking": {"type": "disabled"}}

Also, to re-submit multi-turn conversation history while retaining the Reasoning process, specify clear_thinking=false.

11.3 Prompt structure

The abstract input based on "2. Standard input payload for evaluation" of this document is serialized into the following form.

[gMASK]<sop>
<|system|>
# Tools
You may call one or more functions...
<tools>
[{"name":"get_weather", ...}]
</tools>
<|user|>
Look up and summarize tomorrow's weather in Tokyo
<|assistant|>
<think>

When a tool call is executed, the following structure is inserted into the assistant's output.

<think>First use the weather tool</think>
<tool_call>
get_weather
<arg_key>city</arg_key><arg_value>Tokyo</arg_value>
</tool_call>

Tool-execution results are re-serialized as a message corresponding to the tool role in the following form.

<|observation|>
<tool_response>{"weather":"sunny"}</tool_response>

11.4 Reasoning representation specification

The reasoning process is stored inside a <think>...</think> block.
The template preferentially evaluates the presence of message.reasoning_content, and if there is no such data, it attempts to extract the <think>...</think> block from assistant.content.
Depending on whether Thinking mode is enabled, the generation start token is <|assistant|><think> (when enabled) or <|assistant|></think> (when disabled).
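
The lookup order described above (reasoning_content first, then a <think>...</think> block inside the content) can be sketched in Python as follows. This is an illustrative approximation of the behavior described in this section, not a transcription of the Jinja template; the function name split_glm_reasoning is hypothetical.

```python
from typing import Any, Dict, Tuple


def split_glm_reasoning(message: Dict[str, Any]) -> Tuple[str, str]:
    """Return (reasoning, answer) for one assistant message."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    # 1. An explicit reasoning_content field takes precedence.
    if isinstance(reasoning, str):
        return reasoning, content
    # 2. Otherwise fall back to a <think>...</think> block inside the content.
    if "</think>" in content:
        head, _, tail = content.partition("</think>")
        return head.split("<think>")[-1].strip("\n"), tail.lstrip("\n")
    # 3. No reasoning found.
    return "", content


with_field = split_glm_reasoning(
    {"role": "assistant", "reasoning_content": "use tool", "content": "Done"}
)
with_tags = split_glm_reasoning(
    {"role": "assistant", "content": "<think>First use the weather tool</think>\nSunny."}
)
```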

11.5 Reasoning effort and mode specification

GLM-5 does not provide graduated parameters (low/medium/high, etc.) to specify the depth of Reasoning. It supports only enabling/disabling Thinking mode and whether to retain history.

Local environment / template: enable_thinking=True|False
Hosted API: thinking.type="disabled"
Multi-turn history retention: clear_thinking=false

11.6 Tool-call specification

There is a structural difference between the API request and the internal prompt.

The API input layer uses the OpenAI-compatible tools / tool_calls format.
When converted into the model input prompt, it is reconstructed into <tools>...</tools> and <tool_call>...</tool_call> blocks.
Tool-execution results are handled as <tool_response>...</tool_response> blocks.
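
For the prompt-side <tool_call> block, a minimal parsing sketch might look like the following. It assumes the layout shown in section 11.3 (function name on the first line, followed by <arg_key>/<arg_value> pairs) and keeps all values as strings; the function name parse_glm_tool_calls is hypothetical.

```python
import re
from typing import Dict, List

_CALL_RE = re.compile(
    r"<tool_call>\s*(?P<name>[^\s<]+)(?P<body>.*?)</tool_call>",
    re.DOTALL,
)
_ARG_RE = re.compile(
    r"<arg_key>(?P<key>.*?)</arg_key>\s*<arg_value>(?P<value>.*?)</arg_value>",
    re.DOTALL,
)


def parse_glm_tool_calls(text: str) -> List[Dict[str, object]]:
    """Return one {"name", "arguments"} dict per <tool_call> block."""
    calls = []
    for call in _CALL_RE.finditer(text):
        arguments: Dict[str, str] = {
            arg.group("key"): arg.group("value")
            for arg in _ARG_RE.finditer(call.group("body"))
        }
        calls.append({"name": call.group("name"), "arguments": arguments})
    return calls


glm_sample = (
    "<think>First use the weather tool</think>\n"
    "<tool_call>\nget_weather\n"
    "<arg_key>city</arg_key><arg_value>Tokyo</arg_value>\n"
    "</tool_call>"
)
glm_calls = parse_glm_tool_calls(glm_sample)
```

Converting these string values back into the OpenAI-compatible tool_calls format (for example, decoding numbers or objects) would additionally require the tool schema.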

11.7 Implementation notes

If you have a requirement to continue Reasoning across multiple turns, explicitly specify clear_thinking=false.
When building a self-hosting environment, you need to implement dedicated Reasoning parsers and tool parsers specialized for the GLM family.

12. Mistral 3 family specification details (representative model: Ministral-3-14B-Reasoning-2512)

12.1 Reference model and selection criteria

To verify the internal template specification of Reasoning in Mistral-family models, this section uses the open-weight model mistralai/Ministral-3-14B-Reasoning-2512 as the reference. The Reasoning feature is also provided on the Hosted API side, but since the internal prompt is not public, verifying the detailed specification requires referencing the open-weight template.

12.2 API specification

The input specification differs depending on the usage environment.

Open-weight model: uses the standard messages and tools parameters.
Hosted API models:
- mistral-small-latest: graduated adjustment via the reasoning_effort parameter is supported.
- magistral-small-latest / magistral-medium-latest: these are native Reasoning models, and Thinking processing is always performed.

12.3 Prompt structure

The serialized structure of the prompt based on the standard input is as follows.

[SYSTEM_PROMPT]...[/SYSTEM_PROMPT]
[AVAILABLE_TOOLS][{"name":"get_weather",...}][/AVAILABLE_TOOLS]
[INST]Look up and summarize tomorrow's weather in Tokyo[/INST]

The assistant's reasoning and answer process is constructed in the following order.

[THINK]First get the weather[/THINK]
[TOOL_CALLS]get_weather[ARGS]{"city":"Tokyo"}

The prompt structure for tool-execution results is as follows.

[TOOL_RESULTS]{"weather":"sunny"}[/TOOL_RESULTS]

12.4 Reasoning representation specification

In the open-weight template, the Reasoning process is stored inside a [THINK]...[/THINK] block.
The system prompt is structured on the premise that the model outputs Reasoning.
The assistant content is split into thinking chunks and text chunks.

12.5 Reasoning effort and mode specification

Parameter support differs by deployment method.

Open-weight model: the public chat_template.jinja has no dynamic parameters such as low|medium|high, and it functions as a fixed Reasoning mode.
Hosted API models: mistral-small-latest allows adjustment via reasoning_effort. Also, native models (magistral-*) output Thinking chunks without any additional parameter specification.

12.6 Tool-call specification

The Ministral-3-Reasoning template defines the tool interface using dedicated delimiter strings.

Definition: [AVAILABLE_TOOLS]...[/AVAILABLE_TOOLS]
Call: [TOOL_CALLS]name[ARGS]{json}
Execution result: [TOOL_RESULTS]...[/TOOL_RESULTS]
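
The bracket-based delimiters can be handled with a sketch like the one below, which extracts the [THINK] block and any [TOOL_CALLS]name[ARGS]{json} calls from assistant output. It treats the delimiters as plain strings in decoded text, which is an assumption; the function name parse_mistral_assistant is hypothetical and is not part of any Mistral library.

```python
import json
import re
from typing import Any, Dict

_THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)
_CALL_HEAD_RE = re.compile(r"\[TOOL_CALLS\](?P<name>[^\[\s]+)\[ARGS\]")
_DECODER = json.JSONDecoder()


def parse_mistral_assistant(text: str) -> Dict[str, Any]:
    """Split assistant output into thinking text and tool calls."""
    thinking = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    calls = []
    for head in _CALL_HEAD_RE.finditer(text):
        # raw_decode reads exactly one JSON value starting right after [ARGS],
        # so nested braces inside the arguments are handled correctly.
        arguments, _end = _DECODER.raw_decode(text, head.end())
        calls.append({"name": head.group("name"), "arguments": arguments})
    return {"thinking": thinking, "tool_calls": calls}


mistral_sample = (
    "[THINK]First get the weather[/THINK]\n"
    '[TOOL_CALLS]get_weather[ARGS]{"city":"Tokyo"}'
)
mistral_parsed = parse_mistral_assistant(mistral_sample)
```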

12.7 Implementation notes

Note that the Mistral-family tags are not the common XML-style <think>, but the bracket-based [THINK].
Because it is not public how the Hosted API's reasoning_effort parameter is processed into the internal prompt, when implementing a strict parser it is recommended to conform to the open-weight specification.

13. Cross-comparison: major patterns

The architectural patterns of each model's Reasoning representation broadly fall into the following three categories.

13.1 Inline-tag configuration (<think>...</think> type)

A configuration that embeds the reasoning process directly into the prompt and completion as text-data tags.

Target models: GLM-5, Kimi K2.5, Qwen3.5, DeepSeek-V3.2, Phi-4-reasoning
Characteristics:
- The parser is relatively easy to implement.
- However, the Tool Calling grammar adopts each vendor's own specification and is not standardized.
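
To show why the parser for this family is comparatively simple, here is a minimal sketch that separates reasoning from the answer. It also accepts completions where the opening <think> tag is missing because the template already placed it at the end of the prompt, as in the Qwen3.5, Kimi K2.5, and GLM-5 outlines. The function name split_think is hypothetical, and a completion with no closing tag is treated entirely as the answer, which is a simplifying assumption.

```python
from typing import Tuple


def split_think(completion: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from an inline-tag completion."""
    head, closing, tail = completion.partition("</think>")
    if not closing:
        # No closing tag: assume the model produced no reasoning block.
        return "", completion.strip()
    # Drop the opening tag if present; it may already be part of the prompt.
    reasoning = head.split("<think>", 1)[-1]
    return reasoning.strip(), tail.strip()


full = split_think("<think>plan</think>Sunny.")
prompt_opened = split_think("plan\n</think>\n\nSunny.")
no_reasoning = split_think("Just an answer")
```

Tool-call blocks that follow the reasoning remain in the answer part and still need the vendor-specific parsers described in the per-model sections.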

13.2 Dedicated delimiter-token configuration

A configuration that separates the reasoning process by specific special tokens or channel-like delimiters.

Target models: Ministral-3-Reasoning ([THINK]...[/THINK]), Gemma 4 (<|think|> or thought channel)
Characteristics:
- The delimiters are strictly defined as special tokens.
- A standard <think> parser may not be reusable without modification.

13.3 Multi-channel / Harmony configuration

A configuration that treats reasoning and tool calls not as text tags but as channels over an independent communication protocol.

Target models: gpt-oss-120b, LLM-jp-4-thinking
Characteristics:
- Reasoning process: uses the analysis channel
- Tool calls: use the commentary channel
- Final answer: uses the final channel
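The channel routing above can be sketched as follows. The token strings (<|channel|>, <|message|>, <|end|>, <|return|>, <|call|>) are assumptions about the Harmony text framing, and the function name is hypothetical. A production stack would more likely operate on token IDs than on decoded text.

```python
import re

# Assumed framing: <|channel|>NAME [header] <|message|>BODY <terminator>
MSG_RE = re.compile(
    r"<\|channel\|>(\w+)(.*?)<\|message\|>(.*?)"
    r"(?:<\|end\|>|<\|return\|>|<\|call\|>|$)",
    re.DOTALL,
)


def route_channels(raw: str) -> dict[str, list[str]]:
    """Group message bodies by channel name."""
    routed = {"analysis": [], "commentary": [], "final": []}
    for channel, _header, body in MSG_RE.findall(raw):
        # Unknown channel names are kept rather than dropped.
        routed.setdefault(channel, []).append(body)
    return routed
```

Only the "final" entries would normally be shown to the end user. The header captured between the channel name and <|message|> (for example a tool recipient) is ignored in this sketch.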

14. Cross-comparison: Reasoning Effort parameter specification

The control interface for reasoning depth (Reasoning Effort) falls into the following patterns depending on the model.

14.1 Prompt-exposed type (multi-level specification)

Target models: gpt-oss-120b, LLM-jp-4-thinking
Specification:
- Supports explicit parameter input of low|medium|high.
- The specified value is passed as a template argument (kwargs) and is expanded directly inside the synthesized system message in the form Reasoning: high.
- The parser and serving environment generally assume the Harmony specification.
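A minimal sketch of the expansion described above, where the effort value ends up as a "Reasoning: high" line in the synthesized system message. The function name, the base text, and the default value are hypothetical. The real templates assemble a longer system message.

```python
VALID_EFFORTS = ("low", "medium", "high")


def render_system_message(
    reasoning_effort: str = "medium",  # hypothetical default
    base: str = "You are a helpful assistant.",  # hypothetical base text
) -> str:
    """Append the effort level to the system message as 'Reasoning: <level>'."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {VALID_EFFORTS}, got {reasoning_effort!r}"
        )
    return f"{base}\nReasoning: {reasoning_effort}"
```

Validating the value before rendering keeps an unsupported level from silently reaching the prompt.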

14.2 Binary-toggle type (enable / disable)

Target models: GLM-5, Kimi K2.5, Qwen3.5, Gemma 4
Specification:
- Controls whether the Reasoning channel is emitted at all, not the reasoning depth.
- Typical settings are enable_thinking=False or thinking.type=disabled.
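The two parameter shapes mentioned above can be produced from one boolean, as in this sketch. The style names, the chat_template_kwargs wrapper key, and the "enabled" value are assumptions for illustration. Which model family expects which shape should be checked against its own documentation.

```python
def thinking_toggle(enabled: bool, style: str) -> dict:
    """Build a request fragment for a binary thinking toggle.

    style is a hypothetical selector:
      "template_kwarg" -> enable_thinking passed as a template argument
      "typed_object"   -> thinking.type set to enabled/disabled
    """
    if style == "template_kwarg":
        return {"chat_template_kwargs": {"enable_thinking": enabled}}
    if style == "typed_object":
        return {"thinking": {"type": "enabled" if enabled else "disabled"}}
    raise ValueError(f"unknown toggle style: {style!r}")
```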

14.3 Mode-enumeration type

Target model: DeepSeek-V3.2
Specification:
- On the API interface, the mode is selected as an Enum value such as thinking_mode="thinking"|"chat".
- Compression of the reasoning history is controlled by a separate parameter (drop_thinking).
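One way to model the two independent settings above is an enum plus a separate boolean, as sketched here. The class names and the default values are hypothetical. Only the parameter names thinking_mode and drop_thinking and the two mode strings come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class ThinkingMode(str, Enum):
    THINKING = "thinking"
    CHAT = "chat"


@dataclass(frozen=True)
class EncodeOptions:
    thinking_mode: ThinkingMode = ThinkingMode.CHAT  # hypothetical default
    drop_thinking: bool = True  # hypothetical default


def parse_options(raw: dict) -> EncodeOptions:
    """Validate raw request options; an unknown mode raises ValueError."""
    return EncodeOptions(
        thinking_mode=ThinkingMode(raw.get("thinking_mode", "chat")),
        drop_thinking=bool(raw.get("drop_thinking", True)),
    )
```

Keeping drop_thinking outside the enum reflects that history handling is configured independently of the mode.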

14.4 Static configuration (fixed type)

Target models: Phi-4-reasoning, open-weight Ministral-3-Reasoning
Specification:
- The Reasoning format is statically defined (hard-coded) inside the prompt template.
- The template does not anticipate adjusting the reasoning depth (Effort) dynamically through external parameters.

15. Cross-comparison: tool-call syntax specification

The tool-call (Tool Calling) format in each model is classified, based on architecture, into the following four major patterns.

XML / pseudo-XML type: defined using XML-like tags such as <tools> or <tool_call>. The applicable models are GLM-5, Qwen3.5, and DeepSeek-V3.2 (DSML).
Dedicated special-token type: defines the structure using model-specific special system tokens (e.g., <|tool|>). The applicable models are Kimi K2.5 and Gemma 4.
Delimiter-string type: delimits blocks using a specific string such as [AVAILABLE_TOOLS]. The applicable model is Ministral-3-Reasoning.
Protocol / channel type: an advanced configuration that splits messages into logical channels for processing. The applicable models are gpt-oss-120b and LLM-jp-4-thinking.

Note: A tool-call-capable server should not rely solely on the OpenAI-compatible JSON format (messages, tools, tool_calls). The syntax actually fed to the model differs greatly from model to model, so model-specific parsers must be implemented on the backend.
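A backend can select a parser per model with a small registry, as in this sketch. The classification table mirrors the four patterns above. The syntax labels, the function names, and the registry design are hypothetical.

```python
from typing import Callable

ToolParser = Callable[[str], list]

# Classification copied from the four patterns listed above.
TOOL_SYNTAX = {
    "GLM-5": "xml",
    "Qwen3.5": "xml",
    "DeepSeek-V3.2": "xml",
    "Kimi K2.5": "special_token",
    "Gemma 4": "special_token",
    "Ministral-3-Reasoning": "delimiter",
    "gpt-oss-120b": "channel",
    "LLM-jp-4-thinking": "channel",
}

_PARSERS: dict[str, ToolParser] = {}


def register_parser(syntax: str, parser: ToolParser) -> None:
    """Register the parser implementation for one syntax pattern."""
    _PARSERS[syntax] = parser


def parser_for(model: str) -> ToolParser:
    """Look up the parser for a model, failing loudly when none exists."""
    syntax = TOOL_SYNTAX.get(model)
    if syntax is None:
        raise KeyError(f"unknown model: {model}")
    if syntax not in _PARSERS:
        raise NotImplementedError(f"no parser registered for syntax {syntax!r}")
    return _PARSERS[syntax]
```

Models that share a pattern still differ in detail, so a real registry would probably be keyed by model rather than by pattern. The pattern key is used here only to keep the example short.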

16. Implementation best practices and notes

The following technical notes and constraints apply when integrating these models into a system.

16.1 Limitations on template compatibility

Even among models with a Reasoning feature, prompt-template compatibility is extremely limited. GLM, Qwen, and DeepSeek all use the <think> tag in common, but the tool-call syntax, the history-compression algorithm, and the way the generation prompt is constructed all differ.

16.2 Handling of Harmony-family models

Harmony-family models, represented by gpt-oss and LLM-jp, have a fundamentally different architecture from the other models. A simple text parser for the <think> tag cannot be repurposed for them; they require a dedicated, channel-aware protocol stack.

16.3 Separating the reasoning feature from the effort parameter

Having a Reasoning feature and exposing an API for controlling reasoning depth (Effort) are separate properties, and both depend on the model family.

Reasoning feature present, no Effort specification: Phi-4, Ministral (open-weight version), DeepSeek (thinking mode)
Reasoning feature present, binary enable/disable control: GLM, Kimi, Qwen, Gemma
Reasoning feature present, multi-level Effort specification: gpt-oss, LLM-jp, Mistral API (managed version)

16.4 Specification for re-submitting context history

Whether Reasoning content from past turns is retained or discarded when the conversation history is re-submitted differs by model as follows.

Discards past Reasoning from the history at inference time: gpt-oss, LLM-jp, Kimi, Qwen, DeepSeek (when the drop_thinking=True parameter is specified)
Retention via an explicit parameter: GLM (when the clear_thinking=false parameter is specified)
Control via a fixed system prompt: Phi-4
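The discard behavior can be approximated on the client side as in the sketch below. It assumes that reasoning is stored in a separate reasoning_content field on assistant messages and that only turns before the last user message are stripped. Both the field name and the policy are assumptions, and the actual templates differ in which turns they keep.

```python
def drop_past_reasoning(messages: list[dict]) -> list[dict]:
    """Return a copy of the history without reasoning on earlier assistant turns."""
    # Index of the most recent user message, or -1 if there is none.
    last_user = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=-1,
    )
    cleaned = []
    for i, message in enumerate(messages):
        copy = dict(message)  # shallow copy so the caller's history is untouched
        if copy.get("role") == "assistant" and i < last_user:
            copy.pop("reasoning_content", None)  # hypothetical field name
        cleaned.append(copy)
    return cleaned
```

Assistant turns after the last user message keep their reasoning, which matters for multi-step tool use within a single turn.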

17. Appendix: models out of scope for this document

The following models are excluded from the comparison because, at the time of this survey, no Reasoning-specific specification could be confirmed for them.

Llama 4: The existence of Instruct models (such as meta-llama/Llama-4-Scout-17B-16E-Instruct) is confirmed, but no public information on a Reasoning-specific chat template, Effort specification, or tool syntax could be confirmed.
OLMo 2: Ordinary chat templates (such as allenai/OLMo-2-1124-13B-Instruct) are provided, but no specification for a Reasoning-specific template or Effort parameter is public.

18. Summary

Based on the architectural characteristics covered in this document, the models can be grouped as follows.

Model group with the clearest text-based Reasoning representation: GLM-5, Kimi K2.5, Qwen3.5, DeepSeek-V3.2, Phi-4-reasoning
Model group that is protocol-oriented and requires a dedicated parser implementation: gpt-oss-120b, LLM-jp-4-thinking
Model group with its own tool syntax and high parser-implementation difficulty: DeepSeek-V3.2 (DSML), Gemma 4, Kimi K2.5
Model group with an explicit Reasoning Effort API specification: gpt-oss-120b, LLM-jp-4-thinking, Mistral (managed API version)

19. Related resources

Links to the official artifacts and documentation that define each model's specification are as follows.

Last Modified: April 5, 2026