* This document is a memo I researched and organized with the help of ChatGPT/Gemini in order to consolidate my own understanding.
Gemma 4 chat template specification summary
This document is a summary of the chat template specification of the google/gemma-4-31B-it model. It is organized primarily around the publicly available chat_template.jinja and tokenizer_config.json, cross-checked against the Gemma 4 Model Card, the Function Calling guide on Google AI for Developers, and the Gemma4Processor implementation in Hugging Face Transformers.
0. Purpose of the document and its underlying structure
To accurately understand Gemma 4's prompt processing, you need to be aware that input and output are composed of the following three layers.
- Message API layer: the layer of data structures you specify in Python code, such as messages=[{"role": ..., "content": ...}], tools=[...], and enable_thinking=True/False.
- Chat template layer: the layer of the raw text protocol produced by processor.apply_chat_template(..., tokenize=False). Here, control tokens such as <|turn> and <|tool_call> become visible as strings.
- Processor expansion layer: the layer in which Gemma4Processor internally expands high-level placeholders for images, audio, video, and so on into the long special-token sequences that are actually passed to the tokenizer.
Not confusing these layers is the key to avoiding implementation trouble. In particular, note that tokens such as <|image|>, <|audio|>, and <|video|> are merely high-level placeholders at the template layer; in the final string passed to the tokenizer, they are expanded into even finer token sequences.
1. Overview of the Gemma 4 prompt protocol architecture
Gemma 4's official template converts the conversation history received from the API into the following custom markup format.
<bos>
<|turn>system
[optional: <|think|>][optional: system/developer content][optional: <|tool>...<tool|> ...]
<turn|>
<|turn>user
...
<turn|>
<|turn>model
[past assistant content / tool_calls / tool_responses]
<turn|>
...
<|turn>model
[generation start position]
This protocol has several important characteristics worth keeping in mind for implementation.
- Role conversion: the input assistant role is automatically converted to model by the template. In other words, what you defined as assistant at the API layer is emitted as <|turn>model at the raw-text layer.
- Reasoning trigger: the switch that enables the model's reasoning process is the <|think|> token. It is inserted at the very beginning of the first system turn.
- Tool declaration and invocation: tool declarations are enumerated inside the system turn in the form <|tool>declaration:...<tool|>. The actual tool calls and responses, on the other hand, are placed inside the assistant's turn as <|tool_call>...<tool_call|> and <|tool_response>...<tool_response|>.
- Exclusion of thinking from history: when past assistant outputs are re-fed as history, the contents of the thinking channel are automatically removed. By design, the past thinking process itself is not retained in the multi-turn conversation history.
- Non-JSON format: the format emitted by apply_chat_template(..., tokenize=False) is not JSON. It has a pseudo-JSON-like structure, but keys are not quoted, and string values are enclosed by <|"|>. Type names are also written in uppercase, such as STRING or OBJECT, in a custom format.
2. Special token specification
The special tokens used in Gemma 4 can be broadly divided into two categories by purpose. Here we organize each one's role and the processing layer in which it primarily appears.
2.1 Tokens for conversation, reasoning, and tool calls
The main tokens involved in controlling the conversation, the reasoning process, and tool calls are as follows. These are primarily based on the settings in tokenizer_config.json.
| tokenizer_config key | String representation | Purpose | Layer where it mainly appears |
| bos_token | <bos> | Beginning-of-sequence token. It is always inserted at the start of the template. | Template |
| eos_token | <eos> | End-of-sequence token. The template itself does not insert it automatically. | Tokenizer, general |
| sot_token | <|turn> | Marks the start of an utterance. | Template |
| eot_token | <turn|> | Marks the end of an utterance. | Template |
| soc_token | <|channel> | Marks the start of a channel. Used, for example, at the beginning of a thinking block. | Model output / history formatting |
| eoc_token | <channel|> | Marks the end of a channel. | Model output / history formatting |
| think_token | <|think|> | Control token that enables thinking mode. | Template |
| std_token | <|tool> | Starts a tool declaration. | Template |
| etd_token | <tool|> | Ends a tool declaration. | Template |
| stc_token | <|tool_call> | Starts a tool call. | Model output / history |
| etc_token | <tool_call|> | Ends a tool call. | Model output / history |
| str_token | <|tool_response> | Starts a tool execution result. | History |
| etr_token | <tool_response|> | Ends a tool execution result. | History |
| escape_token | <|"|> | Used as a delimiter for string literals inside tool schemas, arguments, and responses. | Tool-related processing in general |
| pad_token | <pad> | Padding token. | Tokenizer, general |
| mask_token | <mask> | Mask token. | Tokenizer, general |
| unk_token | <unk> | Represents an unknown vocabulary item. | Tokenizer, general |
2.2 Tokens for multimodal input
These are tokens for processing multimodal data such as images, audio, and video.
| tokenizer_config key | String representation | Purpose | Layer where it mainly appears |
| image_token | <|image|> | Placeholder for an image. | Template |
| boi_token | <|image> | Boundary marking the start of the soft-token sequence for an image/video. | Processor expansion |
| eoi_token | <image|> | Boundary marking the end of the soft-token sequence for an image/video. | Processor expansion |
| audio_token | <|audio|> | Placeholder for audio. | Template |
| boa_token | <|audio> | Boundary marking the start of the soft-token sequence for audio. | Processor expansion |
| eoa_token | <audio|> | Boundary marking the end of the soft-token sequence for audio. | Processor expansion |
| extra_special_tokens[0] | <|video|> | Placeholder for video. | Template |
As a caveat regarding multimodal support, the 31B dense model does not support the audio encoder; native audio support is limited to the E2B/E4B models. On the other hand, the tokenizer and the template itself do provide audio tokens and branches. Because the template is shared across the model family, it is safest to recognize that the actual capability depends on the individual model.
Also note that in the current Transformers implementation, <|video|> is re-registered as an additional special token.
3. Serialization rules of the chat template
Here we explain by what rules messages are converted into the raw text protocol.
3.1 Conditions for generating the system block
The template automatically emits a system turn at the beginning when any of the following conditions is met.
- When enable_thinking=True is set.
- When tools are defined.
- When the role of the first message is system or developer.
In other words, even if you do not explicitly provide a system message, using thinking mode or tools automatically generates a system block.
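The three conditions above can be sketched as a small predicate. This is a minimal illustration of the rule as described here, not the template's actual Jinja logic; the function name needs_system_block is hypothetical.

```python
def needs_system_block(messages, tools=None, enable_thinking=False):
    """Return True when a leading system turn would be emitted (sketch).

    Hypothetical helper: it mirrors the three conditions in section 3.1 and
    is not taken from chat_template.jinja itself.
    """
    # Unlike the real template, this sketch tolerates an empty message list.
    first_role = messages[0]["role"] if messages else None
    return (
        bool(enable_thinking)
        or bool(tools)
        or first_role in ("system", "developer")
    )


user_only = [{"role": "user", "content": "What is 2 + 2?"}]
print(needs_system_block(user_only))                        # no system block
print(needs_system_block(user_only, enable_thinking=True))  # system block emitted
```

A check like this is mainly useful in tests, where you want to predict whether the raw prompt should start with <|turn>system before comparing strings.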
3.2 Role mapping rules
In the message loop processing, role names are converted as follows. In particular, note that assistant becomes model.
- assistant is converted to model.
- user is kept as user.
- system is kept as system.
- developer is absorbed into the system block if it is the first message; otherwise the implementation emits it as developer.
In practice, to avoid unexpected behavior, it is recommended not to use the developer role anywhere other than the first message.
3.3 Basic structure of a turn
In principle, each message is emitted in the following format.
<|turn>{role}
{serialized body}<turn|>
As an exception, however, the trailing <turn|> is suppressed only when the assistant turn holds tool_responses and its content is empty. This is by design, so that the model generates the final answer as a continuation of the same turn, right after the tool execution result.
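The role mapping in 3.2 and the basic turn shape in 3.3 can be sketched together as follows. This is a simplified illustration for plain-string messages only: render_turn is a hypothetical name, the newline after <turn|> is inferred from the output examples in this document, and the tool_responses exception is deliberately not modeled.

```python
# Only assistant is renamed; every other role name passes through unchanged.
ROLE_MAP = {"assistant": "model"}


def render_turn(role, body):
    """Render one plain-string message as a raw-text turn (sketch)."""
    raw_role = ROLE_MAP.get(role, role)
    # Assumption: the body is trimmed and each closed turn ends with a newline.
    return f"<|turn>{raw_role}\n{body.strip()}<turn|>\n"


print(render_turn("user", "What is 2 + 2?"), end="")
print(render_turn("assistant", "4"), end="")
```

Comparing the output of a sketch like this with the real apply_chat_template(..., tokenize=False) output is a quick way to confirm your mental model of the protocol.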
3.4 Adjacency rule between the system body and tool declarations
When tool declarations are emitted immediately after the first system body, the template does not insert any intentional line breaks or whitespace. As a result, the output is a densely packed text like the following.
<|turn>system
You are a helpful assistant.<|tool>declaration:get_current_weather{...}<tool|><turn|>
It may look hard to read at first glance, but this is the intended behavior.
3.5 Differences in processing depending on the content data type
The template's processing changes depending on the data type passed as content.
- Strings for user, system, and developer are trimmed and emitted as-is.
- Strings for assistant are processed through strip_thinking() before being emitted. By this design, even if you feed the raw output back into the history, the thinking block is automatically removed.
When content is passed as a list of elements instead, the template can directly process only the following four element types: text, image, audio, and video.
Types other than these are not explicitly processed. Also, multiple elements are simply concatenated in order. In particular, each text element is individually trimmed, so if you split a piece of text across multiple text elements, whitespace at the boundaries may be unintentionally lost. Text where you want to preserve spaces or line breaks between words is safer to pass as a single text element.
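The whitespace loss described above is easy to reproduce with a small sketch. The code below assumes the four element types are text, image, audio, and video, uses the placeholder strings listed in section 7.1, and silently skips unknown types; render_content is a hypothetical helper, not the template's own code.

```python
# Placeholder strings as listed in section 7.1 of this document.
PLACEHOLDERS = {
    "image": "\n\n<|image|>\n\n",
    "audio": "<|audio|>",
    "video": "\n\n<|video|>\n\n",
}


def render_content(content):
    """Serialize a message's content the way section 3.5 describes (sketch)."""
    if isinstance(content, str):
        return content.strip()
    parts = []
    for item in content:
        kind = item.get("type")
        if kind == "text":
            # Each text element is trimmed on its own before concatenation.
            parts.append(item["text"].strip())
        elif kind in PLACEHOLDERS:
            parts.append(PLACEHOLDERS[kind])
        # Assumption: any other element type is simply ignored.
    return "".join(parts)


split_text = [{"type": "text", "text": "Hello "}, {"type": "text", "text": " world"}]
print(repr(render_content(split_text)))  # the space between the words is lost
```

Passing {"type": "text", "text": "Hello world"} as a single element avoids the problem.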
4. Format specification of the reasoning process
4.1 Conditions for enabling thinking mode and the prompt structure
Enabling thinking mode is simple at the template level: when enable_thinking=True is passed, the template inserts the <|think|> token at the very beginning of the first system turn.
When thinking mode is enabled, an empty thinking block is not pre-appended to the end of the prompt. The end of the prompt looks like the following, and the intent is that a model that recognizes <|think|> starts generating its own reasoning process.
<bos><|turn>system
<|think|>You are a helpful assistant.<turn|>
<|turn>user
What is 2 + 2?<turn|>
<|turn>model
In actual model output, the reasoning process is generated enclosed in <|channel>thought and <channel|>, followed by the final answer.
4.2 Prompt structure when thinking mode is disabled
When thinking mode is disabled, the template inserts an empty thought channel in advance at the end of the generation prompt.
<|turn>model
<|channel>thought
<channel|>
The model then generates only the final answer directly, following this empty block.
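The difference between the two generation prompts in 4.1 and 4.2 can be summarized in a few lines. This is a sketch under the assumption that the line breaks match the examples shown in this document; generation_prompt is a hypothetical name.

```python
def generation_prompt(enable_thinking):
    """Return the text that opens the model turn to be generated (sketch)."""
    prompt = "<|turn>model\n"
    if not enable_thinking:
        # Thinking disabled: an empty thought channel is pre-filled so that
        # the model goes straight to the final answer.
        prompt += "<|channel>thought\n<channel|>"
    return prompt


print(repr(generation_prompt(enable_thinking=True)))
print(repr(generation_prompt(enable_thinking=False)))
```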
4.3 Rule for discarding thinking history across multiple turns
In a multi-turn conversation, past assistant utterances are re-fed as history, but the thinking process is designed not to be retained in the history at that point.
A process called strip_thinking(text) inside the template automatically removes the parts enclosed by <|channel> ... <channel|> from the history string, so that only the final answer part is carried over to the next inference.
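For reference, the effect of this removal can be approximated in Python with a regular expression. This is only an equivalent sketch of the behavior described above; the real strip_thinking is a Jinja macro inside the template, and its exact implementation may differ.

```python
import re

# Matches everything from a channel opener to the next channel closer.
_THINKING_BLOCK = re.compile(r"<\|channel>.*?<channel\|>", re.DOTALL)


def strip_thinking(text):
    """Remove <|channel> ... <channel|> spans and trim the rest (sketch)."""
    return _THINKING_BLOCK.sub("", text).strip()


raw = "<|channel>thought\nCompute 2+2 briefly.<channel|>4"
print(strip_thinking(raw))
```

The same helper is handy on the application side when you want to store only the final answer in your own message history.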
4.4 Differences between the official documentation and the actual implementation
In the public Jinja template, an empty thought block is appended when thinking is disabled, but some output examples in Google's official documentation are written with this part omitted.
When debugging or developing, it is safest to treat the raw text actually emitted by apply_chat_template() as authoritative, rather than the examples shown in the documentation.
5. Tool serialization specification
5.1 Placement of tool declarations
When available tool information is passed as tools=[...], it is enumerated inside the first system turn. Specifically, it is written in the following format.
<|tool>declaration:{function_name}{...}<tool|>
5.2 The custom DSL specification of tool schemas
Tool schemas are converted into Gemma's own DSL and serialized, rather than a common JSON format. The main characteristics are as follows.
- Key names use bare, unquoted notation.
- String values are enclosed by <|"|>.
- Booleans are lowercase true / false.
- Type names are all uppercase, such as STRING, OBJECT, and ARRAY.
- Arrays are expressed with [ ... ] and objects with { ... }.
Because assembling this string by hand is error-prone, it is basically safest to leave it to the apply_chat_template() processing.
5.3 Rules for converting schema parameters into the prompt
Not every JSON Schema keyword is embedded into the prompt as-is. The main information that the template processes and reflects is as follows.
- The function's name and description
- The parameters' type, properties, and required
- Each property's description, type, and nullable
- The enum of a string-typed property
- The structure of array items and an optional response declaration
Also, as a major implementation characteristic, dictionaries of properties and arguments are output sorted in dictionary (lexicographic) order rather than in their original insertion order.
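To make the DSL rules in 5.2 and the key sorting concrete, here is a rough serializer sketch. It is an approximation for study purposes only: to_dsl and format_declaration are hypothetical names, the sketch serializes every key it is given (the real template only reflects the keywords listed in 5.3), and it assumes that uppercasing applies to string values stored under a type key.

```python
QUOTE = '<|"|>'


def to_dsl(value, key=None):
    """Serialize a Python value into the Gemma-style DSL (sketch)."""
    if isinstance(value, bool):  # must be checked before int
        return "true" if value else "false"
    if isinstance(value, str):
        text = value.upper() if key == "type" else value
        return f"{QUOTE}{text}{QUOTE}"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, dict):
        # Keys are bare and emitted in lexicographic order.
        inner = ",".join(f"{k}:{to_dsl(v, k)}" for k, v in sorted(value.items()))
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(to_dsl(v) for v in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def format_declaration(function):
    """Wrap one function schema in <|tool>declaration:...<tool|> (sketch)."""
    body = {k: v for k, v in function.items() if k != "name"}
    return f"<|tool>declaration:{function['name']}{to_dsl(body)}<tool|>"


schema = {
    "name": "get_current_weather",
    "description": "Gets the weather.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "The city."}},
        "required": ["location"],
    },
}
print(format_declaration(schema))
```

Even with a sketch like this at hand, the string used for inference should still come from apply_chat_template(); the sketch is only meant to help you read that output.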
5.4 Format of a tool execution request
When the model calls a function, it is output in the following format.
<|tool_call>call:{function_name}{arg1:value1,arg2:value2,...}<tool_call|>
Argument values are also formatted by custom rules: strings are expressed as <|"|>text<|"|>, numbers as-is, and booleans as true / false.
5.5 Format of a tool execution result
When returning a function's execution result to the model, you include it as tool_responses inside the assistant's message. Rather than using an independent tool role, the result is nested inside the assistant's turn.
- Object response: emitted in a form like {temperature:15,weather:<|"|>sunny<|"|>}.
- Scalar-value response: if it is not a mapping form, the template automatically wraps it with a value: key and formats it like {value:<|"|>result<|"|>}.
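The call format in 5.4 and the response format in 5.5 can be reproduced with a short sketch. The helper names format_value, format_tool_call, and format_tool_response are hypothetical, and the handling of value types beyond strings, numbers, booleans, lists, and dictionaries is left out on purpose.

```python
QUOTE = '<|"|>'


def format_value(value):
    """Serialize an argument or response value (sketch)."""
    if isinstance(value, bool):  # must be checked before int
        return "true" if value else "false"
    if isinstance(value, str):
        return f"{QUOTE}{value}{QUOTE}"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, dict):
        # Keys are emitted in lexicographic order, not insertion order.
        inner = ",".join(f"{k}:{format_value(v)}" for k, v in sorted(value.items()))
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(format_value(v) for v in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def format_tool_call(name, arguments):
    return f"<|tool_call>call:{name}{format_value(arguments)}<tool_call|>"


def format_tool_response(name, response):
    # Non-mapping results are wrapped under a value key.
    if not isinstance(response, dict):
        response = {"value": response}
    return f"<|tool_response>response:{name}{format_value(response)}<tool_response|>"


print(format_tool_call("get_current_weather", {"location": "Tokyo, JP"}))
print(format_tool_response("get_current_weather", {"weather": "sunny", "temperature": 15}))
```

Note that the response dictionary above is written with weather first, yet temperature comes out first because of the key sorting.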
5.6 Coexistence of call and response within the same turn
In the public template, the ordering of elements within the assistant's turn is strictly fixed.
tool_calls
tool_responses
content
Because they are processed in this order, the tool call, its result, and the final answer all coexist within the same turn.
5.7 Generating the final answer based on the tool execution result
The common flow is that, immediately after the application executes the function, you add an assistant turn that does not yet have content to the history and run inference again.
At this point, the template deliberately does not emit the turn-ending token <turn|>, and lets the model generate the continuation, the final answer, from the position right after the tool response is written. This is the core behavior of Gemma 4's function calling.
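The ordering rule in 5.6 and the suppression of <turn|> described here can be sketched as one function. It takes already-serialized tool call and tool response strings to stay short; render_model_turn is a hypothetical name, and the newline after a closed turn is inferred from the examples in this document.

```python
def render_model_turn(tool_calls=(), tool_responses=(), content=""):
    """Assemble one assistant turn in the fixed order described above (sketch)."""
    # Order is fixed: tool_calls, then tool_responses, then content.
    body = "".join(tool_calls) + "".join(tool_responses) + content
    # The turn is left open only when responses exist and content is empty,
    # so that the model continues with the final answer in the same turn.
    leave_open = bool(tool_responses) and not content
    return "<|turn>model\n" + body + ("" if leave_open else "<turn|>\n")


call = '<|tool_call>call:get_current_weather{location:<|"|>Tokyo, JP<|"|>}<tool_call|>'
resp = '<|tool_response>response:get_current_weather{temperature:15}<tool_response|>'
print(repr(render_model_turn([call], [resp])))           # open turn
print(repr(render_model_turn([call], [resp], "15 C.")))  # closed turn
```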
5.8 Simultaneous calling of and responding with multiple tools
When calling multiple tools at once or returning multiple results, the respective tags are simply laid out in order by a loop.
<|tool_call>call:f1{...}<tool_call|><|tool_call>call:f2{...}<tool_call|>
Even when multiple tags are laid out consecutively, it is designed so that the tokenizer's regex iterator can split and parse each one individually.
5.9 Choosing between automatic schema generation and manual definition
Google's official guide describes two methods: automatically generating a schema from a Python function, and passing a manually defined dictionary.
However, for complex functions such as those that take custom objects as arguments, there is a risk that internal property information is lost during the automatic conversion. Therefore, for complex schemas, manual schema definition is recommended.
6. Response parsing and schema specification
The response_schema for parsing Gemma 4's responses is embedded in tokenizer_config.json.
Reading through the structure of this schema reveals that the assistant's output is designed to be decomposed mainly into the following four elements.
- role: always assistant
- thinking: the reasoning process as an optional string
- content: the answer text as an optional string
- tool_calls: the list of tool calls as an optional array
What is important here is that tool_responses is not included as a parsing target. This is natural if you consider that a tool execution result is not something the model itself outputs to be parsed, but rather information that the application feeds back to the model as history.
6.1 Meaning of the top-level regex
The regular expression defined in response_schema.x-regex is roughly composed of the following order.
- An optional thought block
- An optional content
- An optional tool_calls block
- An optional trailing <turn|>
Therefore, by using processor.parse_response(), you can extract from the raw output string the <|channel>thought ... <channel|> part, the plain answer text, and each <|tool_call> ... <tool_call|> block.
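To illustrate what this decomposition looks like, here is a hand-written approximation in plain Python. It is explicitly not processor.parse_response() and not the regular expression stored in response_schema.x-regex; the patterns and the parse_raw_output name are assumptions made for this sketch, and tool call arguments are left as raw DSL text rather than decoded.

```python
import re

THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<\|tool_call>call:([^{]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
END_OF_TURN = "<turn|>"


def parse_raw_output(raw):
    """Split raw model output into thinking, content, and tool calls (sketch)."""
    text = raw[: -len(END_OF_TURN)] if raw.endswith(END_OF_TURN) else raw

    thinking = None
    match = THOUGHT_RE.search(text)
    if match:
        thinking = match.group(1).strip()
        text = text[: match.start()] + text[match.end():]

    tool_calls = [
        {"name": name, "raw_arguments": args}
        for name, args in TOOL_CALL_RE.findall(text)
    ]
    content = TOOL_CALL_RE.sub("", text).strip()
    return {
        "role": "assistant",
        "thinking": thinking,
        "content": content or None,
        "tool_calls": tool_calls,
    }


print(parse_raw_output("<|channel>thought\nCompute 2+2 briefly.<channel|>4<turn|>"))
```

In a real application the dedicated parser should be preferred; a sketch like this is mainly useful for understanding what the parsed fields correspond to in the raw string.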
6.2 Basic policy for debugging
When developing and debugging, it is effective to use different approaches depending on your goal.
- When you want to check the raw protocol: decode with skip_special_tokens=False. This lets you directly check the placement of internal special tokens.
- When you want to handle the reasoning process, answer, and tool calls in a structured way: make use of processor.parse_response().
7. Processing specification for multimodal input
7.1 Unifying placeholders at the template layer
When an input sequence contains images, audio, or video, the Jinja template inserts the following placeholder for each medium.
- Image: \n\n<|image|>\n\n
- Audio: <|audio|>
- Video: \n\n<|video|>\n\n
At the point of chat template processing, only one <|image|> is emitted per image, one <|audio|> per audio, and one <|video|> per video.
7.2 Soft-token expansion before tokenizer processing
These high-level placeholders are not passed to the model as-is. Just before being passed to the tokenizer, Gemma4Processor expands them into long internal placeholder sequences.
- Image: each <|image|> is replaced, according to the number of soft tokens per image, with multiple <|image|> sandwiched by the surrounding boundary tokens.
- Audio: each <|audio|> is similarly expanded into as many <|audio|> as the token count computed from the audio length, sandwiched by boundary tokens. In the current Gemma4Processor, this is dynamically computed from the audio waveform length, basically treated as one token per 40 ms with an upper limit of 750 tokens.
- Video: video is expanded into a format with a timestamp attached to each frame. Notably, even for video, the same <|image> / <image|> boundary tokens as for images are reused. A timestamp in mm:ss format is inserted before each frame's token sequence.
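The expansion rules above can be sketched conceptually as follows. This is not the Gemma4Processor code: the function names are hypothetical, every image is given the same soft-token count for simplicity, and rounding the audio length up to a whole token is an assumption. The exact layout of video frames is not reproduced; only the mm:ss timestamp formatting is shown.

```python
import math


def expand_image_placeholders(text, soft_tokens_per_image):
    """Replace each <|image|> with a bounded run of soft tokens (sketch)."""
    expanded = "<|image>" + "<|image|>" * soft_tokens_per_image + "<image|>"
    return text.replace("<|image|>", expanded)


def audio_soft_token_count(duration_ms, ms_per_token=40, max_tokens=750):
    """Roughly one token per 40 ms, capped at 750 (rounding up is assumed)."""
    return min(math.ceil(duration_ms / ms_per_token), max_tokens)


def format_timestamp(seconds):
    """Format a frame time as mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


print(expand_image_placeholders("<|image|>What is shown?", 3))
print(audio_soft_token_count(1_000), audio_soft_token_count(60_000))
print(format_timestamp(75))
```

A rough estimate like this helps when budgeting context length, since a single visible placeholder can turn into hundreds of tokens.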
7.3 Recommendations for the ordering of multimodal elements
The Gemma 4 model card recommends placing multimodal content such as images and audio before text.
By the template's specification, input elements are concatenated as-is in array order. Therefore, in practice, it is natural to place media elements first and put text last, as follows.
[
    {"type": "image", "url": "..."},
    {"type": "audio", "url": "..."},
    {"type": "text", "text": "What is shown in this image?"}
]
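If your application collects content elements in arbitrary order, a small helper can enforce the media-first ordering before the messages are handed to the template. This is a sketch; media_first is a hypothetical name and simply relies on Python's stable sort.

```python
def media_first(content):
    """Return content elements with non-text items before text items (sketch)."""
    # sorted() is stable, so the relative order inside each group is kept.
    return sorted(content, key=lambda item: item.get("type") == "text")


content = [
    {"type": "text", "text": "What is shown in this image?"},
    {"type": "image", "url": "..."},
]
print([item["type"] for item in media_first(content)])
```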
8. Prompt generation examples by use case
Here we show inputs for representative use cases and examples of the prompts generated by the template.
8.1 Minimal single-turn
The input example is as follows.
messages = [
    {"role": "user", "content": "Write a haiku about memory."}
]
The template output is as follows.
<bos><|turn>user
Write a haiku about memory.<turn|>
<|turn>model
<|channel>thought
<channel|>
From here, the model continues to generate the final answer.
8.2 Single-turn with system applied and thinking enabled
The input example is as follows. In addition to these messages, enable_thinking=True is passed to apply_chat_template().
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
The template output is as follows.
<bos><|turn>system
<|think|>You are a concise assistant.<turn|>
<|turn>user
What is 2 + 2?<turn|>
<|turn>model
In this case, the raw output returned by the model is expected to have the following form.
<|channel>thought
Compute 2+2 briefly.<channel|>4<turn|>
8.3 Tool declaration and call
The input example is as follows.
tools = [WEATHER_TOOL]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hey, what's the weather in Tokyo right now?"},
]
The standard output is as follows. The last line is the tool call that the model is expected to generate after the prompt.
<bos><|turn>system
You are a helpful assistant.<|tool>declaration:get_current_weather{description:<|"|>Gets the current weather in a given location.<|"|>,parameters:{properties:{location:{description:<|"|>The city and state, e.g. "San Francisco, CA" or "Tokyo, JP"<|"|>,type:<|"|>STRING<|"|>},unit:{description:<|"|>The unit to return the temperature in.<|"|>,enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>],type:<|"|>STRING<|"|>} },required:[<|"|>location<|"|>],type:<|"|>OBJECT<|"|>} }<tool|><turn|>
<|turn>user
Hey, what's the weather in Tokyo right now?<turn|>
<|turn>model
<|tool_call>call:get_current_weather{location:<|"|>Tokyo, JP<|"|>}<tool_call|>
The emitted tool call is parsed on the application side, and executed after its content has been validated.
8.4 Applying the tool response and generating the final answer
After executing the tool, you add a result like the following to the history.
{
    "role": "assistant",
    "tool_calls": [
        {"function": {"name": "get_current_weather", "arguments": {"location": "Tokyo, JP"}}}
    ],
    "tool_responses": [
        {"name": "get_current_weather", "response": {"temperature": 15, "weather": "sunny"}}
    ]
}
The portion of the turn that the template renders from this message is as follows.
<|turn>model
<|tool_call>call:get_current_weather{location:<|"|>Tokyo, JP<|"|>}<tool_call|><|tool_response>response:get_current_weather{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>
From here, the model continues to generate the body of the final answer. The final history output in Google's official example is as follows.
<|turn>model
<|tool_call>call:get_current_weather{location:<|"|>Tokyo, JP<|"|>}<tool_call|><|tool_response>response:get_current_weather{temperature:15,weather:<|"|>sunny<|"|>}<tool_response|>The current weather in Tokyo is 15 degrees and sunny.<turn|>
8.5 Multimodal input
The input example is as follows.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://.../cat.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
The output at the template layer is as follows.
<bos><|turn>user
<|image|>
What is shown in this image?<turn|>
<|turn>model
<|channel>thought
<channel|>
Meanwhile, the conceptual output at the processor expansion layer is as follows.
<bos><|turn>user
<|image><|image|><|image|>...<image|>
What is shown in this image?<turn|>
<|turn>model
<|channel>thought
<channel|>
The length of the ... omitted here varies depending on the image size and the max_soft_tokens setting.
9. Implementation considerations
9.1 Assistant history retention policy
The template is designed to remove the thought channel using strip_thinking(). Therefore, the canonical assistant content to store as message history should naturally be only the final answer, not the full raw text.
9.2 The responsibility boundary for tool execution
What Gemma 4 outputs is merely a string corresponding to a tool call object; the actual function execution must be carried out on the application side under its own responsibility. Google's official documentation also strongly recommends validating generated code and function calls.
9.3 Automatic sorting of dictionary key order
Because dictsort is used extensively internally, the keys of schemas, arguments, and responses are all output sorted in dictionary order. Be careful not to write test code that depends on the original insertion order of the Python dictionary.
9.4 Risk of whitespace loss when concatenating multiple text items
When you use multiple text items, each item is trimmed. Therefore, it is safest to avoid notation that intentionally creates a space by splitting items, such as ["Hello ", " world"].
9.5 Handling of an empty messages array
Because the template's implementation directly references messages[0] internally, passing an empty array as input is not safe.
9.6 Limitation on the automatic insertion of eos_token
The only general-purpose token this template explicitly emits is the leading <bos>. <eos> is not automatically added on the Jinja template.
9.7 Differences between documentation examples and the template implementation
Some display examples in the official documentation may appear not to exactly match the actual output of the current Jinja template. When implementing and debugging, it is recommended to treat the actual output of chat_template.jinja and apply_chat_template() as authoritative.
10. Recommended workflow for production
The recommended workflow for using Gemma 4 safely and reliably in production and during development is as follows.
- Check the raw string when prompting and debugging: during development, always check the output of apply_chat_template(..., tokenize=False) at least once. Understanding what string is actually passed to the model is the first step in troubleshooting.
- Do not skip special tokens when checking the raw output: if you want to check the behavior of protocol-level tokens, specify skip_special_tokens=False when decoding.
- Use the dedicated method for parsing in production: in a real application, use processor.parse_response() instead of parsing the raw output yourself.
- Manage tool execution with a whitelist approach: the tool calls the model returns are merely string requests. Execute them only after validating the function name and arguments.
- Define complex tool schemas manually: for complex functions involving nested objects or custom classes, it is safer to define the schema manually than to rely on automatic schema generation.
- Be aware of the internal expansion of multimodal input: even when only one placeholder such as <|image|> is visible on the template, it is internally expanded into many soft tokens. Design with a clear distinction between this placeholder layer and the internal expansion layer.
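As one possible shape for the whitelist approach mentioned in the workflow above, the sketch below validates the function name and argument keys before dispatching. Everything in it is hypothetical: the get_current_weather stub, the ALLOWED_TOOLS registry layout, and the execute_tool_call name are illustrations, not part of any Gemma or Transformers API.

```python
def get_current_weather(location, unit="celsius"):
    """Hypothetical stub standing in for a real weather lookup."""
    return {"location": location, "unit": unit}


# Hypothetical registry: only functions listed here may ever be executed.
ALLOWED_TOOLS = {
    "get_current_weather": {
        "func": get_current_weather,
        "required": {"location"},
        "optional": {"unit"},
    },
}


def execute_tool_call(name, arguments):
    """Validate a parsed tool call against the whitelist, then run it (sketch)."""
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        raise ValueError(f"tool not allowed: {name!r}")
    keys = set(arguments)
    missing = spec["required"] - keys
    unexpected = keys - spec["required"] - spec["optional"]
    if missing or unexpected:
        raise ValueError(f"bad arguments: missing={missing}, unexpected={unexpected}")
    return spec["func"](**arguments)


print(execute_tool_call("get_current_weather", {"location": "Tokyo, JP"}))
```

In practice you would also validate argument types and value ranges, but rejecting unknown function names and unexpected keys already covers the most common failure modes.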
11. References / primary sources
The descriptions in this document are based on the following official resources and implementation files.
Last Modified: April 6, 2026