What Is Gemma 4 assistant-MTP: How Multi-Token Prediction Draft Models Speed Up Inference

This article explains what Gemma 4 assistant-MTP is: not a standalone chat model, but a draft model paired with the main model for Multi-Token Prediction and speculative decoding, which speeds up generation without changing the final output distribution.

When you see names like assistant-MTP, assistant, or MTP drafter around Gemma 4 models, do not treat them as standalone chat models.

A more accurate description is this: they are Multi-Token Prediction draft models paired with Gemma 4 main models, used for Speculative Decoding.

In one sentence: the main model makes the final decision, while assistant-MTP drafts ahead. If the draft is right, the main model can confirm multiple tokens at once, and generation becomes faster.

What It Actually Is

MTP means Multi-Token Prediction.

Gemma 4 assistant-MTP can be understood as a lightweight helper model, also often called a drafter, draft model, or draft head. It is usually paired with the corresponding Gemma 4 main model, for example:

  • Gemma 4 12B with the matching 12B assistant-MTP
  • Gemma 4 26B with the matching 26B assistant-MTP
  • Gemma 4 31B with the matching 31B assistant-MTP

Its job is not to answer the user directly, but to predict the next few likely tokens for the main model.

The final output is still verified and decided by the main model. So assistant-MTP is more like a “lookahead reader” or “draft head”, not a new chat model.

Why Normal Generation Is Slow

Traditional autoregressive language models usually generate text like this:

  1. Predict the next token from the existing context.
  2. Add that token back into the context.
  3. Predict the next token.
  4. Repeat until generation is complete.
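
The loop above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `toy_model` is a made-up stand-in that continues a fixed phrase, and the call counter only exists to show that every generated token costs one model call.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str], prompt: List[str],
             max_new_tokens: int, eos: str = "<eos>") -> List[str]:
    """Plain autoregressive decoding: one model call per generated token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # predict the next token from the context
        if token == eos:
            break
        context.append(token)        # add it back into the context and repeat
    return context[len(prompt):]

# Hypothetical stand-in for a model: it always continues a fixed phrase.
PHRASE = ["the", "quick", "brown", "fox", "<eos>"]

def toy_model(context: List[str]) -> str:
    return PHRASE[min(len(context), len(PHRASE) - 1)]

calls = 0

def counting_model(context: List[str]) -> str:
    """Wrap the toy model so we can count how many calls were needed."""
    global calls
    calls += 1
    return toy_model(context)

out = generate(counting_model, [], max_new_tokens=10)
print(out, "model calls:", calls)
```

Four tokens plus the end-of-sequence check take five calls, and none of them can start before the previous one has finished.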

This process is stable, but naturally serial. Even if the next few tokens are easy to guess, such as fixed formats, code templates, or common phrases, the model still computes them one by one.

In local inference or consumer GPU scenarios, token-by-token generation amplifies the memory bandwidth bottleneck: every generated token requires reading a large share of the model weights from memory again, so the compute units often sit partly idle while waiting for data.

MTP exploits that idle compute: a lighter draft model guesses multiple tokens first, then hands them to the main model for parallel verification.
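
A rough back-of-the-envelope check makes the bandwidth bottleneck concrete. The sketch below assumes decoding is purely bandwidth-bound and that all weights are read once per token; it ignores KV cache reads, compute time, and framework overhead. The 7 GB and 350 GB/s figures are illustrative, not measurements of any Gemma 4 model or GPU.

```python
GB = 1e9

def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every token must stream all weights once."""
    return bandwidth_bytes_per_s / model_bytes

# Illustrative numbers only: 7 GB of quantized weights on a 350 GB/s memory bus.
estimate = max_tokens_per_second(7 * GB, 350 * GB)
print(f"at most about {estimate:.0f} tokens/s")
```

Verifying several drafted tokens in one pass reads the weights once for all of them, which is why the same hardware can emit more tokens per second.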

How Speculative Decoding Works

The process can be broken into four steps:

  1. assistant-MTP first predicts several future tokens.

    For example, it may guess 4 candidate tokens at once.

  2. The main model reads those candidate tokens.

    The main model does not blindly trust the draft. In a single forward pass it scores every candidate position and checks each drafted token against its own prediction for that position.

  3. Correctly guessed tokens are accepted.

    If the first 3 tokens pass verification, it is equivalent to the main model generating 3 tokens in one step.

  4. Generation rolls back at the first wrong position.

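    If the 4th token is not accepted, it and anything drafted after it are discarded. In the standard scheme, the main model's own choice for that position is already available from the verification pass, and the next round of drafting continues from there.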

So this is not “trading quality for speed”. The main model still performs the final verification; likely-correct future tokens are simply brought forward for checking.
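
The four steps can be sketched with toy models under greedy decoding. Everything here is hypothetical: `target` and `draft` are made-up functions over integers, and the per-position loop stands in for what a real engine would do in one batched forward pass. The sketch follows the usual formulation in which the main model supplies its own token at the first mismatch.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a context to its greedy next token

def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    # 1. The draft model proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # 2. The target scores every drafted position (one batched pass in a real engine).
    target_choice = [target(context + proposal[:i]) for i in range(k + 1)]
    # 3. Accept the longest prefix on which draft and target agree.
    accepted: List[int] = []
    for i in range(k):
        if proposal[i] != target_choice[i]:
            break
        accepted.append(proposal[i])
    # 4. At the first mismatch (or after a full match) emit the target's own token.
    accepted.append(target_choice[len(accepted)])
    return accepted

def decode(target: Model, draft: Model, prompt: List[int],
           n_tokens: int, k: int) -> Tuple[List[int], int]:
    out: List[int] = []
    passes = 0
    while len(out) < n_tokens:
        out.extend(speculative_step(target, draft, prompt + out, k))
        passes += 1
    return out[:n_tokens], passes

def target(ctx: List[int]) -> int:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def draft(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except when the answer is a multiple of 3."""
    t = (ctx[-1] + 1) % 10
    return t if t % 3 else -1

spec_out, passes = decode(target, draft, [0], n_tokens=12, k=4)

# Reference: plain greedy decoding with the target alone.
plain_out: List[int] = []
for _ in range(12):
    plain_out.append(target([0] + plain_out))

print(spec_out == plain_out, "target passes:", passes, "vs", len(plain_out))
```

The output matches plain greedy decoding token for token, while the toy target is invoked for far fewer verification rounds than there are tokens.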

Why The Output Can Stay Consistent

The easiest misunderstanding about speculative decoding is: if a smaller model is involved, does the result get worse?

In standard speculative decoding, the answer is usually no. The draft model only proposes candidates, while the main model accepts or rejects them. Accepted tokens must conform to the main model’s sampling logic; tokens that do not conform are rejected.

That means, in theory, the final output distribution can remain consistent with generation without a draft model. Google’s positioning for the Gemma 4 MTP drafter is also speed improvement without reducing output quality or reasoning behavior.

In actual engineering, the final effect still depends on the inference framework implementation, sampling parameters, completeness of MTP support, and whether the main model and assistant model are correctly paired.
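
To see why the distribution can be preserved when sampling, here is a small numeric check of the acceptance rule usually given for speculative sampling: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized positive part of p - q. The distributions `p` (main model) and `q` (draft) are made-up numbers, and whether a given framework implements exactly this rule for Gemma 4 is an assumption to verify.

```python
from typing import List

def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token when drafting from q and verifying with p."""
    # Probability that x is drafted AND accepted: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]

p = [0.6, 0.3, 0.1]  # hypothetical main-model distribution
q = [0.2, 0.5, 0.3]  # hypothetical draft distribution, deliberately quite different
result = speculative_output_distribution(p, q)
print(result)
```

Even with a poor draft, the emitted token follows `p`; a bad draft costs speed through more rejections, not quality.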

Why It Can Speed Up Generation

The speedup comes from two factors:

  • The draft model is lighter, so predicting candidate tokens is cheaper.
  • The main model can verify multiple candidate tokens at once, reducing token-by-token waiting.

If assistant-MTP guesses accurately, one forward pass of the main model can accept multiple tokens, improving throughput noticeably. When Google announced this, it mentioned that Gemma 4 with an MTP drafter can achieve up to about 3x speedup on some hardware and frameworks.

But that number is not guaranteed in every scenario. The actual speedup depends on:

  • Main model size.
  • How well the assistant model matches the main model.
  • How many speculative tokens are predicted each time.
  • Prompt type.
  • Sampling temperature.
  • Inference framework implementation.
  • GPU / CPU / memory bandwidth.

In general, formatted text, code, fixed structures, and common phrases are easier for the draft model to predict. Highly open-ended, random, high-temperature generation may see weaker acceleration.
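
A simple model helps relate hit rate to throughput. Assuming each drafted token is accepted independently with probability `alpha` (a simplification, since real acceptance varies by position and prompt), the expected number of tokens produced per main-model pass with `k` drafted tokens is a geometric sum. The sum includes the one token the main model always contributes itself.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass: 1 + alpha + ... + alpha**k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 3))
```

Under this model a structured prompt with a high hit rate yields several tokens per pass, while a 50% hit rate yields fewer than two.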

How To Use It

assistant-MTP requires inference framework support. Downloading an assistant model does not make it directly usable as a chat model.

There are two common usage patterns.

Method 1: Main Model With Built-In MTP Support

Some frameworks can read Gemma 4’s MTP-related structure directly and enable it through parameters. For example, one approach discussed in the vLLM community is to pass a speculative config:

vllm serve google/gemma-4-31B-it \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This method may not require separately specifying an assistant model, depending on the model format and framework implementation.

Method 2: Load The Assistant / Drafter Model Separately

In local inference scenarios such as GGUF / llama.cpp, it is more common to load the main model and draft model separately. The idea looks like this:

llama-server \
  -m gemma-4-12B-it-Q4_K_M.gguf \
  --model-draft gemma-4-12B-it-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --ctx-size 8192

The key points are:

  • -m points to the main model.
  • --model-draft points to the assistant-MTP draft model.
  • --spec-type draft-mtp enables MTP draft mode.
  • --spec-draft-n-max 4 means drafting up to 4 tokens.

Parameters may change across llama.cpp versions, so check the current --help output and model card before using them.

How To Tune The Parameters

--spec-draft-n-max

This parameter controls the maximum number of tokens assistant-MTP drafts at once.

Start with a small value:

--spec-draft-n-max 2

Then try:

--spec-draft-n-max 4

A larger value is not always faster. If the draft hit rate drops, the main model will reject candidates frequently, wasting compute instead.
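
The trade-off can be explored with a toy cost model before benchmarking. The assumptions are loud ones: each drafted token is accepted independently with probability `alpha`, one verification pass costs the same as one normal main-model pass, and each drafted token costs a fraction `draft_cost` of a main-model pass. The values 0.6 and 0.15 are invented for illustration; measure your own.

```python
def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit of main-model time, relative to plain decoding (= 1.0)."""
    expected_tokens = sum(alpha ** i for i in range(k + 1))  # accepted drafts + 1 target token
    round_cost = 1.0 + draft_cost * k                        # one verify pass + k draft steps
    return expected_tokens / round_cost

alpha, draft_cost = 0.6, 0.15  # hypothetical hit rate and relative draft cost
candidates = [1, 2, 3, 4, 6, 8]
speedups = {k: estimated_speedup(alpha, k, draft_cost) for k in candidates}
best_k = max(speedups, key=speedups.get)

for k, s in speedups.items():
    print(f"n-max={k}: ~{s:.2f}x")
print("best under this model:", best_k)
```

With these made-up numbers the estimate peaks at a small draft length and falls off as more drafted tokens are thrown away, which is the behaviour to look for in real measurements.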

temperature

The higher the temperature, the more random the output, and the harder it is for assistant-MTP to predict the main model’s next tokens.

If the goal is speed and stability, start with:

--temp 0.7

Or lower:

--temp 0.4

For code completion, format repair, and structured output tasks, lower temperature is usually more suitable.
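
The effect of temperature can be illustrated with a toy calculation. It assumes a simplified setup in which the drafter proposes a single greedy guess that happens to be the main model's top token, and the main model then samples at temperature T; the probability of a match is just the softmax probability of that token. The logits are invented, and real engines may use a different acceptance rule.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

target_logits = [4.0, 2.5, 1.0, 0.5]  # hypothetical main-model logits for 4 tokens
draft_guess = 0                        # assumed: the drafter guesses the top token

def match_probability(temperature: float) -> float:
    """Chance that the main model's sampled token equals the drafter's guess."""
    return softmax(target_logits, temperature)[draft_guess]

low, mid, high = (match_probability(t) for t in (0.4, 0.7, 1.5))
print(f"T=0.4: {low:.2f}  T=0.7: {mid:.2f}  T=1.5: {high:.2f}")
```

As the temperature rises, probability mass spreads to other tokens and the same correct guess is confirmed less often.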

Context Length

MTP is not VRAM magic. Both the main model and the draft model consume resources, and long contexts still consume KV cache.

On 8GB or 12GB VRAM machines, do not start with 64K / 128K context. Try:

--ctx-size 8192

After confirming stability, increase it gradually.
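
A quick estimate shows why context length matters so much. The sketch uses the common formula for an unquantized KV cache with full attention on every layer: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The architecture numbers are placeholders, not Gemma 4's real configuration, and models with sliding-window layers or a quantized cache will need less.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size assuming full attention on every layer (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_elem

# Placeholder architecture: 48 layers, 8 KV heads, head dimension 128.
for ctx in (8192, 65536, 131072):
    size = kv_cache_bytes(48, 8, 128, ctx)
    print(f"ctx={ctx}: {size / GIB:.1f} GiB")
```

With these placeholder numbers, 8K context costs about 1.5 GiB while 128K costs about 24 GiB, before counting the weights of either model.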

Suitable Tasks

assistant-MTP is more suitable for these scenarios:

  • Code completion.
  • Structured output such as JSON / Markdown / XML.
  • Fixed-format reports.
  • Math steps or table-like output.
  • Low-temperature, more deterministic Q&A.
  • Reducing latency in local model chat.

What these tasks have in common is that later tokens follow stronger patterns, making them easier for the draft model to guess.

Unsuitable Tasks

It should not be treated as a tool for “making the model smarter”.

assistant-MTP does not make the main model more intelligent, nor does it improve factual accuracy. It solves generation speed, not reasoning quality.

These scenarios may see limited benefit:

  • Creative writing with very high temperature.
  • Highly random sampling.
  • A draft model that does not match the main model.
  • An inference framework with incomplete MTP support.
  • Machines where VRAM is already very tight and an extra draft model must still be loaded.

Especially on small-VRAM machines, remember that assistant-MTP also consumes VRAM or RAM. The speed benefit may be offset by the extra resource usage.

Common Misunderstandings

Misunderstanding 1: assistant-MTP Is A Chat Model

No. It is an auxiliary model for speculative decoding with the main model. Chatting with it directly is not meaningful and may produce poor results.

Misunderstanding 2: Output Must Be Identical After Enabling MTP

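The theoretical goal is to preserve the main model’s output distribution, because the main model performs the final verification. That is not the same as identical text: with sampling enabled, two runs can differ token for token, just as two runs without MTP would. Engineering implementation, sampling parameters, and framework versions can also affect real behavior. Compare and test before production use.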

Misunderstanding 3: Larger --spec-draft-n-max Is Always Better

Not necessarily. Drafting more tokens may also increase the chance of wrong guesses. Watch acceptance rate and tokens/s, not just the parameter value.

Misunderstanding 4: It Solves VRAM Shortage

No. MTP is inference acceleration, not VRAM compression. On small-VRAM machines, sort out quantization, context length, and the number of GPU-offloaded layers first, then consider MTP.

How To Tell Whether It Really Speeds Things Up

Do not rely only on feel. Use the same group of prompts for an A/B test:

  1. Run without MTP and record tokens/s.
  2. Run with MTP and record tokens/s.
  3. Keep the same model, context, temperature, and prompt.
  4. Compare output quality and latency.
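
A small harness keeps the comparison honest. In this sketch, `generate` is a hypothetical callable that sends one prompt to your server and returns the number of generated tokens; how it talks to the server is left to you. The `fake_backend` helper only sleeps, so the script runs without any model and shows the shape of the report.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

def tokens_per_second(generate: Callable[[str], int], prompts: List[str]) -> float:
    """Average tokens/s over a fixed prompt set, timed with a wall clock."""
    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return mean(rates)

def compare(baseline: Callable[[str], int], with_mtp: Callable[[str], int],
            prompts: List[str]) -> Dict[str, float]:
    """Run the same prompts through both setups and report the ratio."""
    base = tokens_per_second(baseline, prompts)
    mtp = tokens_per_second(with_mtp, prompts)
    return {"baseline_tps": base, "mtp_tps": mtp, "speedup": mtp / base}

def fake_backend(seconds_per_call: float, tokens: int = 50) -> Callable[[str], int]:
    """Stand-in for a real server call: sleeps, then reports a token count."""
    def generate(prompt: str) -> int:
        time.sleep(seconds_per_call)
        return tokens
    return generate

prompts = ["markdown table to csv", "fix this json", "linux troubleshooting steps"]
report = compare(fake_backend(0.04), fake_backend(0.02), prompts)
print(report)
```

Swap the two fake backends for real calls against a server started without and with MTP, keeping model, context size, and temperature identical.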

You can test with three types of prompts:

Write a Python function that converts a Markdown table to CSV.

Fix the following JSON and output only valid JSON:
{"name":"demo","items":[{"id":1,"tags":["a","b",],},]}

Output 5 Linux troubleshooting steps in a fixed format. Each item should include: problem, command, and judgment criteria.

If these structured tasks become noticeably faster without output quality dropping, assistant-MTP is worth keeping in your environment.

Summary

Gemma 4 assistant-MTP is a Multi-Token Prediction draft model used together with the main model. It uses speculative decoding to predict multiple tokens ahead of time, then lets the main model verify them in parallel, reducing the latency of token-by-token generation.

Its value is speed, not improved model capability. The correct usage pattern is: the main model owns the final output, assistant-MTP drafts ahead, and the inference framework verifies and accepts candidate tokens.

If you can already run Gemma 4 stably, then consider MTP. Start with a small number of speculative tokens, observe tokens/s, VRAM usage, and output quality, then decide whether to add it to your daily run script.
