Gemma 4 MTP Tuning: Pushing Toward 120 tokens/s With an Assistant Draft Model

A command-line guide to using the assistant-MTP draft model with Gemma 4 for speculative decoding: how to load the draft model in llama-cli, what -md, --draft-max, and -ngl do, and why 120 tokens/s should be treated as a tuning target that depends on your specific hardware.

If the main model, assistant draft model, and inference framework are correctly paired, MTP can noticeably speed up Gemma 4 on a local GPU. On some 12GB VRAM cards, such as the RTX 4070, suitable quantization and parameters may get you close to 120 tokens/s.

But this is not a number you can guarantee by copying a command. It is better treated as a tuning target: the model must run, VRAM must be sufficient, draft hit rate must be high, and sampling parameters must be stable before the speed looks impressive.

What MTP Does Here

MTP means Multi-Token Prediction.

A normal autoregressive model generates one token per forward pass. With assistant-MTP, the draft model first proposes several future tokens, and the main model then verifies them in parallel in a single pass. Every drafted token the main model agrees with is accepted at once, which reduces token-by-token waiting.

This mechanism goes by several names:

  • speculative decoding
  • draft-model acceleration
  • draft model / drafter

Its goal is acceleration, not improving model capability. Whether a token is accepted is still decided by the main model.
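
As a rough illustration of why the hit rate matters, here is a simplified model that is not part of the original guide. Assume each drafted token is accepted independently with probability α, and the draft model proposes k tokens per round (the --draft-max value). Each verification pass of the main model then yields, on average:

```latex
\mathbb{E}[\text{tokens per pass}]
  = 1 + \alpha + \alpha^{2} + \dots + \alpha^{k}
  = \frac{1 - \alpha^{k+1}}{1 - \alpha},
  \qquad 0 \le \alpha < 1
```

With k = 2, α = 0.8 gives about 2.44 tokens per pass, while α = 0.5 gives 1.75. Real acceptance is not independent from one position to the next, so treat this as intuition rather than a prediction.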

Command-Line Example

Here is a more advanced llama-cli reference command:

./llama-cli \
  -m gemma-4-12b-it-qat-GGUF.gguf \
  --draft-max 2 \
  -md gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf \
  -ngl 99 \
  -p "<|think|>\nWrite a short essay about quantum computing."

This command means:

  • Use gemma-4-12b-it-qat-GGUF.gguf as the main model.
  • Use gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf as the draft model.
  • Let the draft model predict at most 2 tokens per round.
  • Offload as many model layers to the GPU as possible.
  • Pass a prompt directly to test generation speed.

Note: parameter names may differ across llama.cpp versions. Some versions use -md, while others prefer --model-draft; some use --draft-max, while others use --spec-draft-n-max. Before testing, check:

./llama-cli --help

Or:

./llama-server --help
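
Checking by eye works, but the lookup can also be scripted. The sketch below is one possible helper, not something shipped with llama.cpp: pick_flag is a hypothetical function name, and it only knows the flag spellings mentioned in this article. It reads help text on stdin and prints the first candidate spelling it finds.

```shell
# Print the first candidate flag that appears as a whole option in the help text on stdin.
# pick_flag is a hypothetical helper; the candidates are passed as arguments.
pick_flag() {
  help_text=$(cat)
  for flag in "$@"; do
    # Require a delimiter on both sides so "-md" does not match inside a longer option.
    if printf '%s\n' "$help_text" | grep -qE -e "(^|[[:space:],])${flag}([[:space:],=]|\$)"; then
      printf '%s\n' "$flag"
      return 0
    fi
  done
  return 1
}

# Usage (requires a built llama-cli in the current directory):
#   ./llama-cli --help 2>&1 | pick_flag --draft-max --spec-draft-n-max
#   ./llama-cli --help 2>&1 | pick_flag -md --model-draft
```

If neither spelling is found, the function returns a non-zero status, which is a hint that the build may predate draft-model support.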

Parameter Explanation

-m

-m gemma-4-12b-it-qat-GGUF.gguf

This is the main model. It verifies every drafted token and determines the final output.

assistant-MTP must match the main model. Do not casually pair an assistant model with a main model of another size or version, or you may get no speed benefit at best, and load failures or abnormal output at worst.

-md

-md gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf

-md loads the draft model, which here is the assistant-MTP model.

You can think of it as a small helper that predicts candidate answers. It guesses the next few tokens first, then the main model decides whether to accept them.

If your llama.cpp version does not recognize -md, try:

--model-draft gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf

--draft-max

--draft-max 2

This controls the maximum number of tokens the draft model predicts at once.

Start with 2 rather than a high value. More draft tokens do not necessarily mean faster generation: if the miss rate rises, the main model rejects candidates frequently and the drafting work is wasted.

You can test like this:

--draft-max 1

--draft-max 2

--draft-max 4

Watch tokens/s and output quality, then decide which value to keep.
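
To keep the comparison fair, only --draft-max should change between runs. The following is a minimal dry-run sketch under that assumption: sweep_draft_max is a hypothetical helper that prints one command per value without running anything, using the model file names and flag spellings from this article. Adjust the flags if your llama.cpp build names them differently.

```shell
# Dry-run sweep: print one llama-cli command per --draft-max value.
# MAIN and DRAFT are the file names used in this article; adjust them to your paths.
MAIN="gemma-4-12b-it-qat-GGUF.gguf"
DRAFT="gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf"

sweep_draft_max() {
  for n in "$@"; do
    # Same model pair, prompt, and token budget each time; only --draft-max changes.
    printf './llama-cli -m %s -md %s --draft-max %s -ngl 60 -n 256 --temp 0.7 -p "%s"\n' \
      "$MAIN" "$DRAFT" "$n" "Explain quantum computing in three paragraphs."
  done
}

sweep_draft_max 1 2 4
```

Review the printed commands first. Once they look right, piping the output to sh (for example, sweep_draft_max 1 2 4 | sh) runs them one after another so you can compare the reported tokens/s.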

-ngl

-ngl 99

-ngl sets how many model layers are offloaded to the GPU. A value of 99 is larger than the model's layer count, so it effectively asks for every layer. On 12GB VRAM, if the quantized model is small enough, most or even all layers may fit on the GPU.

But 8GB VRAM users should not copy this blindly. MTP loads an additional assistant model, so VRAM pressure is higher than running only the main model.

If you hit OOM, lower it in this order:

-ngl 80

-ngl 60

-ngl 40

The stable value depends on model quantization, context length, remaining GPU memory, and desktop/system usage.
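
The step-down order above can be automated. This is a sketch under two assumptions that are not confirmed by this article: that llama-cli exits with a non-zero status when loading fails (for example on OOM), and that it exits on its own after generating. first_working_ngl and run_once are hypothetical names.

```shell
# Try -ngl values in the given order and print the first one whose run succeeds.
# $1 is the name of a function that takes an -ngl value and returns 0 on success.
first_working_ngl() {
  runner=$1
  shift
  for ngl in "$@"; do
    # Send the run's own output to stderr so stdout carries only the result.
    if "$runner" "$ngl" >&2; then
      printf '%s\n' "$ngl"
      return 0
    fi
  done
  return 1
}

# Hypothetical runner: a short generation with the article's model pair.
run_once() {
  ./llama-cli \
    -m gemma-4-12b-it-qat-GGUF.gguf \
    -md gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf \
    --draft-max 2 -ngl "$1" -n 32 -p "Hello"
}

# Usage: first_working_ngl run_once 99 80 60 40
```

If your build drops into an interactive chat instead of exiting, check --help for the option that disables it before using a loop like this.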

-p

-p "<|think|>\nWrite a short essay about quantum computing."

-p passes the prompt directly.

Whether <|think|> is needed depends on the chat template and model card of the current GGUF model. It is not a universal switch for all Gemma 4 models. For speed testing, start with a simpler prompt:

-p "Write a short essay about quantum computing."

First confirm that MTP itself runs, then discuss templates and special tokens.

A More Stable Test Command

For the first test, use more conservative parameters:

./llama-cli \
  -m gemma-4-12b-it-qat-GGUF.gguf \
  -md gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf \
  --draft-max 2 \
  -ngl 60 \
  -c 8192 \
  -n 512 \
  --temp 0.7 \
  -p "Explain quantum computing in three paragraphs."

If it runs stably, gradually increase -ngl:

-ngl 80

Then try:

-ngl 99

Do not max out -ngl on the first run. MTP adds a draft model, so VRAM headroom matters more than in normal inference.
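
Before raising -ngl, it helps to know how much VRAM is actually free. The sketch below assumes an NVIDIA GPU with nvidia-smi available; vram_free_mib is a hypothetical helper that turns "used, total" pairs into free MiB per GPU.

```shell
# Read "used, total" MiB pairs (one GPU per line) on stdin and print free MiB per GPU.
vram_free_mib() {
  awk -F', *' 'NF >= 2 { print $2 - $1 }'
}

# Usage on a machine with an NVIDIA driver installed:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_free_mib
```

Run it once with nothing loaded and once while the main model alone is running. The difference shows roughly how much room is left for the draft model and the KV cache.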

Why 120 tokens/s May Not Reproduce

120 tokens/s is tempting, but it depends on many conditions.

| Factor | Explanation |
| --- | --- |
| GPU | 12GB cards such as RTX 4070 can usually run higher -ngl than 8GB cards |
| Quantization | QAT / Q4 / Q8 draft model combinations affect VRAM and speed |
| Draft hit rate | The more accurately the draft predicts, the more tokens the main model accepts at once |
| Prompt type | Structured text, code, and fixed formats usually accelerate more easily |
| temperature | The more random the output, the harder the draft is to predict |
| Context length | Longer context increases KV cache pressure |
| llama.cpp version | MTP support is still evolving, so parameters and performance may change |

So it is better to treat this as a speed target you can try to reach, not a promised value.
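
To get a feel for how hit rate and draft length interact, here is a back-of-the-envelope estimate. It is an illustrative model rather than a measurement, and every input is an assumption: alpha is a per-token acceptance probability treated as independent across positions, k is the --draft-max value, and c is the cost of one draft step relative to one main-model pass. spec_speedup is a hypothetical helper name.

```shell
# Rough speculative-decoding speedup estimate (an illustrative model, not a measurement).
# alpha: assumed per-token acceptance probability (0..1)
# k:     drafted tokens per round (the --draft-max value)
# c:     assumed cost of one draft step relative to one main-model pass
spec_speedup() {
  awk -v a="$1" -v k="$2" -v c="$3" 'BEGIN {
    # Expected tokens produced per main-model verification pass.
    tokens = (a >= 1) ? k + 1 : (1 - a ^ (k + 1)) / (1 - a)
    # Each round costs one main-model pass plus k draft steps.
    printf "%.2f\n", tokens / (1 + k * c)
  }'
}

spec_speedup 0.8 2 0.1   # high hit rate, short draft
spec_speedup 0.4 4 0.1   # low hit rate, long draft
```

Under these assumptions the first case lands near a 2x speedup and the second barely above 1x. Reaching 120 tokens/s at roughly 2x would require a baseline of around 60 tokens/s without the draft model, which is why the figure depends so heavily on hardware and prompt type.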

Prompts Suitable For Benchmarking

MTP usually shows its value most clearly in structured, low-randomness outputs. For benchmarking, do not only ask the model to write free-form prose. Try these:

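Write a Python function that converts a Markdown table to CSV. Output only the code.

Fix the JSON below and output only valid JSON:
{"name":"demo","items":[{"id":1,"tags":["a","b",],},]}

Output 10 Linux troubleshooting steps in a fixed format, each containing: problem, command, and pass/fail criterion.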

If tokens/s improves noticeably on these tasks and the output structure does not degrade, assistant-MTP is valuable on your machine.
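
One simple way to quantify "noticeably" is to run the same prompt twice with identical flags, once without -md and once with it, and compare the reported tokens/s. The helper below is a small sketch for that comparison; gain_pct is a hypothetical name and the numbers in the example call are made up.

```shell
# Percent change in tokens/s from a baseline run to an MTP run.
# Both arguments are tokens/s figures you read from your own runs.
gain_pct() {
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.1f\n", (mtp / base - 1) * 100 }'
}

gain_pct 60 90   # made-up numbers, for illustration only
```

A negative result means the draft model is slowing generation down on that prompt, which is a signal to lower --draft-max or skip MTP for that workload.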

Common Issues

Adding -md Causes OOM

This is expected. The assistant-MTP model also consumes VRAM or RAM on top of the main model.

First lower -ngl:

-ngl 60

Then lower context:

-c 4096

If it is still unstable, switch to a smaller quantization or skip MTP for now.

Parameter Not Recognized

This usually means your llama.cpp version uses different option names from the commands in this article. Check the help output first:

./llama-cli --help

Look for options whose names contain:

draft

spec

If the current version has no MTP / draft support, update llama.cpp.

Output Looks Strange

First remove <|think|> and test with a normal prompt. Then lower temperature:

--temp 0.4

Then reduce the draft count to:

--draft-max 1

If this restores normal output, the previous template, sampling settings, or draft parameters were too aggressive.

Summary

The high-speed Gemma 4 assistant-MTP setup is essentially speculative decoding with a main model plus a draft model. -md loads the draft model, --draft-max controls how many tokens are drafted at once, and -ngl determines how many layers are offloaded to the GPU.

12GB VRAM machines can try to push higher speed, and 120 tokens/s can be used as a tuning target. 8GB VRAM machines should be more conservative because the draft model consumes extra resources.

The most stable approach is to run first, then accelerate: start with lower -ngl, shorter context, and a small draft count, confirm stability, then gradually increase.
