If the main model, assistant draft model, and inference framework are correctly paired, MTP can noticeably speed up Gemma 4 on a local GPU. On some 12GB VRAM cards, such as the RTX 4070, suitable quantization and parameters may get you close to 120 tokens/s.
But this is not a number you can guarantee by copying a command. It is better treated as a tuning target: the model must run, VRAM must be sufficient, draft hit rate must be high, and sampling parameters must be stable before the speed looks impressive.
What MTP Does Here
MTP means Multi-Token Prediction.
A normal autoregressive model generates one token at a time. With assistant-MTP, a small draft model first proposes several future tokens, and the main model then verifies them in a single parallel pass. If the draft is correct, the main model accepts multiple tokens at once, which cuts the token-by-token waiting.
This mechanism is often called:
- speculative decoding
- draft model acceleration
- draft model / drafter
Its goal is acceleration, not improving model capability. Whether a token is accepted is still decided by the main model.
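To get a rough feel for why the acceptance rate matters, the sketch below estimates how many tokens one verification pass of the main model yields. It assumes every drafted token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the draft model. The names `expected_tokens`, `a`, and `k` are illustrative and are not llama.cpp options.

```shell
# Expected tokens per main-model verification pass when k tokens are drafted
# and each one is accepted independently with probability a (an assumption).
# Formula: (1 - a^(k+1)) / (1 - a). The "+1" is the token the main model
# always contributes itself, even when every drafted token is rejected.
expected_tokens() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a >= 1) { printf "%.2f\n", k + 1; exit }
    printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a)
  }'
}

expected_tokens 0.8 2   # high acceptance rate
expected_tokens 0.5 2   # low acceptance rate
```

Under these assumptions, drafting 2 tokens yields about 2.44 tokens per pass at an 80% acceptance rate but only 1.75 at 50%, so a poorly matched draft model gives up most of the benefit.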
Command-Line Example
A more advanced llama-cli reference command combines the following settings:
- Use gemma-4-12b-it-qat-GGUF.gguf as the main model.
- Use gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf as the draft model.
- Let the draft model predict at most 2 tokens per round.
- Offload as many model layers to the GPU as possible.
- Pass a prompt directly to test generation speed.
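Put together, these settings correspond to a command of roughly the following shape. Treat it as a sketch rather than an exact command: the prompt text is a placeholder, the model files are assumed to sit in the current directory, and the flag spellings assume a build that accepts -md and --draft-max.

```shell
MAIN_MODEL="gemma-4-12b-it-qat-GGUF.gguf"
DRAFT_MODEL="gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf"
PROMPT="Write a Python function that parses one CSV line."  # placeholder prompt

# Main model + draft model, at most 2 drafted tokens per round,
# as many layers on the GPU as will fit.
run_mtp() {
  llama-cli -m "$MAIN_MODEL" -md "$DRAFT_MODEL" --draft-max 2 -ngl 99 -p "$PROMPT"
}

# Run only if llama-cli is installed and both model files are present.
if command -v llama-cli >/dev/null 2>&1 && [ -f "$MAIN_MODEL" ] && [ -f "$DRAFT_MODEL" ]; then
  run_mtp
else
  echo "llama-cli or model files not found; skipping run"
fi
```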
Note: parameter names may differ across llama.cpp versions. Some versions use -md, while others prefer --model-draft; some use --draft-max, while others use --spec-draft-n-max. Before testing, check the help output of your build for the draft-related options.
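One way to check is to filter the help output down to the lines that mention draft options. This is a minimal sketch; `draft_options` is a helper name made up for this example, and it simply matches the word "draft", which covers all four spellings mentioned above.

```shell
# Keep only the help lines that mention draft-related options.
draft_options() {
  grep -i -e 'draft'
}

# Inspect the local build, if one is installed.
if command -v llama-cli >/dev/null 2>&1; then
  llama-cli --help 2>&1 | draft_options || echo "no draft options found in this build"
fi
```

If nothing is listed, the build most likely predates draft-model support.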
Parameter Explanation
-m
This is the main model. The final output is verified and decided by it.
The assistant-MTP model must match the main model. Pairing an assistant model with a main model of another size or version gives you no speed benefit at best, and load failures or abnormal output at worst.
-md
-md loads the draft model, which here is the assistant-MTP model.
You can think of it as a small helper that predicts candidate answers. It guesses the next few tokens first, then the main model decides whether to accept them.
If your llama.cpp version does not recognize -md, try the long form --model-draft instead.
--draft-max
This controls the maximum number of tokens the draft model predicts at once.
Start with 2 instead of setting it very high immediately. More draft tokens does not necessarily mean faster generation; if the miss rate rises, the main model will reject candidates frequently and waste compute.
Test a few small values in turn with the same prompt. Watch tokens/s and output quality, then decide which value to keep.
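A sweep like this can be scripted. The sketch below is an illustration under assumptions: the candidate values, the prompt, and the fixed -n 256 / --temp 0 settings are choices made for this example. Some builds open an interactive chat after the prompt, in which case you will need that build's option for a single non-interactive turn.

```shell
MAIN_MODEL="gemma-4-12b-it-qat-GGUF.gguf"
DRAFT_MODEL="gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf"
PROMPT="List the numbers 1 to 50 as a JSON array."  # placeholder prompt
DRAFT_VALUES="1 2 3 4"                              # illustrative candidates

for n in $DRAFT_VALUES; do
  echo "=== --draft-max $n ==="
  # Same prompt, same length, deterministic sampling, only the draft count changes.
  if command -v llama-cli >/dev/null 2>&1 && [ -f "$MAIN_MODEL" ] && [ -f "$DRAFT_MODEL" ]; then
    llama-cli -m "$MAIN_MODEL" -md "$DRAFT_MODEL" --draft-max "$n" \
      -ngl 99 -n 256 --temp 0 -p "$PROMPT"
  fi
done
```

Keeping everything except the draft count fixed makes the tokens/s figures comparable between runs.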
-ngl 99
This parameter tries to offload model layers to the GPU. On 12GB VRAM, if the quantized model is small enough, most or even all layers may fit on the GPU.
But 8GB VRAM users should not copy this blindly. MTP loads an additional assistant model, so VRAM pressure is higher than running only the main model.
If you hit OOM, lower -ngl step by step until the run is stable. The stable value depends on model quantization, context length, remaining GPU memory, and desktop/system usage.
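The step-down search can be sketched as a small loop. `try_ngl` and `first_stable_ngl` are hypothetical helpers written for this example, and the sketch assumes that llama-cli exits with a non-zero status when the models do not fit, which may not hold for every failure mode.

```shell
MAIN_MODEL="gemma-4-12b-it-qat-GGUF.gguf"
DRAFT_MODEL="gemma-4-12b-it-qat-assistant-MTP-Q8_0-GGUF.gguf"

# Hypothetical hook: one short run with the given -ngl value.
# Returns non-zero if the run fails (for example on OOM).
try_ngl() {
  llama-cli -m "$MAIN_MODEL" -md "$DRAFT_MODEL" --draft-max 2 \
    -ngl "$1" -n 64 -p "Hello" >/dev/null 2>&1
}

# Walk the candidates from high to low and print the first one that works.
first_stable_ngl() {
  for ngl in "$@"; do
    if try_ngl "$ngl"; then
      echo "$ngl"
      return 0
    fi
  done
  return 1
}

# Example call (values are illustrative):
# first_stable_ngl 99 48 32 24 16
```

Once a value works, leave some headroom below it for longer contexts rather than running right at the limit.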
-p
-p passes the prompt directly.
Whether <|think|> is needed depends on the chat template and model card of the current GGUF model. It is not a universal switch for all Gemma 4 models. For speed testing, start with a simpler prompt that contains no special tokens. First confirm that MTP itself runs, then deal with templates and special tokens.
A More Stable Test Command
For the first test, use more conservative parameters: a lower -ngl, a shorter context, and a small draft count. If it runs stably, raise -ngl in steps and retest after each increase.
Do not max out -ngl on the first run. MTP adds a draft model, so VRAM headroom matters more than in normal inference.
Why 120 tokens/s May Not Reproduce
120 tokens/s is tempting, but it depends on many conditions.
| Factor | Explanation |
|---|---|
| GPU | 12GB cards such as RTX 4070 can usually run higher -ngl than 8GB cards |
| Quantization | QAT / Q4 / Q8 draft model combinations affect VRAM and speed |
| Draft hit rate | The more accurately the draft predicts, the more tokens the main model accepts at once |
| Prompt type | Structured text, code, and fixed formats usually accelerate more easily |
| temperature | The more random the output, the harder the draft is to predict |
| Context length | Longer context increases KV cache pressure |
| llama.cpp version | MTP support is still evolving, so parameters and performance may change |
So it is better to treat this as a speed target you can try to reach, not a promised value.
Prompts Suitable For Benchmarking
MTP usually shows its value most clearly in structured, low-randomness outputs. For benchmarking, do not only ask the model to write free-form prose; also try tasks with predictable structure, such as code or fixed-format output.
If tokens/s improves noticeably on these tasks and the output structure does not degrade, assistant-MTP is valuable on your machine.
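"Improves" only means something against a baseline, so run the same prompt once without the draft model and once with it. The sketch below shows the two runs as comments and a small helper for the ratio. `speedup` is an illustrative name, and the two numbers passed to it are hypothetical readings, not measurements.

```shell
# Same prompt, same sampling settings, with and without the draft model:
#   llama-cli -m "$MAIN_MODEL" -ngl 99 -n 256 --temp 0 -p "$PROMPT"
#   llama-cli -m "$MAIN_MODEL" -md "$DRAFT_MODEL" --draft-max 2 \
#     -ngl 99 -n 256 --temp 0 -p "$PROMPT"

# Relative speedup of the MTP run over the baseline, from two tokens/s
# figures read off the timing output by hand.
speedup() {
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2fx\n", mtp / base }'
}

speedup 60 90   # hypothetical readings: baseline 60 tokens/s, MTP 90 tokens/s
```

A ratio close to 1.00x means the draft model is costing about as much as it saves on that prompt type.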
Common Issues
Adding -md Causes OOM
That is normal. The assistant-MTP model also consumes VRAM or RAM.
First lower -ngl, then shorten the context. If it is still unstable, switch to a smaller quantization or skip MTP for now.
Parameter Not Recognized
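This means your llama.cpp version does not match the command in this article. Check the help output first, focusing on the options for the draft model and the draft count (for example -md / --model-draft and --draft-max / --spec-draft-n-max).
If the current version has no MTP / draft support, update llama.cpp.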
Output Looks Strange
First remove <|think|> and test with a normal prompt. Then lower the temperature, and after that reduce the draft count.
If this restores normal output, the previous template, sampling settings, or draft parameters were too aggressive.
Summary
The high-speed Gemma 4 assistant-MTP setup is essentially speculative decoding with a main model plus a draft model. -md mounts the draft model, --draft-max controls how many tokens are drafted at once, and -ngl determines GPU offload.
12GB VRAM machines can try to push higher speed, and 120 tokens/s can be used as a tuning target. 8GB VRAM machines should be more conservative because the draft model consumes extra resources.
The most stable approach is to run first, then accelerate: start with lower -ngl, shorter context, and a small draft count, confirm stability, then gradually increase.