Deploying DiffusionGemma Locally: Running Google’s Text Diffusion Model with vLLM

A practical guide to deploying and using DiffusionGemma locally: starting an OpenAI-compatible service with vLLM, testing it with curl, understanding diffusion parameters, hardware requirements, and deployment boundaries.

The previous article explained why DiffusionGemma is worth watching: instead of generating token by token like a traditional autoregressive model, it uses text diffusion with parallel denoising over a 256-token canvas, which makes it better suited to low-latency local interaction, inline editing, and code completion.

This article focuses on the practical question: how to deploy it and run it from the command line.

The main official path today is vLLM. DiffusionGemma can be started through vLLM’s OpenAI-compatible local server, then queried with an interface similar to OpenAI Chat Completions.

Before you start

First decide whether DiffusionGemma is worth trying on your machine.

| Item | Recommendation |
| --- | --- |
| Model | google/diffusiongemma-26B-A4B-it |
| GPU | Prefer an NVIDIA discrete GPU |
| VRAM | Google notes quantized deployments can fit within about 18 GB VRAM on high-end consumer GPUs |
| Best scenarios | Local, low-concurrency, low-latency, interactive generation |
| Poor scenarios | High-QPS cloud serving, quality-first long-form generation |
| Serving framework | vLLM |
| API shape | OpenAI-compatible local server |

DiffusionGemma is an MoE model with 26B total parameters, of which 3.8B are active during inference. It is not a small model: MoE, quantization, and parallel generation only lower the local deployment threshold to the point where high-end consumer GPUs become a realistic option to experiment with.

If you only want stable long-form writing, knowledge Q&A, or production APIs, standard Gemma 4 is still safer. DiffusionGemma is more suitable for trying low-latency editors, code infilling, and instant structured-text repair.

Option 1: start directly with vLLM

The core command from the official developer guide is:

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill

This command pulls google/diffusiongemma-26B-A4B-it from Hugging Face and starts a local OpenAI-compatible server. By default, the service usually listens on http://localhost:8000.

If your Hugging Face environment needs authentication, run:

huggingface-cli login

If vLLM is not installed, you can start with a Python virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -U vllm

Whether pip install -U vllm is enough depends on whether the current vLLM release already includes DiffusionGemma support. DiffusionGemma is a new architecture. If you see unknown model structures, unrecognized parameters, or attention backend errors, check the latest vLLM release, the Google developer guide, and the model card first.

Option 2: run vLLM with Docker

If you do not want to modify the local Python environment, you can use a vLLM Docker image. vLLM recipes have used commands similar to this:

docker run -itd --name diffusiongemma \
  --ipc=host \
  --network host \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma \
    --model google/diffusiongemma-26B-A4B-it \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --generation-config vllm \
    --enable-chunked-prefill \
    --host 0.0.0.0 \
    --port 8000

This keeps the environment cleaner and is useful on servers, workstations, or temporary test machines. Two things matter:

  • The host must already have the NVIDIA driver and NVIDIA Container Toolkit installed.
  • If the vLLM version inside the image does not include DiffusionGemma support, startup will still fail. Use a newer image or the appropriate branch.

The cache mount -v ~/.cache/huggingface:/root/.cache/huggingface is useful if you want to reuse the host Hugging Face cache and avoid downloading the model every time.

Test the service with curl

After the service starts, check the model list:

curl http://localhost:8000/v1/models

If the response includes google/diffusiongemma-26B-A4B-it, the service is basically up.
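
The same check can be scripted, which helps because a model of this size may take a while to load before the endpoint answers. The sketch below uses only the Python standard library. `fetch_models`, `model_is_served`, and `wait_for_model` are hypothetical helper names, and the code assumes the server returns the usual OpenAI-style list shape (`{"data": [{"id": ...}]}`).

```python
import json
import time
import urllib.request

MODEL_ID = "google/diffusiongemma-26B-A4B-it"


def fetch_models(base_url="http://localhost:8000/v1"):
    """GET /v1/models and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return json.load(resp)


def model_is_served(payload, model_id=MODEL_ID):
    """Return True if model_id appears in an OpenAI-style model list."""
    return any(entry.get("id") == model_id for entry in payload.get("data", []))


def wait_for_model(fetch=fetch_models, model_id=MODEL_ID, attempts=30, delay=10.0):
    """Poll until the model is listed; return False if attempts run out."""
    for attempt in range(attempts):
        try:
            if model_is_served(fetch(), model_id):
                return True
        except OSError:
            # Connection refused or timeout (URLError is an OSError subclass):
            # the server is probably still loading the model.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_model()` with the defaults polls the local server for up to about five minutes. The `fetch` argument exists so the logic can be exercised without a running server.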

Then test Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/diffusiongemma-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": "Explain the difference between DiffusionGemma and ordinary autoregressive LLMs in three sentences."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

If you prefer the OpenAI SDK, point base_url to the local service:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that converts a Markdown table to CSV.",
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)

Understanding the parameters

Several parameters in the official command deserve a closer look.

--max-model-len 262144

This sets the maximum context length. DiffusionGemma / Gemma 4 supports very long context, but that does not mean you should always run with the limit set to the maximum.

Longer context increases VRAM and scheduling pressure.

For a first local test, you can keep the official value. If VRAM is tight, reduce it and check whether your actual task is affected.

--max-num-seqs 4

This limits the number of sequences processed at the same time. DiffusionGemma is better suited to local, low-concurrency interaction. Higher concurrency is not necessarily faster and may increase VRAM pressure.

For a single-user local tool, try values between 1 and 4. Multi-user serving needs more serious benchmarking.

--gpu-memory-utilization 0.85

This tells vLLM what fraction of GPU memory it may use. 0.85 is a fairly conservative value, slightly below vLLM’s default of 0.9.

If startup OOMs, try:

--gpu-memory-utilization 0.75

If VRAM is plentiful, you can raise it a little, but do not max it out at the beginning. Leave room for the system and other processes.
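
To make the trade-off concrete, here is the arithmetic behind the setting as a small sketch. The 24 GiB card is a hypothetical example, and `vram_budget` is an illustrative helper, not part of vLLM. It only splits total memory into the share vLLM may claim and the remainder left for everything else.

```python
def vram_budget(total_gib, utilization):
    """Split total GPU memory into vLLM's share and the leftover headroom."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    vllm_share = total_gib * utilization
    return vllm_share, total_gib - vllm_share


for utilization in (0.75, 0.85):
    share, headroom = vram_budget(24, utilization)  # hypothetical 24 GiB GPU
    print(f"{utilization:.2f}: vLLM may use {share:.1f} GiB, {headroom:.1f} GiB left")
```

On that hypothetical card, dropping from 0.85 to 0.75 moves the headroom from about 3.6 GiB to 6 GiB, which is often the difference between a desktop session fitting alongside the server or not.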

--attention-backend TRITON_ATTN

This selects the attention backend. The official command uses TRITON_ATTN, which is related to DiffusionGemma’s special attention and denoising path.

If the backend is unsupported, the issue is often a mismatch among vLLM, CUDA, Triton, and GPU architecture. Do not randomly change model parameters first; check the software stack.

--hf-overrides

This part of the official command is important:

--hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}'

It overrides diffusion sampler settings in the Hugging Face config. entropy_bound can be understood as a strategy controlling denoising or sampling behavior, used together with DiffusionGemma’s iterative generation.

This is not a normal LLM parameter. Start with the official value, confirm it runs, then experiment.

--diffusion-config

The official command uses:

--diffusion-config '{"canvas_length": 256}'

canvas_length corresponds to DiffusionGemma’s 256-token canvas. The model does not generate one token at a time linearly; it denoises in parallel inside a block. This value is directly tied to the block diffusion generation mechanism.

Do not change it casually at first. Use the official value to verify speed, quality, and VRAM usage, then test according to later vLLM documentation.
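
As a rough way to reason about this value, the sketch below counts how many canvases a response of a given length would span. It assumes, as a simplification that the official documentation may not match exactly, that output is produced as consecutive blocks of `canvas_length` tokens. `canvases_needed` is a hypothetical helper.

```python
import math


def canvases_needed(max_tokens, canvas_length=256):
    """Number of canvas_length-sized blocks needed to cover max_tokens."""
    if max_tokens <= 0 or canvas_length <= 0:
        raise ValueError("max_tokens and canvas_length must be positive")
    return math.ceil(max_tokens / canvas_length)


# The request sizes used in this article's examples.
for max_tokens in (256, 512):
    print(max_tokens, "->", canvases_needed(max_tokens), "canvas(es)")
```

Under that assumption, a 256-token request fits in a single canvas, while a 512-token request spans two.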

--enable-chunked-prefill

This enables chunked prefill. DiffusionGemma’s long-sequence processing coordinates prefill and denoising, and chunked prefill can help scheduling in long-context scenarios.

If you only test short prompts, you may not feel much difference. It matters more for long context.

A more conservative local test command

If you only want to see whether it can start, lower concurrency and VRAM pressure:

vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 65536 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.75 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill

This command is not necessarily optimal for performance, but it is better for first-time debugging. Get the model running first, then gradually increase context length and concurrency.
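
If you expect to switch between the conservative and official values often, it can help to generate the command instead of editing it by hand. The following is a minimal sketch: `build_serve_command` is a hypothetical helper that only reuses the flags shown in this article, and whether your vLLM build accepts them still depends on its DiffusionGemma support.

```python
import json
import shlex


def build_serve_command(max_model_len=65536, max_num_seqs=1,
                        gpu_memory_utilization=0.75):
    """Assemble the `vllm serve` command from this article as one shell string."""
    hf_overrides = {"diffusion_sampler": "entropy_bound",
                    "diffusion_entropy_bound": 0.1}
    args = [
        "vllm", "serve", "google/diffusiongemma-26B-A4B-it",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--attention-backend", "TRITON_ATTN",
        "--generation-config", "vllm",
        # json.dumps guarantees valid JSON; shlex.join adds the shell quoting.
        "--hf-overrides", json.dumps(hf_overrides),
        "--diffusion-config", json.dumps({"canvas_length": 256}),
        "--enable-chunked-prefill",
    ]
    return shlex.join(args)


print(build_serve_command())
```

Calling `build_serve_command(max_model_len=262144, max_num_seqs=4, gpu_memory_utilization=0.85)` reproduces the official values once the conservative run works.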

Good demos to try

DiffusionGemma should not be tested only with ordinary chat questions. Its real value is in nonlinear generation and real-time local repair.

Try prompts like these:

Code infilling:

Complete the missing logic in this Python function. Output only the full function:

def markdown_table_to_csv(markdown: str) -> str:
    ...

JSON repair:

Fix the following JSON so it becomes valid JSON, while preserving the original field meanings:

{"name":"demo","items":[{"id":1,"tags":["a","b",],},]}

Markdown table completion:

Complete the following Markdown table to 5 rows and make sure every row has the same number of columns:

| Parameter | Purpose | Recommendation |
| --- | --- | --- |

Inline rewrite:

You are an inline completion model inside an editor. Rewrite only the text inside brackets and keep the surrounding context coherent:

DiffusionGemma is suitable for [write a short phrase about low-latency interaction here], but not for quality-first long-form writing.

These tasks reveal its bidirectional attention, block-level self-repair, and structured output capabilities better than “tell me a story.”
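
Structured tasks also have the advantage that the output can be checked mechanically rather than by eye. The sketch below shows two such checks for the JSON and Markdown table prompts. Both helper names are hypothetical, the table check is a naive split that does not handle escaped pipes, and the `repaired` string is one possible repair written by hand, not actual model output.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def table_rows_consistent(markdown):
    """Return True if every non-empty table row has the same number of cells."""
    rows = [line.strip() for line in markdown.splitlines() if line.strip()]
    # Drop the outer pipes, then count cells by splitting on the inner ones.
    widths = {len(row.strip("|").split("|")) for row in rows}
    return len(widths) == 1


broken = '{"name":"demo","items":[{"id":1,"tags":["a","b",],},]}'
repaired = '{"name":"demo","items":[{"id":1,"tags":["a","b"]}]}'  # hand-written
print(is_valid_json(broken), is_valid_json(repaired))
```

The final line prints `False True`: the trailing commas make the original input invalid, and the hand-written repair parses cleanly.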

Common issues

The model is unsupported at startup

First check the vLLM version. DiffusionGemma is new, and older vLLM versions may not include the implementation.

Check:

vllm --version

Then compare against the official developer guide, vLLM release notes, or the DiffusionGemma model card.
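
When vLLM lives in a virtual environment or a notebook kernel, the `vllm` executable on your PATH may not be the one your code imports. A small sketch using the standard library's `importlib.metadata` reports what the current interpreter actually sees. `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("vllm:", installed_version("vllm") or "not installed")
```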

Hugging Face download fails

Check network and login status:

huggingface-cli whoami

Log in again if needed:

huggingface-cli login

On a server, consider pre-downloading the model or mounting the Hugging Face cache into the container.

OOM

Reduce pressure in this order:

  1. --max-num-seqs 1
  2. --gpu-memory-utilization 0.75
  3. --max-model-len 65536

If it still OOMs, check whether quantized weights are being used, whether vLLM correctly loads the quantization format, and whether the GPU meets the model requirements.
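
The reduction order can be written down as a small retry ladder, so each attempt changes exactly one setting and you can tell which change made the difference. This is a sketch under stated assumptions: `oom_fallback_steps` is a hypothetical helper, and the keys are plain dictionary names chosen here to mirror the CLI flags, not a vLLM API.

```python
def oom_fallback_steps(config):
    """Yield progressively more conservative configs, one change at a time."""
    reductions = [
        ("max_num_seqs", 1),
        ("gpu_memory_utilization", 0.75),
        ("max_model_len", 65536),
    ]
    current = dict(config)
    for key, value in reductions:
        # Only lower a setting; never raise one that is already smaller.
        if current.get(key, float("inf")) > value:
            current = {**current, key: value}
            yield key, current


official = {"max_num_seqs": 4, "gpu_memory_utilization": 0.85,
            "max_model_len": 262144}
for changed, cfg in oom_fallback_steps(official):
    print(f"retry after lowering {changed}: {cfg}")
```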

Speed is not as high as expected

First confirm that your scenario is actually in DiffusionGemma’s advantage zone. Its speedup mainly targets local, low-concurrency, dedicated GPU, low-to-medium batch workloads.

In high-concurrency cloud serving, autoregressive models can use batching to saturate hardware, reducing DiffusionGemma’s advantage. Unified-memory systems such as Apple Silicon may also not show the same speedup.
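
Before drawing conclusions about speed, it is worth measuring your own workload rather than relying on impressions. The following is a minimal sketch of an end-to-end measurement. `timed_call` and `tokens_per_second` are hypothetical helpers, and the figure includes prompt processing time, so it is a coarse number and not a pure decode-speed benchmark.

```python
import time


def timed_call(send_request):
    """Run send_request() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = send_request()
    return result, time.perf_counter() - start


def tokens_per_second(completion_tokens, elapsed_seconds):
    """End-to-end throughput for one request, including prompt processing."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


# Usage with the OpenAI client shown earlier (assumes the server is running
# and that the response carries a `usage` field):
# response, elapsed = timed_call(lambda: client.chat.completions.create(...))
# print(tokens_per_second(response.usage.completion_tokens, elapsed))
```

Run the same prompt several times and compare medians, since the first request after startup is often slower than later ones.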

Output quality is worse than Gemma 4

That is expected. Google explicitly states that because DiffusionGemma prioritizes speed and parallel layout generation, overall output quality is lower than standard Gemma 4. Quality-first production applications should still use standard Gemma 4.

Minimal validation flow

You can follow this sequence:

  1. Log in to Hugging Face.
huggingface-cli login
  2. Start the vLLM service.
vllm serve google/diffusiongemma-26B-A4B-it \
  --max-model-len 65536 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.75 \
  --attention-backend TRITON_ATTN \
  --generation-config vllm \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --diffusion-config '{"canvas_length": 256}' \
  --enable-chunked-prefill
  3. Check the model list.
curl http://localhost:8000/v1/models
  4. Send one request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/diffusiongemma-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": "Complete a Python function: input a Markdown table string and output a CSV string."
      }
    ],
    "max_tokens": 512,
    "temperature": 0.4
  }'
  5. Then test structured repair or code infilling.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/diffusiongemma-26B-A4B-it",
    "messages": [
      {
        "role": "user",
        "content": "Fix this JSON and output only valid JSON: {\"name\":\"demo\",\"items\":[{\"id\":1,\"tags\":[\"a\",\"b\",],},]}"
      }
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'

Once these five steps work, tune context length, concurrency, and GPU memory utilization upward.

Summary

Deploying DiffusionGemma is not about finding a generic chat-model replacement. It is about validating a new local interaction route: start an OpenAI-compatible service with vLLM, then experiment around inline editing, code infilling, structured text repair, and low-latency output.

For a first deployment, use conservative parameters: --max-num-seqs 1, --max-model-len 65536, and --gpu-memory-utilization 0.75. After it runs, return to the official configuration and gradually test speed, VRAM, and output quality.
