Running Gemma 4 12B on 8GB VRAM: Tuning llama-cli Hybrid Offload Parameters

A guide to llama-cli parameters for running Gemma 4 12B GGUF on an 8GB VRAM machine: use GPU layer offload, Flash Attention, 8K context, mlock, and CPU thread control to stay stable when VRAM is tight.

Running Gemma 4 12B with 8GB VRAM is not mainly a disk-space problem. The real pressure is runtime VRAM.

With a GGUF quantized build such as Q4_K_M, the model file itself may already be close to 8GB. Once it runs, you also need KV cache, temporary compute buffers, desktop usage, and driver overhead. The result is familiar: the model looks like it should “almost fit,” but long context quickly causes OOM.

If the machine has only 8GB of VRAM, do not force everything onto the GPU. Use hybrid offload between VRAM and system memory: put as many layers as fit on the GPU, and let the CPU run the rest from RAM.

Assume the model path is:

./models/gemma-4-12b-it-Q4_K_M.gguf

On Linux / macOS, create run_gemma4.sh:

#!/usr/bin/env bash
set -e

MODEL_PATH="./models/gemma-4-12b-it-Q4_K_M.gguf"

./llama-cli \
  -m "$MODEL_PATH" \
  -ngl 26 \
  -c 8192 \
  -t 8 \
  --flash-attn \
  --mlock \
  -n -1 \
  --color \
  -i

Make it executable:

chmod +x run_gemma4.sh

Run it:

./run_gemma4.sh

On Windows, create run_gemma4.bat:

@echo off
set MODEL_PATH=.\models\gemma-4-12b-it-Q4_K_M.gguf

llama-cli.exe ^
  -m "%MODEL_PATH%" ^
  -ngl 26 ^
  -c 8192 ^
  -t 8 ^
  --flash-attn ^
  -n -1 ^
  --color ^
  -i

On Windows, do not enable --mlock by default. Behavior varies by version, permissions, and system configuration. Getting the model running first matters more. On Linux, if system memory is plentiful, --mlock is worth trying.

Why hybrid offload matters

-ngl is the most important parameter in this setup. It controls how many layers are offloaded to the GPU.

-ngl 26

With 8GB VRAM, the goal is not to fit the entire model on the GPU, but to leave enough room for KV cache and runtime buffers. -ngl 26 is a reasonable starting point: some layers go to VRAM, and system memory handles the rest.
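A rough way to sanity-check a starting value is to divide the VRAM left after a reserve by the average size of one layer. The sketch below assumes all layers are the same size and ignores embedding and output tensors. The function name `estimate_ngl`, the 4096 MiB reserve, the 7300 MiB file size, and the 48-layer count are illustrative placeholders, not measured values for Gemma 4 12B.

```shell
#!/usr/bin/env bash
# Rough starting point for -ngl. Every input is a placeholder: read the real
# layer count from the llama-cli startup log and the size from the GGUF file.
estimate_ngl() {
  local vram_mib="$1"     # total VRAM in MiB, e.g. 8192
  local reserve_mib="$2"  # assumed reserve: KV cache, buffers, desktop
  local model_mib="$3"    # GGUF file size in MiB
  local n_layers="$4"     # assumed number of offloadable layers
  local budget=$(( vram_mib - reserve_mib ))
  if [ "$budget" -le 0 ]; then
    echo 0
    return
  fi
  # Average layer size is model_mib / n_layers; integer division rounds down.
  local ngl=$(( budget * n_layers / model_mib ))
  if [ "$ngl" -gt "$n_layers" ]; then
    ngl="$n_layers"
  fi
  echo "$ngl"
}

estimate_ngl 8192 4096 7300 48
```

Treat the result as a first guess only. Actual VRAM usage during generation decides whether to move up or down.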

Tune it like this:

| Symptom | Adjustment |
| --- | --- |
| OOM at startup or crash during generation | Lower -ngl from 26 to 22 or 20 |
| GPU memory usage is only around 6GB | Raise -ngl from 26 to 28 or 30 |
| Stable but slow | Use a lower-bit quantization, or raise -ngl |
| OOM on long context | Lower -c first, then lower -ngl |
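The first two rows of the table can be written as a small decision helper. This is a minimal sketch: `suggest_ngl` is a hypothetical name, and the step sizes (4 down, 2 up) and the 6144 MiB threshold are taken loosely from the table rather than tuned.

```shell
#!/usr/bin/env bash
# suggest_ngl CURRENT_NGL OOM USED_MIB
#   OOM       1 if the last run ran out of memory, otherwise 0
#   USED_MIB  VRAM in use during generation, in MiB
suggest_ngl() {
  local ngl="$1" oom="$2" used_mib="$3"
  if [ "$oom" -eq 1 ]; then
    echo $(( ngl - 4 ))      # OOM: step down
  elif [ "$used_mib" -le 6144 ]; then
    echo $(( ngl + 2 ))      # about 6GB used or less: step up
  else
    echo "$ngl"              # near the limit and stable: keep
  fi
}

suggest_ngl 26 1 7900
suggest_ngl 26 0 6000
suggest_ngl 26 0 7400
```

On NVIDIA GPUs, `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` prints the used VRAM in MiB, which can be passed as the third argument. Other backends need their own tool.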

On 8GB VRAM, do not look only at model file size. What matters is whether model layers, KV cache, VRAM fragmentation, and desktop usage all fit with margin.

--flash-attn: reduce attention memory pressure

--flash-attn

This flag matters most when VRAM is small. Flash Attention avoids materializing the full attention score matrix, which lowers the memory needed for attention buffers and usually makes long-context inference more efficient.

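If your llama.cpp build, GPU backend, or GPU architecture does not support Flash Attention, startup may fail. In that case, remove --flash-attn first to confirm the model runs, then update llama.cpp or check CUDA / Metal / Vulkan backend support. The flag's syntax has also changed across llama.cpp versions: some newer builds expect an explicit value such as --flash-attn on, so run llama-cli --help to see the form your build accepts.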
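One way to check what a build accepts before editing the script is to search its help output for the flag. The helper below is a sketch: `has_flag` is a hypothetical name, and the sample help line is made up for the demonstration. A flag listed in the help text does not guarantee that the GPU backend supports it at runtime.

```shell
#!/usr/bin/env bash
# has_flag HELP_TEXT FLAG: succeeds if FLAG appears literally in HELP_TEXT.
has_flag() {
  # -F treats the flag as a fixed string; -e keeps the leading dashes
  # from being parsed as grep options.
  printf '%s\n' "$1" | grep -qF -e "$2"
}

# Real usage would capture the help text from the binary:
#   help_text=$(./llama-cli --help 2>&1)
# The line below is an invented sample so the sketch runs without llama-cli.
sample_help='-fa, --flash-attn    Flash Attention (sample line, not real output)'

if has_flag "$sample_help" "--flash-attn"; then
  echo "flag listed"
else
  echo "flag not listed"
fi
```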

For 8GB VRAM, enable it if you can. If you cannot, lower context length first.

-c 8192: start with 8K context

-c 8192

Longer context means a larger KV cache. Many models advertise long-context support, but a small-VRAM machine should not start at the maximum.

On 8GB VRAM, 8192 is a balanced starting point. It is enough for everyday chat, code snippets, and medium-length documents, while avoiding the VRAM pressure of 32K or 64K.
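To see how quickly the KV cache grows, the usual upper-bound estimate is 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below assumes full attention in every layer and a 16-bit cache. The architecture numbers (48 layers, 8 KV heads, head dimension 256) are placeholders, not confirmed values for Gemma 4 12B, and models that use sliding-window attention in some layers need less than this.

```shell
#!/usr/bin/env bash
# kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM N_CTX BYTES_PER_ELEM
# Upper-bound KV cache size in MiB, assuming full attention in every layer.
kv_cache_mib() {
  local n_layers="$1" n_kv_heads="$2" head_dim="$3" n_ctx="$4" bytes="$5"
  # The leading 2 accounts for storing both keys and values.
  echo $(( 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes / 1024 / 1024 ))
}

# Placeholder architecture values; read the real ones from the model metadata.
for ctx in 4096 8192 32768; do
  echo "ctx=$ctx -> $(kv_cache_mib 48 8 256 "$ctx" 2) MiB"
done
```

With these placeholder values, the estimate doubles each time the context doubles, which is why lowering -c is the first thing to try on OOM.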

If it still OOMs, lower it:

-c 4096

If you switch to a smaller quantized model and have clear VRAM headroom, try:

-c 12288

Do not chase maximum context on the first run. Get stable first, then expand.

--mlock: reduce memory swapping

--mlock

This flag asks the operating system to keep the model resident in physical memory so it is not paged out to disk-backed swap or the pagefile. Use it only when system memory is plentiful.

In hybrid offload mode, some layers remain in RAM. If those pages are swapped out, generation can slow sharply or stutter. --mlock reduces that risk.

Two caveats:

  • On Linux, you may need to adjust ulimit -l or permissions.
  • On Windows, it is not always worth enabling by default. Getting the model running first matters more.

If --mlock prevents startup, remove it. It is a stability and speed optimization, not a requirement.
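On Linux, a quick pre-check is to compare the locked-memory limit with the model file size. The sketch below assumes bash, where `ulimit -l` prints the limit in KiB or the word `unlimited`. `mlock_ok` is a hypothetical helper, and comparing against the whole file is deliberately conservative.

```shell
#!/usr/bin/env bash
# mlock_ok LIMIT_KIB MODEL_BYTES
# Succeeds if the locked-memory limit could cover a file of MODEL_BYTES.
mlock_ok() {
  local limit_kib="$1" model_bytes="$2"
  if [ "$limit_kib" = "unlimited" ]; then
    return 0
  fi
  [ $(( limit_kib * 1024 )) -ge $(( model_bytes )) ]
}

# Real usage on Linux (MODEL_PATH as defined in the run script):
#   mlock_ok "$(ulimit -l)" "$(wc -c < "$MODEL_PATH")"
# Demonstration with made-up numbers: a 64 MiB limit against a ~7.6 GB file.
if mlock_ok 65536 7654321000; then
  echo "limit is high enough for --mlock"
else
  echo "limit too low: raise ulimit -l or drop --mlock"
fi
```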

-t 8: do not blindly max CPU threads

-t 8

-t controls CPU thread count. In hybrid offload mode, layers not on the GPU need CPU participation, so thread count affects speed.

Use the number of physical CPU cores, not logical threads:

| CPU | Recommendation |
| --- | --- |
| 6 cores / 12 threads | -t 6 |
| 8 cores / 16 threads | -t 8 |
| 12 cores / 24 threads | -t 10 or -t 12 |

More threads are not always better. Too many can hurt scheduling, memory bandwidth, and desktop responsiveness. Start with physical core count, then tune with actual tokens/s.
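To find the physical core count on Linux, one common approach is to count unique core and socket pairs in the output of `lscpu -p=Core,Socket`. The sketch below wraps that in a hypothetical helper, `count_physical_cores`, which reads the listing from standard input so it can be tried with sample data.

```shell
#!/usr/bin/env bash
# Counts unique "core,socket" pairs. Lines starting with # are header comments.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Real usage on Linux:
#   lscpu -p=Core,Socket | count_physical_cores
# On macOS, `sysctl -n hw.physicalcpu` reports the same figure directly.
# Sample input for a 2-core / 4-thread CPU:
printf '# Core,Socket\n0,0\n0,0\n1,0\n1,0\n' | count_physical_cores
```

On CPUs that mix performance and efficiency cores, the physical count includes both kinds, so measured tokens/s remains the final judge.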

About -p "<|think|>\n"

The original script included:

-i -p "<|think|>\n"

Use this carefully. Different models, GGUF conversions, and templates handle thinking markers differently. Forcing <|think|> into the prompt does not reliably enable “deep thinking” and may pollute output format.

A safer first step is only interactive mode:

-i

If the current Gemma 4 GGUF model card says a specific system prompt or special token is required, add it according to the model card. Do not treat one marker as a universal switch.

Conservative first-run version

If you are worried about stability on 8GB VRAM, start with a more conservative script:

#!/usr/bin/env bash
set -e

MODEL_PATH="./models/gemma-4-12b-it-Q4_K_M.gguf"

./llama-cli \
  -m "$MODEL_PATH" \
  -ngl 20 \
  -c 4096 \
  -t 8 \
  --flash-attn \
  --mlock \
  -n 512 \
  --color \
  -i

This version gives up context length and GPU layers in exchange for a more reliable start. Once it is stable, move back toward:

-ngl 26

and:

-c 8192

For speed, try a smaller quantization first

If Q4_K_M can only offload around twenty layers on 8GB VRAM, speed will be limited by CPU and memory bandwidth. The most direct speed improvement is using a smaller quantized build.

Try:

| Quantization | Characteristics |
| --- | --- |
| Q4_K_M | More stable quality, higher VRAM pressure |
| Q3_K_L | Smaller, may offload more layers |
| Q3_K_M | Saves more VRAM, quality drops further |

With Q3_K_M or Q3_K_L, try:

-ngl 34

or even:

-ngl 38

If most layers fit on the GPU, speed can improve a lot. But lower quantization may reduce output quality. Compare with the same prompts, not only tokens/s.

Memory bandwidth also matters

Hybrid offload is not free. Layers outside VRAM run through CPU and system memory, so speed depends heavily on memory bandwidth.

Check:

  • Whether system memory is dual-channel.
  • Whether DDR5 has XMP / EXPO enabled.
  • Whether background programs are consuming memory bandwidth.
  • Whether a laptop is in high-performance power mode.

If memory is single-channel, hybrid offload speed can be noticeably worse. For an 8GB VRAM setup, sufficient system memory capacity is only the first step. Bandwidth matters too.
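A back-of-the-envelope ceiling shows why channel count matters. During generation, each token requires reading roughly all CPU-resident weights once, so tokens/s cannot exceed memory bandwidth divided by the size of those weights. The sketch below uses theoretical peak bandwidth for DDR5-5600 (44.8 GB/s per channel) and a placeholder of 3300 MB of weights left in RAM. It ignores GPU time and real-world efficiency, so actual speed will be lower.

```shell
#!/usr/bin/env bash
# decode_ceiling_tps BANDWIDTH_MB_S CPU_WEIGHTS_MB
# Upper bound on tokens/s for the CPU-resident part of the model.
decode_ceiling_tps() {
  local bandwidth_mb_s="$1" cpu_weights_mb="$2"
  echo $(( bandwidth_mb_s / cpu_weights_mb ))   # integer division rounds down
}

# 3300 MB is an assumed placeholder for the weights that stay in system RAM.
echo "single-channel DDR5-5600: $(decode_ceiling_tps 44800 3300) tokens/s ceiling"
echo "dual-channel DDR5-5600:   $(decode_ceiling_tps 89600 3300) tokens/s ceiling"
```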

Troubleshooting order

When OOM happens, do not change everything at once. Use this order:

  1. Lower context:
     -c 4096
  2. Lower GPU offload layers:
     -ngl 22
     Then:
     -ngl 20
  3. Remove --mlock.
  4. If --flash-attn fails, remove it first to confirm whether the backend is the issue.
  5. Switch to a lower-bit quantized model.

Change one parameter at a time and record tokens/s, VRAM usage, and whether OOM occurs. That is how you find the real bottleneck.
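Keeping those records in a small CSV file makes runs easy to compare. The helper below is a minimal sketch: `log_run` and the column names are hypothetical, and the values in the demonstration are made up rather than measured.

```shell
#!/usr/bin/env bash
# log_run LOG_FILE NGL CTX TOKENS_PER_S VRAM_MIB OOM
# Appends one row per test run, writing a header first if the file is empty.
log_run() {
  local log_file="$1" ngl="$2" ctx="$3" tps="$4" vram_mib="$5" oom="$6"
  if [ ! -s "$log_file" ]; then
    echo "ngl,ctx,tokens_per_s,vram_mib,oom" > "$log_file"
  fi
  echo "$ngl,$ctx,$tps,$vram_mib,$oom" >> "$log_file"
}

# Demonstration with invented numbers, one changed parameter per row.
demo_log=$(mktemp)
log_run "$demo_log" 26 8192 9.5 7400 no
log_run "$demo_log" 26 4096 9.8 6300 no
cat "$demo_log"
rm -f "$demo_log"
```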

Tuning table

| Goal | Parameters |
| --- | --- |
| Most stable startup | -ngl 20 -c 4096 -n 512 |
| Daily balance | -ngl 26 -c 8192 -n -1 |
| Higher speed | Use Q3_K_M, then try -ngl 34 or higher |
| Longer context | Keep --flash-attn, then increase from -c 8192 gradually |
| Avoid memory swapping | Try --mlock on Linux |

The worst approach on 8GB VRAM is trying to max everything at once. Start conservatively, then push -ngl and -c upward step by step.
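The first two rows of the tuning table can be kept as named presets so that switching between them is a one-word change. This is a sketch only: `preset_args` and the preset names `stable` and `daily` are hypothetical.

```shell
#!/usr/bin/env bash
# preset_args NAME: prints the llama-cli arguments for a named preset.
preset_args() {
  case "$1" in
    stable) echo "-ngl 20 -c 4096 -n 512" ;;   # most stable startup
    daily)  echo "-ngl 26 -c 8192 -n -1" ;;    # daily balance
    *)      echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Real usage (left unquoted on purpose so the arguments are split into words):
#   ./llama-cli -m "$MODEL_PATH" $(preset_args daily) -t 8 -i
preset_args stable
preset_args daily
```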

Summary

Running Gemma 4 12B Q4_K_M on 8GB VRAM is mainly about hybrid offload. Start with -ngl 26, -c 8192, --flash-attn, --mlock, and -t 8. If it OOMs, lower context first, then reduce GPU layers.

If you want speed, switching to Q3_K_M or Q3_K_L is often more effective than forcing Q4_K_M. System memory can absorb part of the hybrid-offload pressure, but real responsiveness depends on GPU offload layers, KV cache size, and memory bandwidth.
