Running Gemma 4 12B with 8GB VRAM is not mainly a disk-space problem. The real pressure is runtime VRAM.
With a GGUF quantized build such as Q4_K_M, the model file itself may already be close to 8GB. Once it runs, you also need KV cache, temporary compute buffers, desktop usage, and driver overhead. The result is familiar: the model looks like it should “almost fit,” but long context quickly causes OOM.
If the machine has only 8GB of VRAM, the better strategy is not to force everything onto the GPU. Use hybrid offload between VRAM and system memory: put as many layers as possible on the GPU, and leave the rest in RAM for the CPU to process.
Recommended script
Assume the model path is:
|
|
On Linux / macOS, create run_gemma4.sh:
|
|
Make it executable:
|
|
Run it:
|
|
On Windows, create run_gemma4.bat:
|
|
On Windows, do not enable --mlock by default. Behavior varies by version, permissions, and system configuration. Getting the model running first matters more. On Linux, if system memory is plentiful, --mlock is worth trying.
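As a rough sketch of the command these scripts are meant to launch, the snippet below assembles the flags discussed in this article into an argument list. The binary name `llama-cli`, the helper `build_command`, and the model path are assumptions for illustration; substitute your own values.

```python
import shlex

# Hypothetical placeholder: point this at your own GGUF file.
MODEL_PATH = "/path/to/model.gguf"


def build_command(model_path, ngl=26, ctx=8192, threads=8,
                  flash_attn=True, mlock=False, binary="llama-cli"):
    """Return an argument list for a hybrid-offload llama.cpp run."""
    cmd = [
        binary,
        "-m", model_path,
        "-ngl", str(ngl),    # layers offloaded to the GPU
        "-c", str(ctx),      # context length in tokens
        "-t", str(threads),  # CPU threads for layers left in RAM
    ]
    if flash_attn:
        # Some recent llama.cpp builds may expect a value after this flag;
        # check the --help output of your build.
        cmd.append("--flash-attn")
    if mlock:
        cmd.append("--mlock")
    return cmd


# Linux with plentiful RAM: try --mlock. Windows: leave it off at first.
linux_cmd = build_command(MODEL_PATH, mlock=True)
windows_cmd = build_command(MODEL_PATH, mlock=False)
print(shlex.join(linux_cmd))
```

Building the arguments in one place makes it easier to change a single parameter between runs and keep a record of what was tried.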
Why hybrid offload matters
-ngl is the most important parameter in this setup. It controls how many layers are offloaded to the GPU.
-ngl 26
With 8GB VRAM, the goal is not to fit the entire model on the GPU, but to leave enough room for KV cache and runtime buffers. -ngl 26 is a reasonable starting point: some layers go to VRAM, and system memory handles the rest.
Tune it like this:
| Symptom | Adjustment |
|---|---|
| OOM at startup or crash during generation | Lower -ngl 26 to 22 or 20 |
| GPU memory usage is only around 6GB | Raise -ngl 26 to 28 or 30 |
| Stable but slow | Use a lower-bit quantization, or raise -ngl |
| OOM on long context | Lower -c first, then lower -ngl |
On 8GB VRAM, do not look only at model file size. What matters is whether model layers, KV cache, VRAM fragmentation, and desktop usage all fit with margin.
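The budgeting idea can be sketched as simple arithmetic. This is only an estimate under stated assumptions: layers are treated as equally sized, the KV cache is assumed to split evenly across layers, and `reserve_gib` is a hypothetical margin for compute buffers, the desktop, and the driver. The numbers in the example call are illustrative, not measured values for Gemma 4.

```python
def max_gpu_layers(vram_gib, model_gib, kv_cache_gib, n_layers, reserve_gib=1.5):
    """Estimate how many layers fit in VRAM after setting aside a reserve.

    Each offloaded layer is charged its share of the weights plus its
    share of the KV cache. Real layers are not perfectly uniform, so
    treat the result as a starting point for -ngl, not a guarantee.
    """
    per_layer_gib = (model_gib + kv_cache_gib) / n_layers
    usable_gib = vram_gib - reserve_gib
    return max(0, min(n_layers, int(usable_gib / per_layer_gib)))


# Illustrative numbers only: an 8 GiB card, 8 GiB of weights,
# 2 GiB of KV cache, 40 layers, and a 2 GiB reserve.
print(max_gpu_layers(8.0, 8.0, 2.0, 40, reserve_gib=2.0))
```

If the estimate and reality disagree, trust the observed VRAM usage and adjust -ngl from there.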
--flash-attn: recommended for 8GB VRAM
--flash-attn
This flag matters most on small-VRAM cards. Flash Attention computes attention without materializing the full attention score matrix, which reduces attention memory pressure and improves long-context inference efficiency.
If your llama.cpp build, GPU backend, or GPU architecture does not support Flash Attention, startup may fail. In that case, remove --flash-attn first to confirm the model runs, then update llama.cpp or check CUDA / Metal / Vulkan backend support.
For 8GB VRAM, enable it if you can. If you cannot, lower context length first.
-c 8192: start with 8K context
-c 8192
A longer context means a larger KV cache. Many models advertise long-context support, but small-VRAM machines should not start at the maximum context length.
On 8GB VRAM, 8192 is a balanced starting point. It is enough for everyday chat, code snippets, and medium-length documents, while avoiding the VRAM pressure of 32K or 64K.
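To see why context length dominates, here is a minimal sketch of the standard KV cache size formula for full attention. The layer count, KV head count, and head dimension in the example are hypothetical placeholders, not Gemma 4's actual architecture; read the real values from the model card or the GGUF metadata. Models that use sliding-window attention on some layers, or a quantized KV cache, will need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of an unquantized KV cache for full attention.

    Two tensors (K and V) per layer, each holding
    n_ctx * n_kv_heads * head_dim elements; 2 bytes per element for f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"-c {ctx}: about {gib:.2f} GiB of KV cache")
```

The cache grows linearly with -c, so going from 8K to 32K multiplies it by four under these assumptions.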
If it still OOMs, lower it:
-c 4096
If you switch to a smaller quantized model and have clear VRAM headroom, try:
|
|
Do not chase maximum context on the first run. Get stable first, then expand.
--mlock: reduce memory swapping
--mlock
If system memory is relatively plentiful, this parameter tries to keep the model resident in physical memory so that it is not paged out to disk-backed swap or the pagefile.
In hybrid offload mode, some layers remain in RAM. If those pages are swapped out, generation can slow sharply or stutter. --mlock reduces that risk.
Two caveats:
- On Linux, you may need to adjust ulimit -l or permissions.
- On Windows, it is not always worth enabling by default. Getting the model running first matters more.
If --mlock prevents startup, remove it. It is a stability and speed optimization, not a requirement.
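On Linux or macOS, one way to check the locked-memory limit before trying --mlock is a short script like the one below. It is a sketch, not part of llama.cpp: the helper name and the 8 GiB model size are assumptions, and the model size is used as an upper bound on what needs to be locked.

```python
try:
    import resource  # available on Unix-like systems only
except ImportError:
    resource = None


def mlock_limit_allows(model_bytes, soft_limit, unlimited):
    """True if the soft limit is unlimited or at least model_bytes."""
    return soft_limit == unlimited or soft_limit >= model_bytes


# Assumed model size for illustration; use your GGUF file size.
MODEL_BYTES = 8 * 2**30

if resource is not None:
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    ok = mlock_limit_allows(MODEL_BYTES, soft, resource.RLIM_INFINITY)
    print("locked-memory limit is sufficient" if ok
          else "raise ulimit -l before using --mlock")
else:
    print("resource module unavailable; this check applies to Unix only")
```

If the limit is too low, either raise it for the shell that launches the model or simply run without --mlock.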
-t 8: do not blindly max CPU threads
-t 8
-t controls CPU thread count. In hybrid offload mode, layers not on the GPU need CPU participation, so thread count affects speed.
Use the number of physical CPU cores, not logical threads:
| CPU | Recommendation |
|---|---|
| 6 cores / 12 threads | -t 6 |
| 8 cores / 16 threads | -t 8 |
| 12 cores / 24 threads | -t 10 or -t 12 |
More threads are not always better. Too many can hurt scheduling, memory bandwidth, and desktop responsiveness. Start with physical core count, then tune with actual tokens/s.
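A small sketch for picking a starting value of -t. The standard library only reports logical CPUs, so this assumes two hardware threads per physical core (typical SMT). That assumption does not hold for CPUs without SMT or with mixed performance and efficiency cores, so verify the core count with your operating system's tools.

```python
import os


def suggest_threads(logical_cpus, threads_per_core=2):
    """Approximate the physical core count from the logical CPU count."""
    return max(1, logical_cpus // threads_per_core)


logical = os.cpu_count() or 1
print(f"{logical} logical CPUs, starting point: -t {suggest_threads(logical)}")
```

Treat the result as the first value to benchmark, then compare tokens/s one or two steps above and below it.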
About -p "<|think|>\n"
The original script included:
-p "<|think|>\n"
Use this carefully. Different models, GGUF conversions, and templates handle thinking markers differently. Forcing <|think|> into the prompt does not reliably enable “deep thinking” and may pollute output format.
A safer first step is to use interactive mode only:
|
|
If the current Gemma 4 GGUF model card says a specific system prompt or special token is required, add it according to the model card. Do not treat one marker as a universal switch.
Conservative first-run version
If you are worried 8GB VRAM is unstable, start with a more conservative script:
|
|
This version sacrifices context and GPU offload layers, but it is easier to start. Once stable, move back toward -ngl 26 and -c 8192.
For speed, try a smaller quantization first
If Q4_K_M can only offload around twenty layers on 8GB VRAM, speed will be limited by CPU and memory bandwidth. The most direct speed improvement is using a smaller quantized build.
Try:
| Quantization | Characteristic |
|---|---|
| Q4_K_M | More stable quality, higher VRAM pressure |
| Q3_K_L | Smaller, may offload more layers |
| Q3_K_M | Saves more VRAM, quality drops further |
With Q3_K_M or Q3_K_L, try:
|
|
or even:
|
|
If most layers fit on the GPU, speed can improve a lot. But lower quantization may reduce output quality. Compare with the same prompts, not only tokens/s.
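To get a feel for how much a lower-bit build saves, file size can be approximated as parameter count times average bits per weight. The bits-per-weight figures below are rough approximations and an assumption of this sketch; actual GGUF sizes depend on the tensor mix of each model, so check the real file sizes before deciding.

```python
# Rough average bits per weight; approximate values, assumed for illustration.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q3_K_L": 4.3,
    "Q3_K_M": 3.9,
}


def approx_file_gib(n_params, bits_per_weight):
    """Approximate model file size in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 12e9  # nominal 12B parameters

for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name}: roughly {approx_file_gib(N_PARAMS, bpw):.1f} GiB")
```

Every GiB saved on weights is VRAM that can go to more offloaded layers or a larger KV cache.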
Memory bandwidth also matters
Hybrid offload is not free. Layers outside VRAM run through CPU and system memory, so speed depends heavily on memory bandwidth.
Check:
- Whether system memory is dual-channel.
- Whether DDR5 has XMP / EXPO enabled.
- Whether background programs are consuming memory bandwidth.
- Whether a laptop is in high-performance power mode.
If memory is single-channel, hybrid offload speed can be noticeably worse. For an 8GB VRAM setup, sufficient system memory capacity is only the first step. Bandwidth matters too.
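The effect of memory channels can be estimated with back-of-the-envelope arithmetic. The sketch below computes theoretical peak bandwidth from the transfer rate and channel count, then derives a loose upper bound on tokens/s for the CPU-resident layers, assuming every CPU-side weight is read once per generated token. Real throughput will be lower, and the 3 GB CPU-resident figure is a hypothetical example.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s.

    A channel here means a 64-bit (8-byte) memory channel, so a
    typical dual-channel desktop has channels=2.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


def cpu_tokens_per_s_ceiling(bandwidth_gb_s, cpu_resident_gb):
    """Upper bound if CPU-side weights are read once per token."""
    return bandwidth_gb_s / cpu_resident_gb


CPU_RESIDENT_GB = 3.0  # hypothetical size of the layers left in RAM

for channels in (1, 2):
    bw = peak_bandwidth_gb_s(5600, channels)  # DDR5-5600 as an example
    ceiling = cpu_tokens_per_s_ceiling(bw, CPU_RESIDENT_GB)
    print(f"{channels} channel(s): {bw:.1f} GB/s, ceiling {ceiling:.1f} tokens/s")
```

Halving the channel count halves the ceiling, which is why a single-channel configuration shows up so clearly in hybrid offload.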
Troubleshooting order
When OOM happens, do not change everything at once. Use this order:
- Lower context, for example from -c 8192 to -c 4096.
- Lower GPU offload layers, for example to -ngl 22, then to -ngl 20.
- Remove --mlock.
- If --flash-attn fails, remove it first to confirm whether the backend is the issue.
- Switch to a lower-bit quantized model.
Change one parameter at a time and record tokens/s, VRAM usage, and whether OOM occurs. That is how you find the real bottleneck.
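The one-change-at-a-time order can be sketched as a generator that yields the next configuration to try. The helper name and the dictionary keys are hypothetical; the values 4096, 22, and 20 come from the tuning suggestions in this article. Switching to a lower-bit model is a file change rather than a flag, so it is left as the manual last resort.

```python
def fallback_steps(cfg):
    """Yield (label, config) pairs, changing one setting per step."""
    current = dict(cfg)
    if current["ctx"] > 4096:
        current = {**current, "ctx": 4096}
        yield "lower context", current
    for ngl in (22, 20):
        if current["ngl"] > ngl:
            current = {**current, "ngl": ngl}
            yield f"lower -ngl to {ngl}", current
    if current["mlock"]:
        current = {**current, "mlock": False}
        yield "remove --mlock", current
    if current["flash_attn"]:
        # Only relevant when the failure points at Flash Attention support.
        current = {**current, "flash_attn": False}
        yield "remove --flash-attn", current


start = {"ngl": 26, "ctx": 8192, "mlock": True, "flash_attn": True}
for label, cfg in fallback_steps(start):
    print(label, cfg)
```

Stop at the first step that runs stably, and note the tokens/s and VRAM usage for that configuration before tuning upward again.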
Tuning table
| Goal | Parameters |
|---|---|
| Most stable startup | -ngl 20 -c 4096 -n 512 |
| Daily balance | -ngl 26 -c 8192 -n -1 |
| Higher speed | Use Q3_K_M, then try -ngl 34 or higher |
| Longer context | Keep --flash-attn, then increase from -c 8192 gradually |
| Avoid memory swapping | Try --mlock on Linux |
The worst approach on 8GB VRAM is trying to max everything at once. Start conservatively, then push -ngl and -c upward step by step.
Summary
Running Gemma 4 12B Q4_K_M on 8GB VRAM is mainly about hybrid offload. Start with -ngl 26, -c 8192, --flash-attn, --mlock, and -t 8. If it OOMs, lower context first, then reduce GPU layers.
If you want speed, switching to Q3_K_M or Q3_K_L is often more effective than forcing Q4_K_M. System memory can absorb part of the hybrid-offload pressure, but real responsiveness depends on GPU offload layers, KV cache size, and memory bandwidth.