RTX 3060 12GB + Qwen3.6 35B: llama.cpp --n-cpu-moe Setup

I recently saw an interesting local LLM test: the same roughly 3,000 yuan old PC, with no hardware changes, became much more usable for a 35B MoE model after switching to a newer llama.cpp build and a better parameter set.

The test machine is not high-end:

Hardware	Configuration
CPU	AMD Ryzen 7 3700X
GPU	RTX 3060 12GB
Memory	32GB DDR4
System	Windows 11
Model	Qwen3.6-35B-A3B GGUF Q4_K_M

The conclusion is simple: the same hardware that previously could only barely run heavily quantized models can now reach a much more practical state with a newer llama.cpp, Q4 quantization, and a 64K context.

Quick Answer

An RTX 3060 12GB can experiment with Qwen3.6 35B-A3B GGUF when the model is quantized, llama.cpp is recent, and MoE experts are split between CPU and GPU with --n-cpu-moe. Start with --n-cpu-moe 32, then test nearby values instead of copying one command blindly.

This is not the same as running a dense 35B model comfortably. Expect tradeoffs in speed, RAM pressure, context length, and concurrency.

The Key Is MoE Scheduling, Not a New GPU

The most important parameter is:

1

--n-cpu-moe 32

Qwen3.6-35B-A3B is an MoE model, short for Mixture of Experts. Its total parameter count looks large, but each inference step does not activate every expert. Only part of the expert pool is used for each token.

That leaves room for local inference. Not everything has to fit into the GPU. The --n-cpu-moe parameter in llama.cpp lets you adjust how MoE expert layers are split between CPU and GPU, so consumer GPUs with limited VRAM can still run larger models.

On an RTX 3060 12GB, --n-cpu-moe 32 is a good value to try first. The GPU handles CUDA work it is good at, while the CPU shares part of the expert-layer load. This mixed scheduling can be faster than pushing everything to the GPU or leaning too much on the CPU.

How Much Faster Is It?

A typical comparison looks like this:

Item	Older Setup	Newer Setup
Generation speed	Around 15 tok/s	Around 33-36 tok/s
Quantization	Q2_K_M	Q4_K_M
Context	4K	64K
VRAM usage	Around 5GB	Around 7GB
Experience	Runs, but quality is unstable	Smoother, with clearly better answers

The important part is not just the speedup. Quantization quality and context length improve at the same time.

Many 12GB VRAM users previously had to use Q2 quantization just to launch 30B+ models. The model could start, but inference quality often suffered. Moving to Q4 means the local model shifts from “interesting to try” toward “useful for real work.”

A Windows Launch Command Template

Here is a Windows batch template. Replace the paths with your own:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


@echo off
chcp 65001 >nul

cd /d C:\Users\你的用户名\llama-b9297-bin-win-cuda-13.1-x64

llama-server.exe ^
 -m "D:\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
 -ngl 99 ^
 --n-cpu-moe 32 ^
 --flash-attn on ^
 --jinja ^
 -c 65536 ^
 -t 8 ^
 -b 512 ^
 -ub 128 ^
 --cache-type-k q4_0 ^
 --cache-type-v q4_0 ^
 -np 1 ^
 --cache-ram 0 ^
 --host 127.0.0.1 ^
 --port 8080

pause

Key parameters:

-ngl 99: offload as many eligible layers as possible to the GPU;
--n-cpu-moe 32: controls MoE expert-layer scheduling and is the main speedup knob here;
--flash-attn on: enables Flash Attention to reduce long-context pressure;
-c 65536: sets a 64K context;
--cache-type-k q4_0 / --cache-type-v q4_0: quantizes the KV cache to reduce long-context VRAM use;
-np 1: single concurrency, suitable for a 32GB RAM machine;
--cache-ram 0: disables prompt cache to keep memory usage under control.

One caveat: b9297 is just a test point. As of 2026-05-26, the llama.cpp releases page has already moved to newer versions, so in practice you can try a newer CUDA build first.

How to Tune Different GPUs

For this kind of MoE model, the idea is not “give up if VRAM is not enough.” It is to find the right CPU/GPU split.

Hardware	Suggestion
RTX 3060 12GB / 3080 10GB	Start with `--n-cpu-moe 32`
RTX 3070 8GB / 4060 8GB	Increase `--n-cpu-moe`, such as 128 or 256
RTX 3050 6GB / GTX 1650 4GB	Try more CPU offload, but expect much lower speed
Apple Silicon Mac	Use the Metal backend; unified memory is friendlier to large models

Do not treat these values as universal answers. The best --n-cpu-moe value depends on the model, quantization, GPU, CPU, and memory bandwidth. A safer approach is to test several points:

1

0 / 16 / 32 / 64 / 128 / 256

Compare tok/s, memory usage, first-token latency, and answer stability before settling on a final configuration.

Is 32GB RAM Enough?

The short answer: it can run, but there is not much headroom.

With this kind of setup, the llama-server working set can exceed 20GB. The OS also needs memory for the browser, editor, drivers, and background services. For single-user local use, 32GB is worth trying. For a long-running service or multiple concurrent calls, 64GB will feel much better.

Suggestions:

Test with single concurrency first;
Close unnecessary background programs;
Do not keep too many browser tabs open;
Confirm the CUDA backend loads correctly;
Do not start by pushing context to 128K.

Why This Matters

Local LLMs have long been surrounded by VRAM anxiety. Many people assume 35B-class models require 24GB VRAM, preferably an RTX 4090.

This test points to another direction: improvements in model architecture and inference frameworks can keep older hardware useful. MoE, KV cache quantization, Flash Attention, CUDA kernel optimization, and CPU/GPU mixed offload can together matter more than a simple GPU upgrade.

Of course, this is not magic. Running a 35B MoE model on 8GB or 12GB GPUs still requires tradeoffs in speed, context length, quantization quality, and memory usage. But for personal knowledge bases, coding assistants, long-document QA, or offline testing, this kind of setup is already worth experimenting with.

My Take

If you have an RTX 3060 12GB, RTX 3080 10GB, or even an 8GB GPU, it is worth taking another look at newer llama.cpp builds.

The point is not to copy one parameter blindly. The point is to understand the pattern:

An MoE model does not necessarily need every expert loaded into the GPU. A reasonable CPU/GPU split can matter more than the raw amount of VRAM.

Old PCs are not limited to tiny models forever. As inference frameworks and quantization methods keep improving, many local models once considered “too heavy” can become usable again.

References

llama.cpp Releases