I recently saw an interesting local LLM test: the same old PC, worth roughly 3,000 yuan and with no hardware changes, became much more usable for a 35B MoE model after switching to a newer llama.cpp build and a better parameter set.
The test machine is not high-end:
| Hardware | Configuration |
|---|---|
| CPU | AMD Ryzen 7 3700X |
| GPU | RTX 3060 12GB |
| Memory | 32GB DDR4 |
| System | Windows 11 |
| Model | Qwen3.6-35B-A3B GGUF Q4_K_M |
The conclusion is simple: the same hardware that previously could only barely run heavily quantized models can now reach a much more practical state with a newer llama.cpp, Q4 quantization, and a 64K context.
The Key Is MoE Scheduling, Not a New GPU
The most important parameter is --n-cpu-moe 32.
Qwen3.6-35B-A3B is an MoE (Mixture of Experts) model. Its total parameter count looks large at 35B, but each inference step does not activate every expert: only part of the expert pool is used for each token, which is what the A3B suffix (about 3B active parameters) refers to.
That leaves room for local inference. Not everything has to fit into the GPU. The --n-cpu-moe parameter in llama.cpp lets you adjust how MoE expert layers are split between CPU and GPU, so consumer GPUs with limited VRAM can still run larger models.
On an RTX 3060 12GB, --n-cpu-moe 32 is a good value to try first. The GPU handles CUDA work it is good at, while the CPU shares part of the expert-layer load. This mixed scheduling can be faster than pushing everything to the GPU or leaning too much on the CPU.
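To make the split concrete, here is a minimal Python sketch of how I understand the flag: llama.cpp's help text describes --n-cpu-moe N as keeping the MoE expert weights of the first N layers on the CPU. The function name and the 40-layer count are placeholders for illustration, not the real layer count of Qwen3.6-35B-A3B.

```python
def split_expert_layers(n_layers: int, n_cpu_moe: int) -> dict:
    """Model which layers keep their expert weights on the CPU vs the GPU.

    Assumption: --n-cpu-moe N pins the expert weights of the first N layers
    to the CPU, and values above the layer count simply cover every layer.
    """
    on_cpu = min(max(n_cpu_moe, 0), n_layers)
    return {
        "cpu_expert_layers": list(range(on_cpu)),
        "gpu_expert_layers": list(range(on_cpu, n_layers)),
    }


# Hypothetical 40-layer model, used only to show the arithmetic.
partial = split_expert_layers(n_layers=40, n_cpu_moe=32)
print(len(partial["cpu_expert_layers"]), len(partial["gpu_expert_layers"]))  # 32 8

everything = split_expert_layers(n_layers=40, n_cpu_moe=128)
print(len(everything["cpu_expert_layers"]), len(everything["gpu_expert_layers"]))  # 40 0
```

Under this reading, a value larger than the layer count just means "keep every expert layer on the CPU", while the attention and other non-expert weights can still go to the GPU through -ngl.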
How Much Faster Is It?
A typical comparison looks like this:
| Item | Older Setup | Newer Setup |
|---|---|---|
| Generation speed | Around 15 tok/s | Around 33-36 tok/s |
| Quantization | Q2_K_M | Q4_K_M |
| Context | 4K | 64K |
| VRAM usage | Around 5GB | Around 7GB |
| Experience | Runs, but quality is unstable | Smoother, with clearly better answers |
The important part is not just the speedup. Quantization quality and context length improve at the same time.
Many 12GB VRAM users previously had to use Q2 quantization just to launch 30B+ models. The model could start, but inference quality often suffered. Moving to Q4 means the local model shifts from “interesting to try” toward “useful for real work.”
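Part of the reason a 64K context becomes affordable is that the KV cache can be stored in a quantized format (the launch parameters later in this article use q4_0). The sketch below uses the standard KV cache size formula for a transformer with grouped-query attention. The layer count, KV head count, and head dimension are placeholders I picked for illustration, not the real Qwen3.6-35B-A3B configuration, and models with a different attention layout will not follow this formula exactly.

```python
# Bits stored per cached element. f16 uses 16 bits. llama.cpp's q8_0 and q4_0
# pack 32 values per block with a 2-byte scale: 34 and 18 bytes per block.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, cache_type: str) -> float:
    """Estimate KV cache size in GiB for one sequence."""
    # Factor 2 covers the separate K and V tensors.
    elements = 2 * n_ctx * n_layers * n_kv_heads * head_dim
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


# Placeholder architecture: 40 layers, 4 KV heads, head dimension 128.
for cache_type in ("f16", "q4_0"):
    size = kv_cache_gib(65536, 40, 4, 128, cache_type)
    print(f"{cache_type}: {size:.2f} GiB")
# f16: 5.00 GiB
# q4_0: 1.41 GiB
```

Whatever the real dimensions are, the ratio between f16 and q4_0 stays at 16 / 4.5, or about 3.6x, which is the headroom that makes long contexts fit on a 12GB card.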
A Windows Launch Command Template
The launch command is a Windows batch file that starts llama-server with the parameters below. Replace the paths with your own.
Key parameters:
- -ngl 99: offload as many eligible layers as possible to the GPU;
- --n-cpu-moe 32: controls MoE expert-layer scheduling and is the main speedup knob here;
- --flash-attn on: enables Flash Attention to reduce long-context pressure;
- -c 65536: sets a 64K context;
- --cache-type-k q4_0 / --cache-type-v q4_0: quantizes the KV cache to reduce long-context VRAM use;
- -np 1: single concurrency, suitable for a 32GB RAM machine;
- --cache-ram 0: disables prompt cache to keep memory usage under control.
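The parameters above can be assembled into a launch command. This is a sketch in Python rather than the original batch file: the two paths are placeholders, the function name is my own, and -m (llama.cpp's model-path flag) is added because the list above does not mention it. Every other flag is copied from the list.

```python
import subprocess


def build_llama_server_command(server_exe: str, model_path: str,
                               n_cpu_moe: int = 32) -> list[str]:
    """Build the llama-server argument list from the parameters in this article."""
    return [
        server_exe,
        "-m", model_path,               # model file (flag not in the list above)
        "-ngl", "99",                   # offload as many layers as possible
        "--n-cpu-moe", str(n_cpu_moe),  # expert-layer CPU/GPU split
        "--flash-attn", "on",
        "-c", "65536",                  # 64K context
        "--cache-type-k", "q4_0",
        "--cache-type-v", "q4_0",
        "-np", "1",                     # single concurrency
        "--cache-ram", "0",             # prompt cache disabled
    ]


# Placeholder paths: replace both with your own.
cmd = build_llama_server_command(r"C:\path\to\llama-server.exe",
                                 r"C:\path\to\model.gguf")
print(subprocess.list2cmdline(cmd))  # the command as one Windows-style line
# subprocess.run(cmd, check=True)    # uncomment to actually start the server
```

Keeping the arguments as a list makes it easy to change one value, such as n_cpu_moe, without retyping the whole command.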
One caveat: b9297, the build used for this test, is just a test point. As of 2026-05-26, the llama.cpp releases page has already moved to newer versions, so in practice you can try a newer CUDA build first.
How to Tune Different GPUs
For this kind of MoE model, the idea is not “give up if VRAM is not enough.” It is to find the right CPU/GPU split.
| Hardware | Suggestion |
|---|---|
| RTX 3060 12GB / 3080 10GB | Start with --n-cpu-moe 32 |
| RTX 3070 8GB / 4060 8GB | Increase --n-cpu-moe, for example to 128 or 256 |
| RTX 3050 6GB / GTX 1650 4GB | Try more CPU offload, but expect much lower speed |
| Apple Silicon Mac | Use the Metal backend; unified memory is friendlier to large models |
Do not treat these values as universal answers. The best --n-cpu-moe value depends on the model, quantization, GPU, CPU, and memory bandwidth. A safer approach is to test several values on either side of the suggested starting point.
Compare tok/s, memory usage, first-token latency, and answer stability before settling on a final configuration.
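Once a few values have been measured, the comparison can be done mechanically. The sketch below is one possible selection rule, not the original author's procedure: it drops runs that exceed a VRAM budget or a first-token latency limit, then keeps the fastest remaining one. The class name, function name, and every number in the example are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    n_cpu_moe: int
    tok_per_s: float
    vram_gb: float
    first_token_s: float


def pick_best(runs: list[Run], vram_budget_gb: float,
              max_first_token_s: float) -> Optional[Run]:
    """Return the fastest run that stays within both limits, or None."""
    usable = [r for r in runs
              if r.vram_gb <= vram_budget_gb
              and r.first_token_s <= max_first_token_s]
    return max(usable, key=lambda r: r.tok_per_s) if usable else None


# Made-up numbers purely to exercise the function. Record your own.
runs = [
    Run(n_cpu_moe=16, tok_per_s=30.0, vram_gb=11.8, first_token_s=2.0),
    Run(n_cpu_moe=24, tok_per_s=34.0, vram_gb=9.5, first_token_s=2.2),
    Run(n_cpu_moe=32, tok_per_s=33.0, vram_gb=7.0, first_token_s=2.5),
]
best = pick_best(runs, vram_budget_gb=11.0, max_first_token_s=3.0)
print(best.n_cpu_moe if best else "no usable run")  # 24
```

Answer stability is harder to reduce to a number, so it is still worth reading a few outputs from the top candidates before settling.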
Is 32GB RAM Enough?
The short answer: it can run, but there is not much headroom.
With this kind of setup, the llama-server working set can exceed 20GB. The OS also needs memory for the browser, editor, drivers, and background services. For single-user local use, 32GB is worth trying. For a long-running service or multiple concurrent calls, 64GB will feel much better.
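A rough estimate shows why the working set lands in that range. The sketch assumes about 4.85 bits per weight for Q4_K_M, which is an approximate figure of my own rather than a number from the test, and takes the 35B parameter count from the model name. It counts weights only, not the KV cache or runtime buffers.

```python
def weight_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Assumption: Q4_K_M averages roughly 4.85 bits per weight.
total = weight_size_gb(n_params_billion=35, bits_per_weight=4.85)
print(f"{total:.1f} GB")  # 21.2 GB
```

With roughly 21 GB of weights spread across 12GB of VRAM and 32GB of system RAM, whatever does not fit on the GPU has to share system memory with the OS and other programs, which is why the headroom is thin.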
Suggestions:
- Test with single concurrency first;
- Close unnecessary background programs;
- Do not keep too many browser tabs open;
- Confirm the CUDA backend loads correctly;
- Do not start by pushing context to 128K.
Why This Matters
Local LLMs have long been surrounded by VRAM anxiety. Many people assume 35B-class models require 24GB VRAM, preferably an RTX 4090.
This test points to another direction: improvements in model architecture and inference frameworks can keep older hardware useful. MoE, KV cache quantization, Flash Attention, CUDA kernel optimization, and CPU/GPU mixed offload can together matter more than a simple GPU upgrade.
Of course, this is not magic. Running a 35B MoE model on 8GB or 12GB GPUs still requires tradeoffs in speed, context length, quantization quality, and memory usage. But for personal knowledge bases, coding assistants, long-document QA, or offline testing, this kind of setup is already worth experimenting with.
My Take
If you have an RTX 3060 12GB, RTX 3080 10GB, or even an 8GB GPU, it is worth taking another look at newer llama.cpp builds.
The point is not to copy one parameter blindly. The point is to understand the pattern:
An MoE model does not necessarily need every expert loaded into the GPU. A reasonable CPU/GPU split can matter more than the raw amount of VRAM.
Old PCs are not limited to tiny models forever. As inference frameworks and quantization methods keep improving, many local models once considered “too heavy” can become usable again.