GTX 1060 Running Qwen 35B: Optimizing llama.cpp from 3 tok/s to 17 tok/s

A practical llama.cpp optimization guide for running Qwen 35B-class models on low-VRAM GPUs: why the default speed is slow, how to think about MoE offloading, memory bottlenecks, context length, stability parameters, and how to make a 6GB GTX 1060 more usable for local inference.

Can a 6GB GTX 1060 run a 35B-class large language model?

Under the usual assumptions, the first answer is probably “not really.” A 35B model is huge, 6GB of VRAM is tiny, and even a quantized model can easily run into slow generation, memory pressure, limited context length, or instability after a short run.

But if the model uses an MoE architecture, and you combine that with llama.cpp layer offloading, CPU memory, and parameter tuning, the answer becomes more interesting. It will not feel like a high-end GPU setup, but it can move from “barely runs” to “usable for local experiments.”

This guide is organized around practical tuning. The goal is not to mythologize the GTX 1060, but to explain where to look, what to tune, and how to identify bottlenecks when running Qwen 35B-like models on low-VRAM hardware.

Start with the conclusion

For a low-VRAM GPU running a 35B model, the key is not to force everything into VRAM. The goal is to let the GPU handle the parts that benefit most from acceleration.

The rough workflow is:

Get it running first
-> Understand why the default speed is slow
-> Tune GPU offloading
-> Use MoE characteristics to reduce unnecessary load
-> Fix memory and cache bottlenecks
-> Increase context length only after that
-> Finally handle stability

If you focus only on “how do I fit this into VRAM,” you can easily tune in the wrong direction. The more practical goal is to make VRAM, system memory, CPU, disk, and context cache work together.

Prepare the environment

This kind of setup works best if you have:

  • An NVIDIA GPU with around 6GB VRAM, such as a GTX 1060 6GB;
  • Enough system RAM, because low RAM makes swap and OOM problems much more likely;
  • A CUDA-enabled build of llama.cpp;
  • A quantized model file suitable for low-VRAM testing;
  • Realistic expectations about speed;
  • The ability to monitor VRAM, RAM, and process usage.

Start by checking the environment:

nvidia-smi
free -h
./llama-cli --help

If nvidia-smi cannot see the GPU, or your llama.cpp build has no CUDA support, parameter tuning will not deliver the result you want.
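The checks above can be wrapped in a small preflight script. This is a minimal sketch: the `LLAMA_CLI` variable and the function names are made up for illustration, and the binary path is assumed to be `./llama-cli` as in the commands in this guide.

```shell
#!/bin/sh
# Preflight sketch: report which tools used in this guide are available.
# LLAMA_CLI is a hypothetical variable; point it at your own build.
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"

# Return 0 if the named command can be found on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per item; it never aborts, so it is safe to run anywhere.
preflight() {
    for tool in nvidia-smi free; do
        if have_tool "$tool"; then
            echo "ok       $tool"
        else
            echo "missing  $tool"
        fi
    done
    if [ -x "$LLAMA_CLI" ]; then
        echo "ok       $LLAMA_CLI"
    else
        echo "missing  $LLAMA_CLI"
    fi
}

preflight
```

A "missing" line only tells you the tool was not found. It does not confirm that the llama.cpp binary was built with CUDA, so still read the startup log for a line that mentions your GPU.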

Step 1: Make the model run first

Do not chase 17 tok/s at the beginning. The first goal is simple: can the model load and produce output?

A basic command usually looks like this:

./llama-cli \
  -m /path/to/model.gguf \
  -p "Explain what an MoE model is in three sentences" \
  -n 128

If this fails, do not add GPU options yet. Check:

  • Whether the model path is correct;
  • Whether the quantization format is supported by your llama.cpp version;
  • Whether system RAM is sufficient;
  • Whether you downloaded the right model variant;
  • Whether the binary supports the model architecture.

Once the model can run, start optimizing speed.

Why the default speed may be only 3 tok/s

On low-VRAM hardware, slow defaults usually come from multiple bottlenecks at once.

Common cases include:

| Bottleneck | Symptom | Direction |
| --- | --- | --- |
| Too little GPU offload | GPU is idle, CPU is busy | Increase GPU offload within limits |
| Too much offload | VRAM overflows or errors appear | Reduce offloaded layers |
| Memory bandwidth limit | CPU is busy but token speed is low | Reduce overhead, try a better quantization |
| Context too large | Slow start or RAM spikes | Test with a smaller context first |
| Swap is active | The whole system feels stuck | Add RAM or lower parameters |
| Poor batch settings | Prompt processing is slow | Tune batch-related parameters |

Do not look only at the tok/s number. Keep these running while testing:

watch -n 1 nvidia-smi
htop

Watch VRAM usage, GPU utilization, CPU load, and system memory together.
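If you would rather log these numbers than watch them, `nvidia-smi` can print them as CSV. The sketch below relies on the standard `--query-gpu` and `--format` options. The function names are invented here, and the summary format is only one possible layout.

```shell
#!/bin/sh
# Take one sample as "used_MiB, total_MiB, util_pct" (needs an NVIDIA driver).
gpu_sample() {
    nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
               --format=csv,noheader,nounits
}

# Turn one CSV sample read from stdin into a short one-line summary.
summarize_gpu() {
    awk -F', *' '{
        printf "vram=%d%% (%d/%d MiB) gpu_util=%d%%\n", $1 * 100 / $2, $1, $2, $3
    }'
}

# Example with a hand-written sample, so it runs without a GPU:
echo '5800, 6144, 97' | summarize_gpu
```

On a machine with the driver installed, `while sleep 1; do gpu_sample | summarize_gpu; done >> gpu.log` keeps a record you can line up with the tok/s of each run.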

Step 2: Tune GPU offloading

The most common acceleration path in llama.cpp is offloading part of the model to the GPU.

A typical parameter is:

-ngl 20

Or in a fuller command:

./llama-cli \
  -m /path/to/model.gguf \
  -p "Write a local LLM tuning checklist" \
  -n 256 \
  -ngl 20

The value 20 is not universal. With a low-VRAM GPU, test gradually:

-ngl 10
-ngl 15
-ngl 20
-ngl 25

After each change, check three things:

  1. Whether the process starts normally;
  2. Whether VRAM is close to full;
  3. Whether tok/s actually improves.

If VRAM is already full, increasing -ngl will usually make things less stable, not faster.
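The gradual test can be scripted so that every value runs against the same prompt. This is a sketch under assumptions: `LLAMA_CLI`, `MODEL`, `PROMPT`, and `sweep_ngl` are names chosen for this example, and the flags are the ones already used in this guide.

```shell
#!/bin/sh
# Hypothetical defaults; override them in the environment for your setup.
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"
MODEL="${MODEL:-/path/to/model.gguf}"
PROMPT="${PROMPT:-Write a local LLM tuning checklist}"

# Run the same prompt once per -ngl value and stop at the first failure,
# which on a 6GB card is usually the point where VRAM runs out.
sweep_ngl() {
    for ngl in "$@"; do
        echo "=== -ngl $ngl ==="
        if ! "$LLAMA_CLI" -m "$MODEL" -p "$PROMPT" -n 128 -ngl "$ngl"; then
            echo "run failed at -ngl $ngl; the previous value is the safer limit"
            return 1
        fi
    done
}

# Usage: sweep_ngl 10 15 20 25
```

Some builds start an interactive chat instead of exiting after one completion. If yours does, check `--help` for the option that runs a single completion before using it in a loop.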

Step 3: Understand why MoE matters

MoE models are different from dense models.

The key idea is that the total parameter count is large, but not every expert is activated for every token. In other words, a 35B label does not always mean every token requires the full 35B worth of computation.

That is why low-VRAM GPUs can sometimes experiment with these models.

But two warnings matter:

  • MoE is not free magic; the model file is still large;
  • When VRAM is insufficient, CPU memory still carries a lot of data.

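So when tuning MoE models, the focus is to keep the tensors that every token uses, such as attention and other shared weights, on the GPU, and to leave the bulk of the expert weights in system RAM, where each token touches only a few of them. Spend VRAM only where it helps most.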
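Recent llama.cpp builds expose options for keeping expert weights on the CPU while the rest of the model goes to the GPU, but which ones you have depends on your version. The sketch below does not assume any of them exist. It only greps the help text of your own binary for a few candidate flag names. `LLAMA_CLI` and the function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical variable; point it at your own build.
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"

# Return 0 if the build's help text mentions the given flag literally.
supports_flag() {
    "$LLAMA_CLI" --help 2>&1 | grep -qF -- "$1"
}

# Report which MoE-related placement flags this build advertises.
# The candidate names are assumptions to verify, not a guaranteed list.
report_moe_flags() {
    for flag in --cpu-moe --n-cpu-moe --override-tensor; do
        if supports_flag "$flag"; then
            echo "available: $flag"
        else
            echo "not found: $flag"
        fi
    done
}

# Usage: report_moe_flags
```

If `--n-cpu-moe` is reported as available, a typical experiment is a high `-ngl` combined with `--n-cpu-moe N`, lowering N step by step until VRAM is nearly full. Read the flag's own help text before relying on that behavior.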

Step 4: Fix memory bottlenecks

Many people assume low-VRAM failure is only a VRAM problem. In practice, VRAM pressure, RAM pressure, and cache pressure often show up together.

If system RAM is close to full, or swap starts being used heavily, speed will drop sharply. Check with:

free -h

Or:

vmstat 1

Look for frequent swap activity.

Useful directions include:

  • Use a smaller quantized model;
  • Reduce context length;
  • Lower batch size;
  • Stop unrelated background tasks;
  • Put the model on a fast SSD;
  • Leave enough free system memory.

If system RAM is too small, a 6GB GPU alone cannot make the experience smooth.
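A quick way to apply the swap and free-memory checks before each run is to parse `free -m`. This sketch assumes the column layout of recent procps versions, where `available` is the seventh field of the `Mem:` row. The function names and the 2048 MiB default threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read `free -m` output on stdin and print "available_mb swap_used_mb".
mem_summary() {
    awk '/^Mem:/  { avail = $7 }
         /^Swap:/ { swap = $3 }
         END      { print avail + 0, swap + 0 }'
}

# Read `free -m` output on stdin and warn if swap is in use or if
# available RAM is below the threshold given as $1 (MiB, default 2048).
check_memory() {
    min_avail="${1:-2048}"
    mem_summary | {
        read -r avail swap
        if [ "$swap" -gt 0 ]; then
            echo "warning: ${swap} MiB of swap in use"
        fi
        if [ "$avail" -lt "$min_avail" ]; then
            echo "warning: only ${avail} MiB RAM available"
        fi
    }
}

# Usage: free -m | check_memory 4096
```

No output means both checks passed. A swap warning before the model is even loaded is a sign to close other programs first.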

Step 5: Do not max out context length first

Many models advertise long-context support, but low-VRAM machines should not start there.

Begin with a smaller context:

-c 4096

If that is stable, try:

-c 8192

When increasing context length, watch memory and speed.

Longer context means higher KV cache pressure. On low-VRAM devices, long context usually exposes problems more quickly than short-answer generation.

If your goal is local Q&A, code snippet explanation, or short summarization, you do not need a very large context at the beginning.
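To see why doubling the context matters, you can estimate the KV cache with the usual formula for standard attention: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The numbers below are placeholders, not the real parameters of any Qwen model, and models with sliding-window or other non-standard attention layers will need less.

```shell
#!/bin/sh
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX BYTES_PER_ELEM
# Rough KV cache size in MiB for standard attention layers.
kv_cache_mib() {
    echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Placeholder architecture: 48 layers, 4 KV heads, head dimension 128,
# 2 bytes per element (an f16 cache).
kv_cache_mib 48 4 128 4096 2   # prints 384
kv_cache_mib 48 4 128 8192 2   # prints 768
```

The estimate grows linearly with `-c`, so with these placeholder values going from 4096 to 8192 costs another 384 MiB. The actual layer and head counts of your model appear in the llama.cpp load log.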

Step 6: Watch batch parameters

Batch-related parameters in llama.cpp affect prompt processing and generation behavior. Names can vary between versions, so check help first:

./llama-cli --help

General ideas:

  • For long prompts, batch tuning may improve prompt processing;
  • When VRAM is tight, too large a batch can reduce stability;
  • Do not copy someone else’s values blindly. Test on your own machine.

Change only one parameter at a time.

For example, first fix the model, context, and -ngl, then try batch settings. Otherwise it is hard to know which change helped.
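One way to enforce the one-parameter rule is to generate the commands for a round up front, with everything but the tested value held constant. The sketch below only prints command lines and runs nothing. It assumes your build accepts `-b` and `-ub` for batch and micro-batch size and `-f` for a prompt file, which you should confirm with `--help`. `batch_plan` and `prompt.txt` are hypothetical names.

```shell
#!/bin/sh
# batch_plan NGL CTX UBATCH...
# Print one command per micro-batch value while -ngl, -c, and -b stay fixed.
batch_plan() {
    ngl="$1"
    ctx="$2"
    shift 2
    for ub in "$@"; do
        echo "./llama-cli -m /path/to/model.gguf -f prompt.txt -n 128 -ngl $ngl -c $ctx -b 2048 -ub $ub"
    done
}

# Three runs that differ only in -ub:
batch_plan 20 4096 128 256 512
```

Reviewing the printed lines before running them makes it obvious when two settings were changed at once by accident.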

Step 7: Record your five key parameters

The worst part of local low-VRAM tuning is getting something to run today and forgetting the working settings tomorrow.

Record these items for each test:

| Parameter | What to record |
| --- | --- |
| Model file | Model name, quantization, file size |
| GPU offload | -ngl or related offload settings |
| Context length | -c value |
| Batch | batch / ubatch and related settings |
| Result | tok/s, VRAM, RAM, stability |

A simple note can look like this:

model: Qwen-xx-35B-xxx.gguf
gpu: GTX 1060 6GB
ngl: 20
ctx: 4096
batch: default
speed: about 17 tok/s
status: stable for short text, long context needs more testing

Next time you change models or machines, you will have a baseline.
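If a free-form note is too easy to skip, the same fields can be appended to a CSV file after each run. This is a sketch: the file name `tuning-log.csv`, the `RESULTS` variable, and the `record_run` function are all invented for this example.

```shell
#!/bin/sh
# Hypothetical log location; override RESULTS to keep one file per machine.
RESULTS="${RESULTS:-tuning-log.csv}"

# record_run MODEL NGL CTX BATCH TOK_PER_S STATUS
# Append one row, writing the header first if the file is new or empty.
record_run() {
    if [ ! -s "$RESULTS" ]; then
        echo "date,model,ngl,ctx,batch,tok_per_s,status" > "$RESULTS"
    fi
    printf '%s,%s,%s,%s,%s,%s,%s\n' \
        "$(date +%Y-%m-%d)" "$1" "$2" "$3" "$4" "$5" "$6" >> "$RESULTS"
}

# Usage: record_run Qwen-xx-35B-xxx.gguf 20 4096 default 17 "stable for short text"
```

Avoid commas inside the status text, since the sketch does no CSV quoting.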

A steadier test flow

Use this order:

  1. Run without GPU offload to confirm the model loads;
  2. Add a low -ngl value and confirm output;
  3. Increase -ngl gradually to find the VRAM limit;
  4. Fix -ngl, then tune context length;
  5. Fix context, then test batch;
  6. Use the same prompt to compare tok/s;
  7. Run for 10 to 20 minutes and watch stability;
  8. Record the final parameters.

Do not use a different prompt every time. Different prompts make speed numbers hard to compare.

Prepare a fixed test prompt:

Explain in 800 words why MoE models can be suitable for low-VRAM inference, and give three cautions.

Use the same prompt in every round so you can tell whether the optimization really helped.
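To compare rounds without reading logs by eye, you can pull the generation speed out of the timing summary that llama.cpp prints at the end of a run. The sketch assumes the summary has an `eval time` line ending in `... tokens per second)`. The exact prefix and layout differ between versions, so treat the sample line as an assumed format and adjust the pattern to your own output.

```shell
#!/bin/sh
# Read a llama.cpp log on stdin and print the generation speed (tok/s).
# The prompt-processing line is skipped so only decoding speed is reported.
extract_tps() {
    awk '/eval time/ && !/prompt eval time/ {
             for (i = 2; i < NF; i++)
                 if ($i == "tokens" && $(i + 1) == "per") print $(i - 1)
         }'
}

# Example with an assumed log line, so it runs without the model:
echo 'llama_perf_context_print: eval time = 7470.59 ms / 127 runs ( 58.82 ms per token, 17.00 tokens per second)' | extract_tps
```

The timing summary normally goes to stderr, so a real run would look like `./llama-cli ... 2>&1 | extract_tps`.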

Common failed attempts

These mistakes waste a lot of time when tuning large models on low-VRAM hardware.

1. Raising GPU offload blindly

You see -ngl improve speed, so you keep increasing it.

The problem is that a GTX 1060 has only 6GB VRAM. Once you cross the limit, the program may error out, or it may start but become unstable.

2. Starting with very long context

Long context puts heavy pressure on memory and KV cache. Make the model stable with short context first, then expand.

3. Looking only at average tok/s

tok/s matters, but it is not the only metric.

Also watch:

  • First-token latency;
  • Prompt processing speed;
  • Whether VRAM overflows;
  • Whether long runs remain stable;
  • Whether the system becomes unusably slow.

4. Not recording parameters

Local inference tuning takes repeated experiments. Without notes, it is easy to forget the settings that worked.

What to expect from a GTX 1060

What is an old GPU like the GTX 1060 good for?

Good for:

  • Learning llama.cpp;
  • Testing GGUF models;
  • Running short local Q&A;
  • Experimenting with local model parameters;
  • Trying low-resource MoE inference;
  • Deciding whether a model is worth deploying on better hardware.

Not ideal for:

  • High-concurrency services;
  • Heavy long-context use;
  • Multiple users at the same time;
  • Production-scale RAG;
  • Latency-sensitive real-time applications.

Treat the GTX 1060 as an experiment machine and it is valuable. Treat it as a production LLM server and you will be disappointed.

One-sentence summary

Running Qwen 35B-like models on 6GB VRAM is not about stuffing everything into the GPU. It is about coordinating llama.cpp GPU offload, MoE behavior, system memory, context length, and batch parameters.

If you have an old GTX 1060, try this order:

Run first -> Tune -ngl -> Watch VRAM -> Control context -> Check RAM -> Test batch -> Record tok/s

Going from 3 tok/s to 17 tok/s is not magic. It comes from breaking the bottlenecks apart one by one.
