Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters

A practical summary of the key ideas, hardware requirements, llama.cpp parameters, and common pitfalls for running the Qwen3.6-35B-A3B multimodal GGUF model locally on an RTX 3070 8GB GPU.

Whether an 8GB GPU can run a 35B-class model depends on more than the total parameter count. Model architecture, quantization format, and the way the inference framework schedules work all matter.

The core idea in this setup is to use a GGUF quantized version of an MoE model such as Qwen3.6-35B-A3B, then use llama.cpp with CUDA acceleration, CPU Offload, MoE parameter scheduling, and KV Cache quantization to split memory pressure between the GPU and system RAM. With that approach, an older GPU such as the RTX 3070 8GB can still have a chance to run a 35B-class local multimodal model.

One point needs to be clear first: this is not “fitting a full 35B model entirely into 8GB of VRAM.” A more accurate way to understand it is that the GPU handles the compute that benefits most from GPU acceleration, while some expert layers and cache pressure are carried by system memory. The real experience depends on RAM capacity, CPU performance, quantization format, context length, and parameter choices.

Test environment

This kind of setup is sensitive to system memory. A reference configuration is:

  • CPU: Intel Core i7-12700 class
  • GPU: NVIDIA RTX 3070 8GB
  • RAM: 64GB
  • OS: Windows 11
  • Inference framework: llama.cpp CUDA build
  • Model format: GGUF

If you only have 16GB or 32GB of RAM, it is not necessarily impossible to try, but a 35B MoE model is more likely to create memory pressure during loading and long-context inference. For stable use, 64GB of RAM is a safer target.

Why 8GB VRAM can still run a 35B model

The key to Qwen3.6-35B-A3B is its MoE architecture. Its total parameter scale is 35B, but not all parameters are activated during each inference step; only part of the expert parameters are active.

That leads to two consequences:

  • The full model file is still large and requires enough disk space and system memory.
  • The active compute per inference step is lower than a full 35B Dense model.

llama.cpp’s CPU Offload and MoE-related parameters can further reduce the VRAM threshold. The GPU mainly handles attention and some high-value compute, while the CPU and system memory carry part of the expert-layer weights. The tradeoff is that speed, response latency, and stability depend more on the whole machine, not only the GPU model.

Preparing llama.cpp

Windows users can download a prebuilt CUDA version of llama.cpp directly. Pay attention to three points:

  1. The GPU driver should be new enough, and the CUDA runtime should match the llama.cpp package you download.
  2. After downloading, place it in a path without Chinese characters or special characters so batch scripts are easier to run.
  3. Put model files under a unified models directory to avoid very long paths in commands.

If you use AMD, Intel graphics, or a CPU-only environment, you can also choose Vulkan, HIP, SYCL, or CPU builds, but the parameters and performance will be different. This article focuses on the CUDA route for NVIDIA GPUs.

Download the model and multimodal projection file

The model used here is:

  • Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

The Q4_K_M quantization format is chosen mainly to balance accuracy, file size, and speed. On low-VRAM machines, it is not a good idea to start with a higher-precision version, because loading failures or frequent system paging become much more likely.

If you want image understanding, you also need the multimodal projection file, for example:

  • mmproj-BF16.gguf

This file is important. Downloading only the main model usually gives you text inference only. Without mmproj, the web UI may not expose a usable image upload feature, or uploaded images may not be processed correctly.

Keep the directory structure simple:

1
2
3
4
5
llama.cpp/
├─ llama-server.exe
└─ models/
   ├─ Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
   └─ mmproj-BF16.gguf

RTX 3070 8GB startup parameters

Below is an example startup script for an RTX 3070 8GB. Change the path to your own llama.cpp directory.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
@echo off
chcp 65001 >nul
cd /d D:\AI\llama.cpp

llama-server.exe ^
  -m "models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
  --mmproj "models\mmproj-BF16.gguf" ^
  -ngl 99 ^
  --n-cpu-moe 999 ^
  --flash-attn on ^
  --jinja ^
  -c 32768 ^
  -t 12 ^
  -b 512 ^
  -ub 128 ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --mlock ^
  --host 127.0.0.1 ^
  --port 8080

pause

After startup, open this address in your browser:

1
http://127.0.0.1:8080

If the page opens and the model replies normally, the service has started successfully. The first model load can be slow. Avoid launching multiple instances repeatedly during loading, because that can fill system memory more easily.

Understanding the key parameters

-ngl 99 tries to place as many layers as possible on the GPU. How many layers actually fit depends on the model structure, quantization format, and VRAM usage.

--n-cpu-moe 999 pushes more MoE expert layers to the CPU side, reducing VRAM pressure. It is one of the key parameters for running large MoE models on low-VRAM hardware.

--flash-attn on enables Flash Attention, which can reduce the cost of attention computation. Whether it is available depends on the current llama.cpp version and GPU support.

-c 32768 sets the context length. Long context significantly increases KV Cache pressure. If startup fails or inference is very slow, try lowering it to 8192 or 16384.

--cache-type-k q4_0 and --cache-type-v q4_0 quantize the KV Cache, saving memory and VRAM, though they may have a small impact on output quality and speed.

-b 512 and -ub 128 control batching-related parameters. In a low-VRAM environment, do not start with overly aggressive batch settings.

Common issues

If startup reports insufficient VRAM, first reduce the context length, for example changing -c 32768 to -c 8192, then try lowering -b and -ub.

If the image upload button is unavailable, first check whether the --mmproj path is correct and whether the mmproj file matches the model.

If the model responds slowly after loading, it usually does not mean the GPU is idle. Large amounts of weights or expert layers may be handled by the CPU and system memory. Use Task Manager to observe GPU, CPU, memory, and disk usage to identify the bottleneck.

If the output format looks wrong, confirm that --jinja is enabled and check whether the model requires the corresponding chat template.

If the browser cannot open the service after startup, check the --host and --port settings, and make sure port 8080 is not occupied by another program.

Who should try this

This setup is suitable for users who already have 8GB VRAM devices such as RTX 3070, RTX 4060 Laptop, or RTX 3060 8GB, but want to experiment with larger MoE models.

It is not suitable for people who need maximum speed. Running a 35B MoE model on low VRAM essentially trades CPU and system memory for a lower VRAM requirement. Being able to run it is one thing; whether it feels smooth enough is another.

If your goal is high-frequency daily chatting, 7B, 8B, or 14B models may feel better. If your goal is to explore larger MoE models, multimodal capability, and the boundary of local deployment, an RTX 3070 8GB with 64GB of RAM is still worth trying.

Summary

The reason an RTX 3070 8GB can run Qwen3.6-35B-A3B is not that the GPU suddenly has more VRAM. It is the combination of MoE architecture, GGUF quantization, llama.cpp CPU Offload, and KV Cache optimization that lowers the threshold.

The most interesting part of this setup is that it lets older GPUs still participate in local large-model experiments. As long as you accept tradeoffs in speed and stability, an 8GB VRAM machine can still be a local AI model testing platform, not only an entry-level device for small models.

References:

记录并分享
Built with Hugo
Theme Stack designed by Jimmy