A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio

Wed, 22 Apr 2026 21:47:34 +0800

Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.

If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use MoE models inside LM Studio with a sensible unloading strategy, so that 35B-class models can still run at a genuinely usable speed.

01 Why a 16GB GPU is not necessarily limited to 12B to 14B

The core idea is straightforward: VRAM size matters, but model architecture matters just as much.

If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.

But MoE models are different. Their total parameter count can be large, while only part of the expert parameters are activated in a single inference step. Take a 35B-class model as an example: although the total parameter count is high, the actual number of parameters participating in each inference step is much smaller, so its real VRAM requirement is not as extreme as many people assume.

That is exactly why a 16GB GPU still leaves some room to work with.

02 Key practical takeaway: 35B MoE models can run surprisingly fast

One representative case is a quantized MoE model such as Qwen 3.5 35B A3B. With a 16GB GPU and the right settings in LM Studio, Q6 quantization can reach something above 30 tokens/s, and Q4 can sometimes test even higher.

That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.

As a comparison, large models of a similar scale that are not MoE often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.

03 In LM Studio, the key is not just one parameter

If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:

GPU Offload
the setting that forces part of the expert layers into CPU memory

The first one is easy to understand. GPU Offload is basically something you push as high as possible, so the model prioritizes GPU computation.

The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since MoE models do not activate every expert on every step anyway, moving some experts into memory does not hurt overall inference speed as much as many people would expect.

A safer way to tune it is to start within a range and then adjust gradually for your machine:

start with related values somewhere between 20 and 35
then fine-tune based on VRAM usage and memory pressure

At its core, this method is using system memory to buy back VRAM headroom.

04 It can still run at 128K context, and smaller contexts reduce VRAM further

Another interesting point is that even with the context length pushed to 128K, a 35B-class MoE model can still maintain a relatively high speed.

That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like LM Studio, the real question is often not simply “can it run or not,” but rather:

are you willing to trade more system memory for less VRAM usage
are you willing to shorten the context length
are you willing to accept different capability tradeoffs across quantization levels

If the context is reduced further from 128K to 64K or 32K, VRAM pressure can drop even more. That means some 35B-class MoE models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.

05 The cost of this approach: much higher demands on RAM and virtual memory

This kind of setup is not free performance.

What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.

So if you want to try it yourself, it is worth checking a few things first:

whether your system RAM is large enough
whether your virtual memory allocation is large enough
whether too many background applications are already consuming resources

If those conditions are not in place, what you may get is not “35B running fast,” but an overall machine that becomes slow everywhere.

06 More aggressive quantization is not always better

There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.

The practical takeaway is that some models do run faster under Q4, but their original capability can also degrade more. By comparison, Q6 tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:

maximum speed and fitting into VRAM
or preserving more of the model’s original capability

Those two priorities do not necessarily lead to the same quantization choice.

07 What kinds of models are worth trying

From this angle, the best thing to try is not “blindly chase bigger parameter counts,” but to first look for models that fit this strategy:

models built on MoE architecture
models that are well supported in LM Studio and have complete quantized variants
models with clear advantages in long context or instruction following

And the idea does not stop at one 35B MoE model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.

The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.

08 Short conclusion

If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.

A more accurate way to put it is:

a 16GB GPU is not automatically ruled out for larger models
dense models and MoE models need to be considered separately
GPU Offload and expert-layer transfer to CPU memory inside LM Studio can significantly change VRAM usage
in practice, you are trading higher memory pressure for larger model scale and better usable speed

This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.

Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow

Wed, 08 Apr 2026 18:42:00 +0800

I ran a near-limit experiment: running Gemma 4 on a Raspberry Pi 5 (8GB RAM). I was not targeting larger variants, only the smallest E2B model.

Conclusion first: it runs and it is usable, but it fits low-interaction workflows better than real-time chat.

Test Environment

Device: Raspberry Pi 5 (4-core CPU, 8GB RAM)
OS: Ubuntu Server (no GUI)
Access method: SSH
Runtime: LM Studio CLI (command-line-only mode)
Model: Gemma 4 E2B (about 4.5GB)

Step 1: Install and Start LM Studio CLI

I installed the LM Studio CLI build on the Pi, then started the service and checked available commands.

For a terminal-only setup, this deployment mode is a good fit for Raspberry Pi.

Step 2: Move Model Storage to SSD

To avoid heavy SD card writes, I switched model download storage to an external SSD.

On Raspberry Pi 5, SSD usage is much more practical than on older models. For long-term local model runs, SSD is strongly recommended.

Step 3: Download and Load Gemma 4 E2B

After download, the model loaded into memory successfully.

According to official information, Gemma 4 includes:

Tool-calling support for agent-style workflows (function calling)
Multimodal capabilities (image/video; smaller models also include audio-related capability)
128K context window
Apache 2.0 license (commercial use allowed)

Given Raspberry Pi hardware limits, E2B is the most practical tier to start with.

Step 4: Start API and Enable LAN Access

After loading, I started the API on local port 4000 and confirmed model listing works via HTTP.

The issue: by default, it only listens on localhost, so other LAN devices cannot access it directly.

Since host binding was not exposed by the startup options, I used socat for port forwarding, bridging an external Pi port to LM Studio’s internal port.

Result: successful. I could query the model list from a MacBook on the same LAN.

Step 5: Connect to Editor (Zed)

LM Studio’s local server is OpenAI-API-compatible, so most tools that support custom base_url can connect.

I added a new LLM provider in Zed pointing to the Pi-hosted Gemma 4 instance, and in-editor chat worked.

Practical Usability

This setup is suitable for:

Local automation scripts
Low-concurrency, low-real-time assistant tasks
Personal learning and edge-device experimentation

Less suitable for:

High-frequency interactive chat
Development collaboration scenarios sensitive to response latency

Conclusion

Running Gemma 4 (E2B) on Raspberry Pi 5 is feasible, and the practical output quality is better than expected.

If your goal is offline operation, tool integration, and lightweight-to-mid tasks, this setup is worth trying. If your goal is smooth real-time interaction, stronger hardware is still the better choice.

LM Studio on KnightLi Blog