<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>LM Studio on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/lm-studio/</link>
        <description>Recent content in LM Studio on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Wed, 22 Apr 2026 21:47:34 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/lm-studio/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio</title>
        <link>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</link>
        <pubDate>Wed, 22 Apr 2026 21:47:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</guid>
        <description>&lt;p&gt;Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.&lt;/p&gt;
&lt;p&gt;With the right model and settings, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to run &lt;code&gt;MoE&lt;/code&gt; (Mixture of Experts) models in &lt;code&gt;LM Studio&lt;/code&gt; with a sensible offloading strategy, so that 35B-class models still run at a genuinely usable speed.&lt;/p&gt;
&lt;h2 id=&#34;01-why-a-16gb-gpu-is-not-necessarily-limited-to-12b-to-14b&#34;&gt;01 Why a 16GB GPU is not necessarily limited to 12B to 14B
&lt;/h2&gt;&lt;p&gt;The core idea is straightforward: VRAM size matters, but model architecture matters just as much.&lt;/p&gt;
&lt;p&gt;If you try to cram a standard dense model of that size into a 16GB GPU, you hit the wall quickly. A dense model uses all of its parameters for every generated token, so VRAM pressure and bandwidth pressure both rise immediately.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;MoE&lt;/code&gt; models are different. Their total parameter count can be large, but only a subset of the expert parameters is activated in a single inference step. Take a 35B-class model as an example: the total parameter count is high, yet far fewer parameters participate in each inference step, so its practical VRAM requirement is less extreme than the headline number suggests.&lt;/p&gt;
&lt;p&gt;That is exactly why a 16GB GPU still leaves some room to work with.&lt;/p&gt;
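&lt;p&gt;To make the gap between total and active parameters concrete, here is a rough Python sketch of the arithmetic. The bits-per-weight figures and the reading of &lt;code&gt;A3B&lt;/code&gt; as roughly 3B active parameters are assumptions for illustration, not measured values.&lt;/p&gt;

```python
# Rough weight-size arithmetic for a quantized MoE model.
# All numbers below are illustrative assumptions, not measurements.

GIB = 1024 ** 3


def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / GIB


# Assumed effective bits per weight (quantized files carry some overhead).
ASSUMED_BITS = {"Q4": 4.5, "Q6": 6.5}

TOTAL_B = 35.0   # total parameters, in billions
ACTIVE_B = 3.0   # parameters used per token, assuming "A3B" means about 3B active

for name, bits in ASSUMED_BITS.items():
    total = weights_gib(TOTAL_B, bits)
    active = weights_gib(ACTIVE_B, bits)
    print(f"{name}: all weights {total:.1f} GiB, active per token {active:.1f} GiB")
```

&lt;p&gt;Under these assumptions the full set of weights is larger than 16GB at either quantization level, while the share touched per token is only a few GiB. That gap is what the rest of this article exploits.&lt;/p&gt;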
&lt;h2 id=&#34;02-key-practical-takeaway-35b-moe-models-can-run-surprisingly-fast&#34;&gt;02 Key practical takeaway: 35B MoE models can run surprisingly fast
&lt;/h2&gt;&lt;p&gt;One representative case is a quantized &lt;code&gt;MoE&lt;/code&gt; model such as &lt;code&gt;Qwen 3.5 35B A3B&lt;/code&gt;. With a 16GB GPU and the right settings in &lt;code&gt;LM Studio&lt;/code&gt;, &lt;code&gt;Q6&lt;/code&gt; quantization can exceed 30 &lt;code&gt;tokens/s&lt;/code&gt;, and &lt;code&gt;Q4&lt;/code&gt; sometimes benchmarks higher still.&lt;/p&gt;
&lt;p&gt;That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.&lt;/p&gt;
&lt;p&gt;By comparison, dense models of a similar scale often overflow VRAM on a 16GB GPU and slow down sharply. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.&lt;/p&gt;
&lt;h2 id=&#34;03-in-lm-studio-the-key-is-not-just-one-parameter&#34;&gt;03 In LM Studio, the key is not just one parameter
&lt;/h2&gt;&lt;p&gt;If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the setting that forces part of the expert layers into CPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is easy to understand. Set &lt;code&gt;GPU Offload&lt;/code&gt; as high as it will go, so the model prioritizes GPU computation.&lt;/p&gt;
&lt;p&gt;The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers in CPU memory to reduce VRAM usage in advance. Since &lt;code&gt;MoE&lt;/code&gt; models do not activate every expert on every step anyway, moving some experts into system memory hurts overall inference speed less than many people would expect.&lt;/p&gt;
&lt;p&gt;A safer way to tune it is to start within a range and then adjust gradually for your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start with a value for that expert-offload setting somewhere between &lt;code&gt;20&lt;/code&gt; and &lt;code&gt;35&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;then fine-tune based on VRAM usage and memory pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At its core, this method uses system memory to buy back VRAM headroom.&lt;/p&gt;
&lt;h2 id=&#34;04-it-can-still-run-at-128k-context-and-smaller-contexts-reduce-vram-further&#34;&gt;04 It can still run at 128K context, and smaller contexts reduce VRAM further
&lt;/h2&gt;&lt;p&gt;Another interesting point is that even with the context length pushed to &lt;code&gt;128K&lt;/code&gt;, a 35B-class &lt;code&gt;MoE&lt;/code&gt; model can still maintain a relatively high speed.&lt;/p&gt;
&lt;p&gt;That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like &lt;code&gt;LM Studio&lt;/code&gt;, the real question is often not simply “can it run or not,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;are you willing to trade more system memory for less VRAM usage&lt;/li&gt;
&lt;li&gt;are you willing to shorten the context length&lt;/li&gt;
&lt;li&gt;are you willing to accept different capability tradeoffs across quantization levels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the context is reduced further from &lt;code&gt;128K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; or &lt;code&gt;32K&lt;/code&gt;, VRAM pressure can drop even more. That means some 35B-class &lt;code&gt;MoE&lt;/code&gt; models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.&lt;/p&gt;
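&lt;p&gt;Why shorter contexts help can be sketched with the usual key/value cache arithmetic. This is a rough estimate that assumes standard attention with a 16-bit cache; the layer count, KV head count, and head dimension below are hypothetical placeholders, not the real shape of any specific model.&lt;/p&gt;

```python
# Rough KV-cache size as a function of context length.
# The architecture numbers are hypothetical placeholders; read the real
# values from the model card of the model you actually load.

GIB = 1024 ** 3


def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every layer and every cached token, in GiB."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / GIB


# Hypothetical model shape, chosen only to make the arithmetic concrete.
SHAPE = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for ctx in (128 * 1024, 64 * 1024, 32 * 1024):
    print(f"{ctx // 1024}K context: {kv_cache_gib(ctx, **SHAPE):.2f} GiB")
```

&lt;p&gt;The point is the linear scaling: for this kind of cache, halving the context halves the memory it needs, whatever the real model shape turns out to be.&lt;/p&gt;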
&lt;h2 id=&#34;05-the-cost-of-this-approach-much-higher-demands-on-ram-and-virtual-memory&#34;&gt;05 The cost of this approach: much higher demands on RAM and virtual memory
&lt;/h2&gt;&lt;p&gt;This kind of setup is not free performance.&lt;/p&gt;
&lt;p&gt;The thing to watch is that as VRAM usage is squeezed down, system RAM usage rises noticeably, and so does virtual memory pressure. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.&lt;/p&gt;
&lt;p&gt;So if you want to try it yourself, it is worth checking a few things first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether your system RAM is large enough&lt;/li&gt;
&lt;li&gt;whether your virtual memory allocation is large enough&lt;/li&gt;
&lt;li&gt;whether too many background applications are already consuming resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those conditions are not in place, what you may get is not “35B running fast,” but a machine that slows down across the board.&lt;/p&gt;
&lt;h2 id=&#34;06-more-aggressive-quantization-is-not-always-better&#34;&gt;06 More aggressive quantization is not always better
&lt;/h2&gt;&lt;p&gt;There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.&lt;/p&gt;
&lt;p&gt;In practice, some models do run faster at &lt;code&gt;Q4&lt;/code&gt;, but they can also lose more of their original capability. By comparison, &lt;code&gt;Q6&lt;/code&gt; tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maximum speed and fitting into VRAM&lt;/li&gt;
&lt;li&gt;or preserving more of the model’s original capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two priorities do not necessarily lead to the same quantization choice.&lt;/p&gt;
&lt;h2 id=&#34;07-what-kinds-of-models-are-worth-trying&#34;&gt;07 What kinds of models are worth trying
&lt;/h2&gt;&lt;p&gt;From this angle, the goal is not to blindly chase bigger parameter counts, but to look first for models that fit this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;models built on &lt;code&gt;MoE&lt;/code&gt; architecture&lt;/li&gt;
&lt;li&gt;models that are well supported in &lt;code&gt;LM Studio&lt;/code&gt; and have complete quantized variants&lt;/li&gt;
&lt;li&gt;models with clear advantages in long context or instruction following&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea is not limited to one 35B &lt;code&gt;MoE&lt;/code&gt; model. It extends naturally to other candidates, such as experimental models with stronger long-context memory or better instruction following, and lighter quantized versions that run fast.&lt;/p&gt;
&lt;p&gt;The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.&lt;/p&gt;
&lt;h2 id=&#34;08-short-conclusion&#34;&gt;08 Short conclusion
&lt;/h2&gt;&lt;p&gt;If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.&lt;/p&gt;
&lt;p&gt;A more accurate way to put it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 16GB GPU is not automatically ruled out for larger models&lt;/li&gt;
&lt;li&gt;dense models and &lt;code&gt;MoE&lt;/code&gt; models need to be considered separately&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt; and expert-layer transfer to CPU memory inside &lt;code&gt;LM Studio&lt;/code&gt; can significantly change VRAM usage&lt;/li&gt;
&lt;li&gt;in practice, you are trading higher memory pressure for larger model scale and better usable speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Gemma 4 on Raspberry Pi 5: It Works, But Responses Are Slow</title>
        <link>https://knightli.com/en/2026/04/08/gemma4-on-raspberry-pi5-benchmark/</link>
        <pubDate>Wed, 08 Apr 2026 18:42:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/08/gemma4-on-raspberry-pi5-benchmark/</guid>
        <description>&lt;p&gt;I ran a near-limit experiment: running Gemma 4 on a &lt;code&gt;Raspberry Pi 5 (8GB RAM)&lt;/code&gt;. I was not targeting larger variants, only the smallest &lt;code&gt;E2B&lt;/code&gt; model.&lt;/p&gt;
&lt;p&gt;Conclusion first: it runs and it is usable, but it fits low-interaction workflows better than real-time chat.&lt;/p&gt;
&lt;h2 id=&#34;test-environment&#34;&gt;Test Environment
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Device: Raspberry Pi 5 (4-core CPU, 8GB RAM)&lt;/li&gt;
&lt;li&gt;OS: Ubuntu Server (no GUI)&lt;/li&gt;
&lt;li&gt;Access method: SSH&lt;/li&gt;
&lt;li&gt;Runtime: LM Studio CLI (command-line-only mode)&lt;/li&gt;
&lt;li&gt;Model: Gemma 4 E2B (about 4.5GB)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;step-1-install-and-start-lm-studio-cli&#34;&gt;Step 1: Install and Start LM Studio CLI
&lt;/h2&gt;&lt;p&gt;I installed the LM Studio CLI build on the Pi, then started the service and checked available commands.&lt;/p&gt;
&lt;p&gt;For a terminal-only setup, this deployment mode is a good fit for Raspberry Pi.&lt;/p&gt;
&lt;h2 id=&#34;step-2-move-model-storage-to-ssd&#34;&gt;Step 2: Move Model Storage to SSD
&lt;/h2&gt;&lt;p&gt;To avoid heavy SD card writes, I switched model download storage to an external SSD.&lt;/p&gt;
&lt;p&gt;On Raspberry Pi 5, SSD usage is much more practical than on older models. For long-term local model runs, SSD is strongly recommended.&lt;/p&gt;
&lt;h2 id=&#34;step-3-download-and-load-gemma-4-e2b&#34;&gt;Step 3: Download and Load Gemma 4 E2B
&lt;/h2&gt;&lt;p&gt;After download, the model loaded into memory successfully.&lt;/p&gt;
&lt;p&gt;According to official information, Gemma 4 includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tool-calling support for agent-style workflows (function calling)&lt;/li&gt;
&lt;li&gt;Multimodal capabilities (image/video; smaller models also include audio-related capability)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;128K&lt;/code&gt; context window&lt;/li&gt;
&lt;li&gt;Apache 2.0 license (commercial use allowed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given Raspberry Pi hardware limits, E2B is the most practical tier to start with.&lt;/p&gt;
&lt;h2 id=&#34;step-4-start-api-and-enable-lan-access&#34;&gt;Step 4: Start API and Enable LAN Access
&lt;/h2&gt;&lt;p&gt;After loading, I started the API on local port &lt;code&gt;4000&lt;/code&gt; and confirmed model listing works via HTTP.&lt;/p&gt;
&lt;p&gt;The issue: by default, it only listens on localhost, so other LAN devices cannot access it directly.&lt;/p&gt;
&lt;p&gt;Since host binding was not exposed by the startup options, I used &lt;code&gt;socat&lt;/code&gt; for port forwarding, bridging an external Pi port to LM Studio&amp;rsquo;s internal port.&lt;/p&gt;
&lt;p&gt;Result: successful. I could query the model list from a MacBook on the same LAN.&lt;/p&gt;
&lt;h2 id=&#34;step-5-connect-to-editor-zed&#34;&gt;Step 5: Connect to Editor (Zed)
&lt;/h2&gt;&lt;p&gt;LM Studio&amp;rsquo;s local server is OpenAI-API-compatible, so most tools that support custom &lt;code&gt;base_url&lt;/code&gt; can connect.&lt;/p&gt;
&lt;p&gt;I added a new LLM provider in Zed pointing to the Pi-hosted Gemma 4 instance, and in-editor chat worked.&lt;/p&gt;
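&lt;p&gt;As a minimal sketch of what “OpenAI-API-compatible” means in practice, the Python below lists models through the &lt;code&gt;/v1/models&lt;/code&gt; route that OpenAI-style servers expose. The host name and port are hypothetical placeholders; substitute the address of your Pi and the port you forwarded.&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical address: replace with your Pi's host and the forwarded port.
BASE_URL = "http://raspberrypi.local:4000/v1"


def models_url(base_url):
    """Build the model-listing URL used by OpenAI-compatible servers."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload):
    """Extract model ids from an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url, timeout=10):
    """Fetch and return the ids of the models the server reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Usage (needs a reachable server): print(list_models(BASE_URL))
```

&lt;p&gt;The same base URL is what you would paste into any tool that accepts a custom &lt;code&gt;base_url&lt;/code&gt;.&lt;/p&gt;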
&lt;h2 id=&#34;practical-usability&#34;&gt;Practical Usability
&lt;/h2&gt;&lt;p&gt;This setup is suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Local automation scripts&lt;/li&gt;
&lt;li&gt;Low-concurrency assistant tasks that do not need real-time responses&lt;/li&gt;
&lt;li&gt;Personal learning and edge-device experimentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Less suitable for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High-frequency interactive chat&lt;/li&gt;
&lt;li&gt;Collaborative development work that is sensitive to response latency&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;Running Gemma 4 (E2B) on &lt;code&gt;Raspberry Pi 5&lt;/code&gt; is feasible, and the practical output quality is better than expected.&lt;/p&gt;
&lt;p&gt;If your goal is offline operation, tool integration, and light to medium workloads, this setup is worth trying. If your goal is smooth real-time interaction, stronger hardware is still the better choice.&lt;/p&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/android-gemma4-install-run-guide/&#34; &gt;How to Install and Run Gemma 4 on Android: Complete Getting-Started Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/run-gemma4-on-laptop/&#34; &gt;How to Run Gemma 4 on a Laptop: 5-Minute Local Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/08/openclaw-connect-gemma4-local/&#34; &gt;Connect OpenClaw to Local Gemma 4: Complete Setup Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
