<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Inference Optimization on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/inference-optimization/</link>
        <description>Recent content in Inference Optimization on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Wed, 22 Apr 2026 21:47:34 +0800</lastBuildDate>
        <atom:link href="https://knightli.com/en/tags/inference-optimization/index.xml" rel="self" type="application/rss+xml" />
        <item>
        <title>A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio</title>
        <link>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</link>
        <pubDate>Wed, 22 Apr 2026 21:47:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</guid>
        <description>&lt;p&gt;Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.&lt;/p&gt;
&lt;p&gt;With the right model and settings, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use &lt;code&gt;MoE&lt;/code&gt; (Mixture of Experts) models inside &lt;code&gt;LM Studio&lt;/code&gt; with a sensible offloading strategy, so that 35B-class models can still run at a genuinely usable speed.&lt;/p&gt;
&lt;h2 id=&#34;01-why-a-16gb-gpu-is-not-necessarily-limited-to-12b-to-14b&#34;&gt;01 Why a 16GB GPU is not necessarily limited to 12B to 14B
&lt;/h2&gt;&lt;p&gt;The core idea is straightforward: VRAM size matters, but model architecture matters just as much.&lt;/p&gt;
&lt;p&gt;If you try to cram a standard dense model of that size into a 16GB GPU, you will hit the wall quickly. A dense model uses all of its parameters for every token, so VRAM pressure and bandwidth pressure both rise immediately.&lt;/p&gt;
&lt;p&gt;But &lt;code&gt;MoE&lt;/code&gt; models are different. Their total parameter count can be large, while only a subset of the experts is activated for each token. Take a 35B-class model as an example: although the total parameter count is high, far fewer parameters take part in each inference step, so part of the expert weights can be kept outside VRAM without a proportional slowdown, and the real VRAM requirement is not as extreme as many people assume.&lt;/p&gt;
&lt;p&gt;That is exactly why a 16GB GPU still leaves some room to work with.&lt;/p&gt;
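&lt;p&gt;To make the arithmetic concrete, here is a rough sketch. It assumes a hypothetical &lt;code&gt;MoE&lt;/code&gt; model with 35B total and 3B active parameters, stored at about 4.5 bits per weight. Those figures are illustrative assumptions, not measurements, and real quantized files carry extra overhead.&lt;/p&gt;

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed, illustrative values: 35B total parameters, 3B active per token,
# and roughly 4.5 bits per weight for a 4-bit class quantization.
TOTAL_B, ACTIVE_B, BITS = 35.0, 3.0, 4.5

print(f"all weights:      {weight_gib(TOTAL_B, BITS):.1f} GiB")
print(f"active per token: {weight_gib(ACTIVE_B, BITS):.1f} GiB")
```

&lt;p&gt;Under these assumptions the full set of weights is larger than 16GB, while the part touched for any single token is a small fraction of it. That gap is the room the rest of this post works with.&lt;/p&gt;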
&lt;h2 id=&#34;02-key-practical-takeaway-35b-moe-models-can-run-surprisingly-fast&#34;&gt;02 Key practical takeaway: 35B MoE models can run surprisingly fast
&lt;/h2&gt;&lt;p&gt;One representative case is a quantized &lt;code&gt;MoE&lt;/code&gt; model such as &lt;code&gt;Qwen 3.5 35B A3B&lt;/code&gt;. With a 16GB GPU and the right settings in &lt;code&gt;LM Studio&lt;/code&gt;, &lt;code&gt;Q6&lt;/code&gt; quantization can exceed 30 &lt;code&gt;tokens/s&lt;/code&gt;, and &lt;code&gt;Q4&lt;/code&gt; can sometimes measure even higher.&lt;/p&gt;
&lt;p&gt;That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.&lt;/p&gt;
&lt;p&gt;By comparison, dense models of a similar scale often overflow VRAM on a 16GB GPU and slow down sharply. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.&lt;/p&gt;
&lt;h2 id=&#34;03-in-lm-studio-the-key-is-not-just-one-parameter&#34;&gt;03 In LM Studio, the key is not just one parameter
&lt;/h2&gt;&lt;p&gt;If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the setting that forces part of the expert layers into CPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is easy to understand. Set &lt;code&gt;GPU Offload&lt;/code&gt; as high as your VRAM allows, so that as much of the model as possible is computed on the GPU.&lt;/p&gt;
&lt;p&gt;The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since &lt;code&gt;MoE&lt;/code&gt; models do not activate every expert on every step anyway, moving some experts into system memory does not hurt overall inference speed as much as many people would expect.&lt;/p&gt;
&lt;p&gt;A safer way to tune it is to start within a range and then adjust gradually for your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start with a value for the expert-to-CPU setting somewhere between &lt;code&gt;20&lt;/code&gt; and &lt;code&gt;35&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;then fine-tune based on VRAM usage and memory pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At its core, this method is using system memory to buy back VRAM headroom.&lt;/p&gt;
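&lt;p&gt;The tuning logic can be sketched as a small calculation. This is only an illustration: it assumes the setting counts layers whose experts stay in CPU memory, and every size below is a made-up placeholder rather than a measured value for any real model.&lt;/p&gt;

```python
import math


def layers_to_offload(expert_gib_per_layer: float, n_layers: int,
                      other_vram_gib: float, vram_budget_gib: float) -> int:
    """Smallest number of layers whose experts must move to CPU memory
    so that what remains on the GPU fits the VRAM budget.

    Returns n_layers if even moving every expert layer is not enough.
    """
    all_experts_gib = expert_gib_per_layer * n_layers
    overflow_gib = other_vram_gib + all_experts_gib - vram_budget_gib
    if overflow_gib > 0:
        needed = math.ceil(overflow_gib / expert_gib_per_layer)
    else:
        needed = 0
    return min(needed, n_layers)


# Hypothetical numbers: 40 layers, 0.6 GiB of experts per layer,
# 5 GiB for everything else on the GPU, and a 15 GiB usable budget.
print(layers_to_offload(0.6, 40, 5.0, 15.0))
```

&lt;p&gt;With these placeholder numbers the result is 24, which happens to fall inside the 20 to 35 starting range. On a real machine, treat the output as a first guess and adjust while watching actual VRAM and RAM usage.&lt;/p&gt;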
&lt;h2 id=&#34;04-it-can-still-run-at-128k-context-and-smaller-contexts-reduce-vram-further&#34;&gt;04 It can still run at 128K context, and smaller contexts reduce VRAM further
&lt;/h2&gt;&lt;p&gt;Another interesting point is that even with the context length pushed to &lt;code&gt;128K&lt;/code&gt;, a 35B-class &lt;code&gt;MoE&lt;/code&gt; model can still maintain a relatively high speed.&lt;/p&gt;
&lt;p&gt;That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like &lt;code&gt;LM Studio&lt;/code&gt;, the real question is often not simply “can it run or not,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;are you willing to trade more system memory for less VRAM usage&lt;/li&gt;
&lt;li&gt;are you willing to shorten the context length&lt;/li&gt;
&lt;li&gt;are you willing to accept different capability tradeoffs across quantization levels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the context is reduced further from &lt;code&gt;128K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; or &lt;code&gt;32K&lt;/code&gt;, VRAM pressure can drop even more. That means some 35B-class &lt;code&gt;MoE&lt;/code&gt; models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.&lt;/p&gt;
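&lt;p&gt;Why context length affects VRAM can be shown with the usual KV cache estimate: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and context length. The layer count, head count, and head dimension below are hypothetical and do not describe any specific model; architectures with different attention designs will not follow this formula exactly.&lt;/p&gt;

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a standard attention stack."""
    # Keys and values are both cached, hence the factor of 2.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 2**30


# Hypothetical architecture: 40 layers, 4 KV heads, head dimension 128,
# 16-bit cache values.
for ctx in (131072, 65536, 32768):
    print(ctx, kv_cache_gib(40, 4, 128, ctx), "GiB")
```

&lt;p&gt;With these made-up dimensions the cache is 10 GiB at 128K, 5 GiB at 64K, and 2.5 GiB at 32K. The estimate scales linearly with context length, which is why shortening the context frees VRAM so directly.&lt;/p&gt;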
&lt;h2 id=&#34;05-the-cost-of-this-approach-much-higher-demands-on-ram-and-virtual-memory&#34;&gt;05 The cost of this approach: much higher demands on RAM and virtual memory
&lt;/h2&gt;&lt;p&gt;This kind of setup is not free performance.&lt;/p&gt;
&lt;p&gt;The thing to watch is that as you squeeze VRAM usage down, system RAM usage rises noticeably, and virtual memory pressure rises with it. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.&lt;/p&gt;
&lt;p&gt;So if you want to try it yourself, it is worth checking a few things first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether your system RAM is large enough&lt;/li&gt;
&lt;li&gt;whether your virtual memory allocation is large enough&lt;/li&gt;
&lt;li&gt;whether too many background applications are already consuming resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those conditions are not in place, what you may get is not “35B running fast,” but a machine that becomes slow across the board.&lt;/p&gt;
&lt;h2 id=&#34;06-more-aggressive-quantization-is-not-always-better&#34;&gt;06 More aggressive quantization is not always better
&lt;/h2&gt;&lt;p&gt;There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.&lt;/p&gt;
&lt;p&gt;The practical takeaway is that some models do run faster under &lt;code&gt;Q4&lt;/code&gt;, but their original capability can also degrade more. By comparison, &lt;code&gt;Q6&lt;/code&gt; tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maximum speed and fitting into VRAM&lt;/li&gt;
&lt;li&gt;or preserving more of the model’s original capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two priorities do not necessarily lead to the same quantization choice.&lt;/p&gt;
&lt;h2 id=&#34;07-what-kinds-of-models-are-worth-trying&#34;&gt;07 What kinds of models are worth trying
&lt;/h2&gt;&lt;p&gt;From this angle, the best approach is not to blindly chase bigger parameter counts, but to first look for models that fit this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;models built on &lt;code&gt;MoE&lt;/code&gt; architecture&lt;/li&gt;
&lt;li&gt;models that are well supported in &lt;code&gt;LM Studio&lt;/code&gt; and have complete quantized variants&lt;/li&gt;
&lt;li&gt;models with clear advantages in long context or instruction following&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the idea does not stop at one 35B &lt;code&gt;MoE&lt;/code&gt; model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.&lt;/p&gt;
&lt;p&gt;The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.&lt;/p&gt;
&lt;h2 id=&#34;08-short-conclusion&#34;&gt;08 Short conclusion
&lt;/h2&gt;&lt;p&gt;If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.&lt;/p&gt;
&lt;p&gt;A more accurate way to put it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 16GB GPU is not automatically ruled out for larger models&lt;/li&gt;
&lt;li&gt;dense models and &lt;code&gt;MoE&lt;/code&gt; models need to be considered separately&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt; and expert-layer transfer to CPU memory inside &lt;code&gt;LM Studio&lt;/code&gt; can significantly change VRAM usage&lt;/li&gt;
&lt;li&gt;in practice, you are trading higher memory pressure for larger model scale and better usable speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization means compressing model parameters from higher-precision formats (such as &lt;code&gt;FP16&lt;/code&gt;) into lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
          &lt;td&gt;Low-spec devices, run-first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter scale.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
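&lt;p&gt;The naming rule above can be expressed as a small parser. This is a minimal sketch that only handles tags shaped like the example; the function name &lt;code&gt;parse_quant_suffix&lt;/code&gt; is made up for illustration and is not part of any tool.&lt;/p&gt;

```python
def parse_quant_suffix(tag: str) -> dict:
    """Split a tag like 'gemma-4:4b-q4_k_m' into model and quantization parts."""
    # Everything after the last hyphen is treated as the quantization suffix.
    model, _, quant = tag.rpartition("-")
    parts = quant.lower().split("_")
    head = parts[0]
    if not model or not head.startswith("q") or not head[1:].isdigit():
        raise ValueError(f"no quantization suffix found in {tag!r}")
    sizes = {"s": "small", "m": "medium", "l": "large"}
    return {
        "model": model,
        "bits": int(head[1:]),
        "k_quant": "k" in parts[1:],
        "size": next((sizes[p] for p in parts[1:] if p in sizes), None),
    }


print(parse_quant_suffix("gemma-4:4b-q4_k_m"))
```

&lt;p&gt;For the example tag this reports the model &lt;code&gt;gemma-4:4b&lt;/code&gt;, 4 bits, K-quants enabled, and the medium level. A suffix without a size letter, such as &lt;code&gt;q8_0&lt;/code&gt;, simply yields no size.&lt;/p&gt;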
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;
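&lt;p&gt;The table can also be written as a lookup. The sketch below treats each row as a lower bound on available memory, which is an interpretation of the table rather than something it states, and the function name is hypothetical.&lt;/p&gt;

```python
# Thresholds (in GB) and recommendations copied from the table above,
# ordered from the largest memory tier to the smallest.
RECOMMENDATIONS = [
    (32, ["FP16", "Q8_0"]),
    (16, ["Q5_K_M", "Q8_0"]),
    (8, ["Q4_K_M"]),
    (4, ["Q3_K_M", "Q2_K"]),
]


def recommend_quant(memory_gb: float) -> list:
    """Return the recommended formats for the highest tier that fits."""
    for threshold, formats in RECOMMENDATIONS:
        if memory_gb >= threshold:
            return formats
    # Below the smallest row: fall back to the most compact options (assumption).
    return ["Q3_K_M", "Q2_K"]


print(recommend_quant(12))
```

&lt;p&gt;A machine with 12 GB falls into the 8 GB tier here and gets &lt;code&gt;Q4_K_M&lt;/code&gt;. The right answer still depends on model size, so use this only as a starting point.&lt;/p&gt;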
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality is not enough, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo;, but &amp;ldquo;the most stable and usable result under your hardware limits.&amp;rdquo;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
