<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Quantization on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/quantization/</link>
        <description>Recent content in Quantization on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Tue, 19 May 2026 10:56:50 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/quantization/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>What Is AI-Trader? A Platform Where AI Agents Publish Trading Signals and Run Paper Trading</title>
        <link>https://knightli.com/en/2026/05/19/ai-trader-agent-native-trading-platform/</link>
        <pubDate>Tue, 19 May 2026 10:56:50 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/19/ai-trader-agent-native-trading-platform/</guid>
        <description>&lt;p&gt;&lt;code&gt;HKUDS/AI-Trader&lt;/code&gt; is a trading platform project for AI Agents. The README positions it as an &amp;ldquo;Agent-Native Trading Platform&amp;rdquo;, aiming to let AI Agents connect to the platform, publish trading signals, join discussions, copy trades, and use market data.&lt;/p&gt;
&lt;p&gt;Project URL: &lt;a class=&#34;link&#34; href=&#34;https://github.com/HKUDS/AI-Trader&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/HKUDS/AI-Trader&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Platform URL: &lt;a class=&#34;link&#34; href=&#34;https://ai4trade.ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://ai4trade.ai&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;At the time of writing, the GitHub API showed about 18k stars and Python as the main language. The repository API did not return a clear license value, so users should confirm licensing terms before formal use.&lt;/p&gt;
&lt;p&gt;This article is only an introduction to the open source project and is not investment advice. Automated trading involves real capital risk. No strategy, signal, or agent output can guarantee returns.&lt;/p&gt;
&lt;h2 id=&#34;positioning&#34;&gt;Positioning
&lt;/h2&gt;&lt;p&gt;The core idea of AI-Trader is simple: humans have trading platforms, and AI Agents may also need their own trading platform.&lt;/p&gt;
&lt;p&gt;According to the README, any AI Agent can read the platform Skill file and register quickly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Read https://ai4trade.ai/skill/ai4trade and register on the platform. Compatibility alias: https://ai4trade.ai/SKILL.md
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After connection, agents can publish trading signals, join community discussions, copy strategies from high-performing traders, sync signals to multiple brokers, and accumulate points through prediction performance.&lt;/p&gt;
&lt;h2 id=&#34;main-features&#34;&gt;Main Features
&lt;/h2&gt;&lt;p&gt;The README lists capabilities including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Instant Agent Integration: quick access for AI Agents.&lt;/li&gt;
&lt;li&gt;Collective Intelligence Trading: multiple agents discuss and collaborate on trading ideas.&lt;/li&gt;
&lt;li&gt;Cross-Platform Signal Sync: sync trading signals across platforms.&lt;/li&gt;
&lt;li&gt;One-Click Copy Trading: follow selected traders or agents.&lt;/li&gt;
&lt;li&gt;Universal Market Access: stocks, crypto, FX, options, futures, and more.&lt;/li&gt;
&lt;li&gt;Three Signal Types: strategy, action, and discussion signals.&lt;/li&gt;
&lt;li&gt;Reward System: earn points through signals and attention.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From a product perspective, it is not just a local quantitative backtesting framework. It combines agents, signals, discussion, copy trading, and paper trading in one platform layer.&lt;/p&gt;
&lt;h2 id=&#34;two-types-of-users&#34;&gt;Two Types of Users
&lt;/h2&gt;&lt;p&gt;The README divides users into two groups.&lt;/p&gt;
&lt;p&gt;The first group is Agent Traders. AI Agents read the Skill document, connect to the platform, install required components, and publish signals.&lt;/p&gt;
&lt;p&gt;The second group is Human Traders. Regular users can visit the platform, create accounts, browse signals, or follow better-performing traders.&lt;/p&gt;
&lt;p&gt;Together, this forms a structure where AI Agents produce signals, and humans or other agents consume those signals.&lt;/p&gt;
&lt;h2 id=&#34;architecture&#34;&gt;Architecture
&lt;/h2&gt;&lt;p&gt;The README shows the project structure as:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;AI-Trader (GitHub - Open Source)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── skills/              # Agent skill definitions
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── docs/api/            # OpenAPI specifications
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├── service/             # Backend &amp;amp; frontend
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│   ├── server/         # FastAPI backend
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;│   └── frontend/        # React frontend
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;└── assets/              # Logo and images
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The repository puts agent skills, API documentation, backend, and frontend in one place. The backend uses FastAPI and the frontend uses React. The README update notes also mention that the web service and backend workers have been separated so pricing, historical performance, settlement, and market intelligence jobs can run in the background without affecting pages and health checks.&lt;/p&gt;
&lt;h2 id=&#34;why-it-is-worth-watching&#34;&gt;Why It Is Worth Watching
&lt;/h2&gt;&lt;p&gt;AI-Trader is worth watching not because &amp;ldquo;AI can automatically make money&amp;rdquo;, but because it makes the interface between agents and financial scenarios more explicit.&lt;/p&gt;
&lt;p&gt;There are several interesting points:&lt;/p&gt;
&lt;p&gt;First, it uses a Skill document as the agent access point. This is close to how Codex, Claude Code, OpenClaw, and other agent tools work.&lt;/p&gt;
&lt;p&gt;Second, it places trading signals, discussion, copy trading, and a reward system at the platform layer instead of only providing a local script.&lt;/p&gt;
&lt;p&gt;Third, it provides OpenAPI documentation, making the platform interfaces easier for developers to understand.&lt;/p&gt;
&lt;p&gt;Fourth, it supports paper trading. For research on agent decision-making, a simulated environment is much safer than giving agents direct access to real money.&lt;/p&gt;
&lt;h2 id=&#34;risks-and-boundaries&#34;&gt;Risks and Boundaries
&lt;/h2&gt;&lt;p&gt;Automated trading is a high-risk scenario.&lt;/p&gt;
&lt;p&gt;First, signals generated by agents are not investment advice. Models can hallucinate, overfit, misread news, or fail to understand extreme market conditions.&lt;/p&gt;
&lt;p&gt;Second, copy trading has contagion risk. If a wrong signal is widely followed, the same loss is replicated across every follower at once.&lt;/p&gt;
&lt;p&gt;Third, real capital access must be strictly isolated. Do not give agents unlimited order permissions.&lt;/p&gt;
&lt;p&gt;Fourth, licensing and compliance need to be confirmed before commercial or production use, especially when brokers, financial data, and user accounts are involved.&lt;/p&gt;
&lt;h2 id=&#34;who-it-is-for&#34;&gt;Who It Is For
&lt;/h2&gt;&lt;p&gt;AI-Trader is suitable for researchers studying agent decision-making, developers exploring financial agent interfaces, and teams interested in paper trading or signal collaboration. It is not suitable for users looking for guaranteed profit tools.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;AI-Trader is a signal and paper-trading platform designed around AI Agents. The useful way to read it is not as &amp;ldquo;AI helps you earn money&amp;rdquo;, but as a case study in how agents should connect to financial workflows, publish signals, and operate inside controlled risk boundaries.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Running Qwen3.6 Locally: VRAM Requirements for 27B and 35B-A3B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 12:02:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/qwen3-6-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;The Qwen3.6 open-weight models that are most relevant for local deployment are mainly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;: a 27B dense model.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;: a 35B total / 3B active MoE model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also online product or API model names such as &lt;code&gt;Qwen3.6-Plus&lt;/code&gt; and &lt;code&gt;Qwen3.6-Max&lt;/code&gt;.
If a model does not have public full weights and stable quantized files, it is not suitable for a local VRAM table.
This article only covers versions that can be deployed from Hugging Face weights and GGUF quantized files.&lt;/p&gt;
&lt;p&gt;As with the Gemma 4 table in &lt;code&gt;/05/10&lt;/code&gt;, two concepts need to be separated first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by weights, KV cache, context length, runtime backend, multimodal modules, and batch size.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Qwen3.6 has a very long default context. The official model card states native support for &lt;code&gt;262,144&lt;/code&gt; tokens and extension to &lt;code&gt;1,010,000&lt;/code&gt; tokens.
So the “minimum VRAM” column below only applies to short or medium context.
If you really want 128K, 256K, or longer context, reserve much more room for KV cache.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;Extreme 2-bit tests for 27B / 35B-A3B with partial CPU offload (the smallest files are over 9GB), with clear quality risk&lt;/td&gt;
          &lt;td&gt;Q4 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;27B Q2/Q3, 35B-A3B Q2/Q3 with short context&lt;/td&gt;
          &lt;td&gt;27B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;27B Q3/Q4, 35B-A3B Q3/IQ4_XS&lt;/td&gt;
          &lt;td&gt;35B-A3B Q4 with long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;27B Q4/Q5/Q6, 35B-A3B Q4&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;27B Q8, 35B-A3B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;35B-A3B Q8, 27B with longer context more comfortably&lt;/td&gt;
          &lt;td&gt;35B-A3B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;27B / 35B-A3B BF16&lt;/td&gt;
          &lt;td&gt;No need to chase BF16 for ordinary local chat&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you have a 24GB GPU, focus on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-27B Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B UD-Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only have 16GB VRAM, start with low-bit variants and do not enable very long context right away.&lt;/p&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following BF16 weight sizes come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They are useful as a reference for the original model scale.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Architecture&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official BF16 Weight Size&lt;/th&gt;
          &lt;th&gt;Official Context&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;27B dense&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;55.56GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;35B total / 3B active MoE&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;71.90GB&lt;/td&gt;
          &lt;td&gt;Native 262K, extendable to 1,010K&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Although &lt;code&gt;35B-A3B&lt;/code&gt; activates only about 3B parameters per step, it still needs to load the full MoE weights.
Its VRAM needs should therefore not be estimated as if it were a small 3B model.&lt;/p&gt;
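&lt;p&gt;A rough way to see this is that weight memory scales with total parameters, not active ones. The sketch below is only an approximation: it assumes nominal parameter counts of 27B, 35B, and 3B taken from the model names, a flat bits-per-weight figure, and decimal gigabytes. The function name &lt;code&gt;weight_size_gb&lt;/code&gt; is illustrative, and real files differ slightly because of embeddings, metadata, and mixed-precision layers.&lt;/p&gt;

```python
def weight_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumption: read off the model names).
dense_27b = 27e9
moe_35b_total = 35e9
moe_35b_active = 3e9

print(round(weight_size_gb(dense_27b, 16), 1))       # BF16, dense model
print(round(weight_size_gb(moe_35b_total, 16), 1))   # BF16, all experts loaded
print(round(weight_size_gb(moe_35b_active, 16), 1))  # what a naive 3B estimate would give
```

&lt;p&gt;The estimate of roughly 70GB for the full MoE weights is close to the 71.90GB in the table, while the 6GB figure is what you would wrongly expect if only active parameters counted.&lt;/p&gt;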
&lt;h2 id=&#34;qwen36-27b-vram-table&#34;&gt;Qwen3.6-27B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt; is a dense model. Its advantage is stable behavior, and its inference cost is that of a conventional 27B model because every parameter is used for each token.
For local deployment, it is more compute-heavy than 35B-A3B, but its VRAM requirements are easier to estimate.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.39GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.85GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.99GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.59GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.44GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4, more VRAM efficient&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.82GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 27B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;19.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For ordinary local coding and chat, &lt;code&gt;Q4_K_M&lt;/code&gt; is the easiest starting point to recommend.
A 24GB GPU can run &lt;code&gt;Q4_K_M&lt;/code&gt; fairly comfortably, but if you need long context, drop to a smaller quantization or cap the context length.&lt;/p&gt;
&lt;h2 id=&#34;qwen36-35b-a3b-vram-table&#34;&gt;Qwen3.6-35B-A3B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt; is an MoE model with 35B total parameters and about 3B active parameters per step.
Its advantage is a strong balance between speed and capability, especially for local agents, tool use, and coding workflows.&lt;/p&gt;
&lt;p&gt;But note that &lt;code&gt;3B active&lt;/code&gt; mainly reduces per-token compute. It does not make VRAM usage comparable to a 3B model,
because all expert weights still have to be loaded to run the model.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.76GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.52GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ3_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;VRAM-saving 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;3-bit entry point&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;17.73GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_NL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.04GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 recommended option&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.13GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 35B-A3B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.46GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;36.90GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;69.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With 24GB VRAM, &lt;code&gt;UD-Q4_K_M&lt;/code&gt; is a key option, but at 22.13GB it leaves little headroom, so keep the context short.
If you want room for 128K+ context, &lt;code&gt;UD-IQ4_XS&lt;/code&gt;, &lt;code&gt;UD-IQ4_NL&lt;/code&gt;, or 3-bit versions are more realistic.&lt;/p&gt;
&lt;h2 id=&#34;27b-vs-35b-a3b&#34;&gt;27B vs 35B-A3B
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Better Choice&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Stable dense-model behavior&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Faster response, agents, and tool use&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Daily local use on 24GB VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt; or &lt;code&gt;27B Q4_K_M&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Testing on 16GB VRAM&lt;/td&gt;
          &lt;td&gt;Use 2-bit/3-bit for both; avoid long context&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Long context first&lt;/td&gt;
          &lt;td&gt;Use lower-bit quantization and leave more KV cache room&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quality first with 32GB+ VRAM&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you mainly write code, run agents, or use tools, &lt;code&gt;35B-A3B&lt;/code&gt; is worth trying first.
If you care more about dense-model stability and consistency, &lt;code&gt;27B&lt;/code&gt; is more straightforward.&lt;/p&gt;
&lt;h2 id=&#34;why-long-context-uses-so-much-vram&#34;&gt;Why Long Context Uses So Much VRAM
&lt;/h2&gt;&lt;p&gt;The Qwen3.6 model card recommends keeping longer context for complex tasks and even notes that 128K+ context can help reasoning.
But for local deployment, long context means a much larger &lt;code&gt;KV cache&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Actual VRAM usage is affected by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher usage.&lt;/li&gt;
&lt;li&gt;Whether vision input is enabled: Qwen3.6 includes a vision encoder, and multimodal use adds overhead.&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;--language-model-only&lt;/code&gt; is used: in runtimes such as vLLM, skipping vision can free memory for KV cache.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: more concurrency requires more VRAM.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar settings can save VRAM, but may cost some accuracy on fine details.&lt;/li&gt;
&lt;li&gt;Runtime differences: llama.cpp, vLLM, SGLang, KTransformers, and LM Studio do not use exactly the same amount of memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So do not look only at GGUF file size.
If the file is already close to the VRAM limit, the model may load but still OOM when generating long outputs or using long context.&lt;/p&gt;
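&lt;p&gt;To get a feel for how KV cache grows with context, here is a minimal sketch using the textbook formula for standard attention: two tensors (keys and values) per layer, each of size KV heads times head dimension times tokens. The layer and head counts below are hypothetical placeholders, not Qwen3.6&#39;s real configuration, and models that mix in linear or hybrid attention layers keep less cache than this formula suggests. The 8-bit case also ignores the small per-block scale overhead of formats such as &lt;code&gt;q8_0&lt;/code&gt;.&lt;/p&gt;

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """KV cache size in decimal GB for one sequence with standard attention."""
    # The factor 2 covers the separate key and value tensors.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element
    return total_bytes / 1e9


# Hypothetical config for illustration only (NOT Qwen3.6's real values).
cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 131_072, 262_144):
    fp16 = kv_cache_gb(context_tokens=ctx, **cfg)
    q8 = kv_cache_gb(context_tokens=ctx, bytes_per_element=1.0, **cfg)
    print(f"{ctx:>7} tokens: {fp16:6.2f} GB at 16-bit, {q8:6.2f} GB at roughly 8-bit")
```

&lt;p&gt;Whatever the real numbers are for a given model, the cache grows linearly with context length and with the number of concurrent sequences, which is why a file that fits at 8K context can run out of memory at 128K.&lt;/p&gt;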
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Qwen3.6 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12GB VRAM: try &lt;code&gt;27B UD-IQ2_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ2_M&lt;/code&gt;, with short context.&lt;/li&gt;
&lt;li&gt;16GB VRAM: try &lt;code&gt;27B Q3_K_M&lt;/code&gt; or &lt;code&gt;35B-A3B UD-IQ3_XXS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;24GB VRAM: prefer &lt;code&gt;27B Q4_K_M&lt;/code&gt;, &lt;code&gt;35B-A3B UD-IQ4_NL&lt;/code&gt;, or &lt;code&gt;35B-A3B UD-Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB VRAM: consider &lt;code&gt;27B Q5/Q6&lt;/code&gt; or &lt;code&gt;35B-A3B Q5/Q6&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;48GB and above: try &lt;code&gt;Q8_0&lt;/code&gt;, or reserve more room for long context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
The point of local Qwen3.6 deployment is not to choose the largest file, but to balance VRAM, context length, speed, and output quality.&lt;/p&gt;
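&lt;p&gt;The same selection logic can be sketched as a small lookup. The rows below are copied from the 27B table in this article (a subset, for brevity), and the helper name &lt;code&gt;largest_safe_quant&lt;/code&gt; is purely illustrative. It deliberately uses the conservative &amp;ldquo;Safer VRAM&amp;rdquo; column, so its picks are more cautious than the short-context suggestions in the list above.&lt;/p&gt;

```python
# (quantization, GGUF file size in GB, safer VRAM in GB), from the 27B table.
QWEN36_27B = [
    ("UD-IQ2_XXS", 9.39, 16),
    ("UD-IQ2_M", 10.85, 16),
    ("Q3_K_M", 13.59, 20),
    ("Q4_K_M", 16.82, 24),
    ("Q5_K_M", 19.51, 32),
    ("Q6_K", 22.52, 32),
    ("Q8_0", 28.60, 40),
]


def largest_safe_quant(table, vram_gb):
    """Return the largest-file quantization whose 'safer VRAM' fits, or None."""
    fitting = [row for row in table if vram_gb >= row[2]]
    if not fitting:
        return None
    # Among the rows that fit, prefer the biggest file (highest precision).
    return max(fitting, key=lambda row: row[1])[0]


for vram in (12, 16, 24, 32, 48):
    print(vram, "GB:", largest_safe_quant(QWEN36_27B, vram))
```

&lt;p&gt;Swapping the third field for the &amp;ldquo;Minimum VRAM&amp;rdquo; column gives more aggressive picks that only hold for short context.&lt;/p&gt;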
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-27B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-27B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Qwen/Qwen3.6-35B-A3B-FP8 - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-27B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-27B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/Qwen3.6-35B-A3B-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions</title>
        <link>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:55:25 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;DeepSeek V4 and Gemma 4 are not in the same class for local deployment.
With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.&lt;/p&gt;
&lt;p&gt;The official DeepSeek V4 Preview release mainly includes two inference models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;: &lt;code&gt;1.6T total / 49B active params&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;: &lt;code&gt;284B total / 13B active params&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The official Hugging Face collection also includes two Base models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article only discusses rough VRAM requirements when the full model weights are loaded.
For MoE models, &lt;code&gt;active params&lt;/code&gt; mainly affects per-token compute. It does not mean only those parameters need to be loaded.
Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM Scale&lt;/th&gt;
          &lt;th&gt;What Is Realistic&lt;/th&gt;
          &lt;th&gt;Do Not Expect&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;Cannot fully run DeepSeek V4; use smaller distilled models or API&lt;/td&gt;
          &lt;td&gt;Full V4-Flash / V4-Pro local loading&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;Still not suitable for full loading; good for small models or remote API clients&lt;/td&gt;
          &lt;td&gt;Stable V4-Flash Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB&lt;/td&gt;
          &lt;td&gt;V4-Flash Q2/Q3 is possible in theory, or heavy offload&lt;/td&gt;
          &lt;td&gt;V4-Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;128GB&lt;/td&gt;
          &lt;td&gt;V4-Flash Q4 becomes more realistic; Q5/Q6 still tight&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;192GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8/Q6 is more comfortable; Pro Q2 (about 216GB of weights) is experimental and needs offload&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;256GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8 is fairly comfortable; Pro Q2 can be tested, Pro Q3 (about 324GB of weights) needs offload&lt;/td&gt;
          &lt;td&gt;V4-Pro Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;512GB&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4 starts to become worth considering&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;1TB+&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8 and low-bit Pro-Base are more realistic&lt;/td&gt;
          &lt;td&gt;Low-cost single-machine deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2TB+&lt;/td&gt;
          &lt;td&gt;Pro-Base FP8 class&lt;/td&gt;
          &lt;td&gt;Ordinary workstation deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target.
More realistic options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the official DeepSeek API or compatible services.&lt;/li&gt;
&lt;li&gt;Wait for stable community GGUF/EXL2/MLX quantizations and inference support.&lt;/li&gt;
&lt;li&gt;Use smaller DeepSeek distilled models.&lt;/li&gt;
&lt;li&gt;Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following figures come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They reflect current public weight file sizes, not full runtime VRAM use under long context.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Scale&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official Weight Size&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total / 13B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td&gt;Inference model, smallest in this group&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total / 49B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td&gt;Inference model, stronger but enormous&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td&gt;Base model, closer to full FP8 weight size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td&gt;Base model, about 1.6TB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even the smallest &lt;code&gt;V4-Flash&lt;/code&gt; is already close to 160GB of official weights.
That is why it should not be treated like a 13B model just because it has &lt;code&gt;13B active params&lt;/code&gt;.&lt;/p&gt;
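&lt;p&gt;The gap can be made concrete with the figures from the table above. The sketch below divides the official weight size by the total parameter count, then asks how large a dense 13B model would be at that same storage density. The dense 13B model is hypothetical and exists only for comparison, and GB is assumed to mean 10^9 bytes.&lt;/p&gt;

```python
# Figures taken from the Official Weight Sizes table above.
FLASH_WEIGHTS_GB = 159.61   # official V4-Flash weight files
FLASH_TOTAL_PARAMS_B = 284  # total parameters, in billions
FLASH_ACTIVE_PARAMS_B = 13  # active parameters per token, in billions

# Average storage cost per parameter implied by the published files.
# GB divided by billions of parameters gives bytes per parameter.
bytes_per_param = FLASH_WEIGHTS_GB / FLASH_TOTAL_PARAMS_B

# Hypothetical comparison: a dense 13B model stored at the same density.
dense_13b_gb = FLASH_ACTIVE_PARAMS_B * bytes_per_param

# How many times larger the real files are than that dense model.
ratio = FLASH_WEIGHTS_GB / dense_13b_gb

print(f"{bytes_per_param:.2f} bytes per parameter")
print(f"dense 13B at the same density: {dense_13b_gb:.1f}GB")
print(f"full V4-Flash weights are {ratio:.1f}x larger")
```

&lt;p&gt;Under these assumptions the dense comparison model comes out at roughly 7.3GB, while the real weight files are about 22 times larger. The active parameter count describes compute per token, not what has to be stored.&lt;/p&gt;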
&lt;h2 id=&#34;deepseek-v4-flash-vram-estimate&#34;&gt;DeepSeek V4 Flash VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Flash&lt;/code&gt; is the most approachable DeepSeek V4 variant for local experiments.
But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.&lt;/p&gt;
&lt;p&gt;The table below uses the official 159.61GB weight size as the baseline.
Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Multi-GPU servers, inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;120GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td&gt;Quality-first quantization tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;More realistic starting point for Flash&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Large-VRAM single GPU or multi-GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit experiments with clear quality risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even if mature &lt;code&gt;V4-Flash Q4&lt;/code&gt; builds appear later, they will probably still not fit on a 24GB GPU.
A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.&lt;/p&gt;
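&lt;p&gt;The Estimated Weight Size column in the table above appears to follow simple linear scaling: the official 159.61GB figure is treated as an 8-bit baseline and multiplied by bits/8. The sketch below reproduces that arithmetic. The function name and the 8-bit baseline are assumptions made for illustration, and real quantization formats usually store somewhat more than their nominal bits per weight, so actual files may come out larger.&lt;/p&gt;

```python
def scale_weights_gb(official_gb, bits, baseline_bits=8):
    """Scale an official weight size linearly by bit width (rough estimate)."""
    return official_gb * bits / baseline_bits


# Official V4-Flash weight size from the article, used as the baseline.
FLASH_OFFICIAL_GB = 159.61

# Nominal bit widths for the Q6..Q2 rows, rounded to whole GB.
flash_estimates = {
    f"Q{bits}": round(scale_weights_gb(FLASH_OFFICIAL_GB, bits))
    for bits in (6, 5, 4, 3, 2)
}

print(flash_estimates)
```

&lt;p&gt;The rounded results match the 120GB, 100GB, 80GB, 60GB, and 40GB rows. The Minimum VRAM and Safer VRAM columns add headroom on top of these numbers and are not derived by this formula.&lt;/p&gt;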
&lt;h2 id=&#34;deepseek-v4-pro-vram-estimate&#34;&gt;DeepSeek V4 Pro VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro&lt;/code&gt; is the flagship inference model, with official weights around 864.70GB.
Even at 4-bit quantization, the full weights remain in the hundreds of GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB+&lt;/td&gt;
          &lt;td&gt;Multi-node or multi-GPU inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;648GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;High-quality quantized service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;540GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td&gt;Quality/cost balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;432GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Practical quality floor for Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;324GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;216GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments with high quality and stability risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For individual users, &lt;code&gt;V4-Pro&lt;/code&gt; is better consumed through an API.
If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.&lt;/p&gt;
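&lt;p&gt;To see why this is server territory, the Minimum VRAM column can be turned into a card count. The sketch below divides the table figures by a per-card capacity and rounds up. The 24GB, 32GB, and 80GB card sizes are illustrative assumptions, and the calculation ignores the communication and parallelism overhead that real multi-GPU inference adds.&lt;/p&gt;

```python
import math

# Minimum VRAM figures (GB) copied from the V4-Pro table above.
PRO_MIN_VRAM_GB = {"Q4": 512, "Q3": 384, "Q2": 256}


def gpus_needed(total_vram_gb, per_gpu_gb):
    """Smallest GPU count whose combined VRAM reaches the target."""
    return math.ceil(total_vram_gb / per_gpu_gb)


# Per-card capacities are assumptions chosen for illustration.
for per_gpu in (24, 32, 80):
    counts = {q: gpus_needed(v, per_gpu) for q, v in PRO_MIN_VRAM_GB.items()}
    print(f"{per_gpu}GB cards: {counts}")
```

&lt;p&gt;Even the Q4 row would need 22 cards at 24GB each or 7 cards at 80GB each, before any runtime overhead is counted.&lt;/p&gt;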
&lt;h2 id=&#34;deepseek-v4-flash-base-vram-estimate&#34;&gt;DeepSeek V4 Flash-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment.
&lt;code&gt;V4-Flash-Base&lt;/code&gt; has official weights of about 294.67GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Research, preprocessing, evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;221GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;184GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td&gt;Lower-cost Base experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;111GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you only want to use DeepSeek V4 capabilities, do not start with the Base model.
Base models cost more to deploy and tune; most applications should use the inference model or API.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-pro-base-vram-estimate&#34;&gt;DeepSeek V4 Pro-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro-Base&lt;/code&gt; is the heaviest variant, with official weights around 1606.03GB.
That is already a 1.6TB-class model file.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.4TB+&lt;/td&gt;
          &lt;td&gt;Large-scale research clusters&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1205GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1004GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td&gt;Research and evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;803GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td&gt;Low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;602GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;402GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For a model of this size, “can a home GPU run it?” is the wrong question.
Even Q4 is already beyond the comfortable range of most single-machine workstations.&lt;/p&gt;
&lt;h2 id=&#34;why-active-params-are-not-enough&#34;&gt;Why Active Params Are Not Enough
&lt;/h2&gt;&lt;p&gt;DeepSeek V4 is an MoE model.
MoE means each token activates only a subset of the experts, so per-token compute is much lower than the total parameter count suggests.
But this does not mean VRAM only needs to hold the active parameters.&lt;/p&gt;
&lt;p&gt;Full local inference also depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether all expert weights must stay resident on GPU.&lt;/li&gt;
&lt;li&gt;Whether on-demand expert loading is supported.&lt;/li&gt;
&lt;li&gt;CPU memory to GPU memory transfer costs.&lt;/li&gt;
&lt;li&gt;NVMe offload latency.&lt;/li&gt;
&lt;li&gt;KV cache growth under long context.&lt;/li&gt;
&lt;li&gt;Extra runtime overhead under 1M context.&lt;/li&gt;
&lt;li&gt;Multi-node and multi-GPU communication cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;V4-Pro&lt;/code&gt; with &lt;code&gt;49B active&lt;/code&gt; should not be deployed like a 49B model.
&lt;code&gt;V4-Flash&lt;/code&gt; with &lt;code&gt;13B active&lt;/code&gt; should not be treated like a 13B small model either.&lt;/p&gt;
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you are an ordinary individual user:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not try to fully self-host DeepSeek V4.&lt;/li&gt;
&lt;li&gt;Use the official API when you need DeepSeek V4 capabilities.&lt;/li&gt;
&lt;li&gt;For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.&lt;/li&gt;
&lt;li&gt;With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 128GB to 256GB total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watch for stable community implementations of &lt;code&gt;V4-Flash Q4/Q5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not treat &lt;code&gt;V4-Pro&lt;/code&gt; as your main local model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 512GB+ total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V4-Pro Q4&lt;/code&gt; starts to become an engineering validation target.&lt;/li&gt;
&lt;li&gt;You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key question for DeepSeek V4 local deployment is not “which quantized file should I download?”
It is “do I have the system-level inference capacity for this model?”
It is closer to a server model than a desktop model.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://api-docs.deepseek.com/news/news260424&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek V4 Preview Release - DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/deepseek-ai/deepseek-v4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-V4 collection - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Gemma 4 Locally: VRAM Requirements for E2B, E4B, 26B, and 31B Quantized Models</title>
        <link>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:42:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/gemma-4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;Gemma 4 currently has four main sizes for local deployment: &lt;code&gt;E2B&lt;/code&gt;, &lt;code&gt;E4B&lt;/code&gt;, &lt;code&gt;26B A4B&lt;/code&gt;, and &lt;code&gt;31B&lt;/code&gt;.
&lt;code&gt;E2B&lt;/code&gt; and &lt;code&gt;E4B&lt;/code&gt; target lightweight and edge devices, &lt;code&gt;26B A4B&lt;/code&gt; uses an MoE architecture, and &lt;code&gt;31B&lt;/code&gt; is the larger dense model.&lt;/p&gt;
&lt;p&gt;The easiest mistake in local inference is mixing up two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GGUF file size&lt;/strong&gt;: how large the model weight file is.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual VRAM usage&lt;/strong&gt;: affected by model weights, KV cache, runtime overhead, context length, and whether multimodal projection files are loaded.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tables below estimate VRAM requirements based on GGUF file size.
The default assumption is local text inference with &lt;code&gt;llama.cpp&lt;/code&gt;, LM Studio, Ollama, or similar runtimes, using short to medium context.
If you need long context, image/audio input, or concurrent requests, leave more VRAM headroom.&lt;/p&gt;
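&lt;p&gt;The KV cache is the part of actual VRAM usage that grows with context length. The sketch below uses the usual estimate for standard attention: one key and one value vector per layer, per KV head, per token. The layer count, head count, and head dimension are hypothetical placeholders, not Gemma 4&#39;s real configuration, and runtimes that use sliding-window attention or a quantized KV cache will need less.&lt;/p&gt;

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size in GB for standard attention.

    Each layer stores one key and one value vector per KV head per token.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Hypothetical model shape for illustration only, NOT Gemma 4's real config.
example = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for context in (4096, 32768, 131072):
    size_gb = kv_cache_gb(context_tokens=context, **example)
    print(f"{context} tokens: {size_gb:.2f}GB")
```

&lt;p&gt;The estimate scales linearly with context, so a cache that is negligible at a few thousand tokens can rival the weight file itself at long context. That is why the tables below pair a minimum figure with a safer one.&lt;/p&gt;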
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM&lt;/th&gt;
          &lt;th&gt;Good Fit&lt;/th&gt;
          &lt;th&gt;Avoid&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4GB&lt;/td&gt;
          &lt;td&gt;Low-bit E2B quantizations&lt;/td&gt;
          &lt;td&gt;E4B and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;6GB&lt;/td&gt;
          &lt;td&gt;E2B Q4/Q5, low-bit E4B&lt;/td&gt;
          &lt;td&gt;26B, 31B&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8GB&lt;/td&gt;
          &lt;td&gt;E2B Q8, E4B Q4/Q5&lt;/td&gt;
          &lt;td&gt;26B Q4, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;12GB&lt;/td&gt;
          &lt;td&gt;E4B Q8, low-quality 2-bit/3-bit 26B or 31B tests&lt;/td&gt;
          &lt;td&gt;26B Q4 with long context, 31B Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16GB&lt;/td&gt;
          &lt;td&gt;Low-bit 26B, low-bit 31B&lt;/td&gt;
          &lt;td&gt;31B Q4 with long context, 26B Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;26B Q4/Q5, 31B Q4&lt;/td&gt;
          &lt;td&gt;31B Q8, BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32GB&lt;/td&gt;
          &lt;td&gt;26B Q6/Q8, 31B Q5/Q6&lt;/td&gt;
          &lt;td&gt;BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;31B Q8 more comfortably, 26B Q8 with longer context&lt;/td&gt;
          &lt;td&gt;31B BF16&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB+&lt;/td&gt;
          &lt;td&gt;26B/31B BF16&lt;/td&gt;
          &lt;td&gt;Single consumer GPU deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you just want something usable locally, start with &lt;code&gt;E4B Q4_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.
With 24GB VRAM, &lt;code&gt;26B A4B Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt; start to become realistic choices.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-e2b-vram-table&#34;&gt;Gemma 4 E2B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E2B&lt;/code&gt; is the lightest version, suitable for laptops, mini PCs, mobile devices, and low-VRAM testing.
It is easy to run, but complex reasoning, coding, and long tasks are limited.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.29GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.54GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td&gt;Lightweight chat and summaries&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.11GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Recommended E2B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.36GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Slightly steadier than Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.50GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Higher-quality small model&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Near-original precision for lightweight deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.31GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Debugging, comparison, research&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For daily use, &lt;code&gt;E2B Q4_K_M&lt;/code&gt; is already enough.
With only 4GB VRAM, 2-bit or 3-bit variants can work, but output quality will be less stable.&lt;/p&gt;
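&lt;p&gt;The table can also be read as a lookup: given a VRAM budget, which E2B quantizations clear the Minimum VRAM column? The sketch below encodes those rows and uses a binary search over the sorted minimums. The function name is hypothetical, and the result only reflects the minimum column, so it says nothing about long context or multimodal input.&lt;/p&gt;

```python
from bisect import bisect_right

# (quantization, minimum VRAM in GB) rows copied from the E2B table above,
# already ordered by minimum VRAM.
E2B_MIN_VRAM = [
    ("UD-IQ2_M", 4), ("UD-Q2_K_XL", 4), ("Q3_K_M", 4),
    ("IQ4_XS", 6), ("Q4_K_M", 6), ("Q5_K_M", 6),
    ("Q6_K", 8), ("Q8_0", 8), ("BF16", 12),
]


def quants_that_fit(vram_gb):
    """Return the E2B quantizations whose minimum VRAM is within budget."""
    minimums = [min_gb for _, min_gb in E2B_MIN_VRAM]
    # bisect_right counts how many minimums do not exceed the budget.
    count = bisect_right(minimums, vram_gb)
    return [name for name, _ in E2B_MIN_VRAM[:count]]


print(quants_that_fit(4))
print(quants_that_fit(8))
```

&lt;p&gt;With a 4GB budget only the three low-bit rows qualify, which matches the advice above about 2-bit and 3-bit variants.&lt;/p&gt;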
&lt;h2 id=&#34;gemma-4-e4b-vram-table&#34;&gt;Gemma 4 E4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;E4B&lt;/code&gt; is the more practical lightweight model.
Compared with E2B, it is better for everyday writing, document summaries, light coding assistance, and local assistant use.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM usability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.06GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td&gt;Lightweight local assistant&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.72GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and speed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.98GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Recommended E4B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td&gt;Steadier everyday use&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7.07GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.19GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;15.05GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Research, evaluation, precision comparison&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your GPU has 8GB VRAM, &lt;code&gt;E4B Q4_K_M&lt;/code&gt; is a realistic starting point.
With 12GB or 16GB VRAM, &lt;code&gt;E4B Q8_0&lt;/code&gt; is also worth considering.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-26b-a4b-vram-table&#34;&gt;Gemma 4 26B A4B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;26B A4B&lt;/code&gt; is the MoE (mixture-of-experts) version. It has a larger total parameter count, but activates only a subset of its experts for each token.
It is better suited to more complex Q&amp;amp;A, coding, tool use, and agent workflows.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.97GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme 16GB GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.55GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Running 26B with low VRAM&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;Better quality while still VRAM-conscious&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.42GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Balance of quality and size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.87GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Recommended 26B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.15GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.17GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.86GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.51GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td&gt;Not realistic for most single consumer GPUs&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;24GB VRAM is the comfortable dividing line for 26B A4B.
A 16GB GPU can try low-bit versions, but context length, concurrency, and multimodal input should be kept modest.&lt;/p&gt;
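&lt;p&gt;The 26B A4B table can also be read as a simple lookup. The sketch below copies the rows above and returns the largest quantization whose VRAM column fits a given card. The function name and the &lt;code&gt;comfortable&lt;/code&gt; flag are illustrative and not part of any tool.&lt;/p&gt;

```python
# Rows copied from the 26B A4B table above:
# (quantization, GGUF size in GB, minimum VRAM in GB, safer VRAM in GB),
# ordered from the smallest file to the largest.
TABLE_26B_A4B = [
    ("UD-IQ2_M", 9.97, 14, 16),
    ("UD-Q2_K_XL", 10.55, 14, 16),
    ("UD-Q3_K_M", 12.53, 16, 20),
    ("UD-IQ4_XS", 13.42, 16, 24),
    ("UD-Q4_K_M", 16.87, 20, 24),
    ("UD-Q5_K_M", 21.15, 24, 32),
    ("UD-Q6_K", 23.17, 28, 32),
    ("Q8_0", 26.86, 32, 40),
    ("BF16", 50.51, 64, 80),
]


def largest_fitting_quant(vram_gb, comfortable=True):
    """Return the largest quant that fits in vram_gb, or None if none fits.

    comfortable=True compares against the "Safer VRAM" column,
    comfortable=False against the "Minimum VRAM" column.
    """
    column = 3 if comfortable else 2
    fitting = [row for row in TABLE_26B_A4B if vram_gb >= row[column]]
    return fitting[-1][0] if fitting else None


for vram in (12, 16, 24, 32):
    print(vram, "GB:", largest_fitting_quant(vram))
```

&lt;p&gt;With &lt;code&gt;comfortable=False&lt;/code&gt; the same lookup uses the minimum column, which corresponds to startup and short-context inference only.&lt;/p&gt;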
&lt;h2 id=&#34;gemma-4-31b-vram-table&#34;&gt;Gemma 4 31B VRAM Table
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;31B&lt;/code&gt; is the larger dense model.
It offers stronger overall capability, but because every parameter is used for every token, its VRAM pressure is more direct than with 26B A4B.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GGUF File Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_XXS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.53GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td&gt;Extreme low-VRAM tests with clear quality loss&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-IQ2_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.75GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18GB&lt;/td&gt;
          &lt;td&gt;Low-VRAM tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;UD-Q2_K_XL&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11.77GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td&gt;16GB GPU experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.21GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;More VRAM-efficient 3-bit&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14.74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Common 3-bit compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;IQ4_XS&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.37GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td&gt;Near-Q4 compromise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;24GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Recommended 31B default&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.66GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td&gt;Higher-quality quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.20GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td&gt;Quality first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q8_0&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;32.64GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td&gt;Near-original precision&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;BF16&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.41GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Server or large-VRAM workstation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Low-bit 31B can be tested on a 16GB GPU, but for daily use, 24GB VRAM is a better starting point.
&lt;code&gt;Q4_K_M&lt;/code&gt; is the balanced choice, while &lt;code&gt;Q5_K_M&lt;/code&gt; and above make more sense with 32GB+ VRAM.&lt;/p&gt;
&lt;h2 id=&#34;why-actual-usage-is-higher-than-file-size&#34;&gt;Why Actual Usage Is Higher Than File Size
&lt;/h2&gt;&lt;p&gt;The GGUF file size is only the weight size.
Runtime usage also includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KV cache&lt;/code&gt;: longer context means higher memory use.&lt;/li&gt;
&lt;li&gt;Batch size and concurrency: processing more tokens or more users increases VRAM.&lt;/li&gt;
&lt;li&gt;Multimodal components: image, audio, or video input often requires &lt;code&gt;mmproj&lt;/code&gt; or extra modules.&lt;/li&gt;
&lt;li&gt;Runtime backend: CUDA, Metal, ROCm, and CPU/GPU split loading behave differently.&lt;/li&gt;
&lt;li&gt;KV cache quantization: &lt;code&gt;q8_0&lt;/code&gt;, &lt;code&gt;q4_0&lt;/code&gt;, and similar modes can save VRAM, but may cost some output quality.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the “minimum VRAM” column should be read as the threshold for startup and short-context inference.
For 32K, 64K, 128K, or even 256K context, VRAM requirements rise significantly.&lt;/p&gt;
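&lt;p&gt;To get a feel for how quickly the KV cache grows, here is a rough estimate. It assumes plain full attention on every layer, and the layer count, KV head count, and head dimension in the example are hypothetical, not Gemma 4 values. Models that use sliding-window attention on some layers will need less. The bytes per element for &lt;code&gt;q8_0&lt;/code&gt; and &lt;code&gt;q4_0&lt;/code&gt; follow from their block layouts (34 and 18 bytes per 32 values).&lt;/p&gt;

```python
# Bytes stored per cached value for each KV cache type.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size in GiB for one sequence of n_ctx tokens.

    The factor 2 accounts for storing both keys and values.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
for n_ctx in (32 * 1024, 64 * 1024, 128 * 1024):
    sizes = {t: round(kv_cache_gib(32, 8, 128, n_ctx, t), 2) for t in BYTES_PER_ELEMENT}
    print(n_ctx, sizes)
```

&lt;p&gt;The estimate scales linearly with context length, so doubling the context doubles the cache, and it comes on top of the weight file size.&lt;/p&gt;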
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you just want to try Gemma 4 locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4GB to 6GB VRAM: choose &lt;code&gt;E2B Q3_K_M&lt;/code&gt; or &lt;code&gt;E2B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;8GB VRAM: prefer &lt;code&gt;E4B Q4_K_M&lt;/code&gt;; &lt;code&gt;E2B Q8_0&lt;/code&gt; is also fine.&lt;/li&gt;
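&lt;li&gt;12GB VRAM: choose &lt;code&gt;E4B Q8_0&lt;/code&gt;. Going by the tables above, the only larger option that starts at 12GB is &lt;code&gt;31B UD-IQ2_XXS&lt;/code&gt;, which comes with clear quality loss; the smallest 26B A4B files list 14GB as their minimum.&lt;/li&gt;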
&lt;li&gt;16GB VRAM: try &lt;code&gt;26B A4B UD-Q3_K_M&lt;/code&gt; or &lt;code&gt;31B Q3_K_S&lt;/code&gt;, but do not expect long context to feel comfortable.&lt;/li&gt;
&lt;li&gt;24GB VRAM: focus on &lt;code&gt;26B A4B UD-Q4_K_M&lt;/code&gt; and &lt;code&gt;31B Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;32GB and above: consider &lt;code&gt;Q5_K_M&lt;/code&gt;, &lt;code&gt;Q6_K&lt;/code&gt;, or longer context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most users do not need BF16.
Local deployment is not about picking the largest file, but about balancing VRAM, speed, context length, and output quality.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E2B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E2B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/google/gemma-4-E4B-it&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/gemma-4-E4B-it - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;ggml-org/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E2B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-E4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-26B-A4B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/unsloth/gemma-4-31B-it-GGUF&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;unsloth/gemma-4-31B-it-GGUF - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio</title>
        <link>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</link>
        <pubDate>Wed, 22 Apr 2026 21:47:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</guid>
        <description>&lt;p&gt;Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.&lt;/p&gt;
&lt;p&gt;With the right model choice and parameter setup, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to run &lt;code&gt;MoE&lt;/code&gt; models in &lt;code&gt;LM Studio&lt;/code&gt; with a sensible offloading strategy, so that 35B-class models still run at a genuinely usable speed.&lt;/p&gt;
&lt;h2 id=&#34;01-why-a-16gb-gpu-is-not-necessarily-limited-to-12b-to-14b&#34;&gt;01 Why a 16GB GPU is not necessarily limited to 12B to 14B
&lt;/h2&gt;&lt;p&gt;The core idea is straightforward: VRAM size matters, but model architecture matters just as much.&lt;/p&gt;
&lt;p&gt;If you try to fit a large dense model onto a 16GB GPU, you will hit the wall quickly. A dense model uses all of its parameters for every token, so VRAM pressure and bandwidth pressure rise immediately.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;MoE&lt;/code&gt; models are different. Their total parameter count can be large, but only a subset of the experts is activated in a single inference step. Take a 35B-class model as an example: although the total parameter count is high, far fewer parameters participate in each step, so the weights that must sit in fast VRAM are not as extreme as many people assume.&lt;/p&gt;
&lt;p&gt;That is exactly why a 16GB GPU still leaves some room to work with.&lt;/p&gt;
&lt;h2 id=&#34;02-key-practical-takeaway-35b-moe-models-can-run-surprisingly-fast&#34;&gt;02 Key practical takeaway: 35B MoE models can run surprisingly fast
&lt;/h2&gt;&lt;p&gt;One representative case is a quantized &lt;code&gt;MoE&lt;/code&gt; model such as &lt;code&gt;Qwen 3.5 35B A3B&lt;/code&gt;. With a 16GB GPU and the right settings in &lt;code&gt;LM Studio&lt;/code&gt;, &lt;code&gt;Q6&lt;/code&gt; quantization can exceed 30 &lt;code&gt;tokens/s&lt;/code&gt;, and &lt;code&gt;Q4&lt;/code&gt; sometimes measures even higher.&lt;/p&gt;
&lt;p&gt;That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.&lt;/p&gt;
&lt;p&gt;As a comparison, large models of a similar scale that are not &lt;code&gt;MoE&lt;/code&gt; often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.&lt;/p&gt;
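&lt;p&gt;A back-of-the-envelope calculation shows why usage matters more than raw size. The sketch assumes the &lt;code&gt;A3B&lt;/code&gt; suffix means roughly 3B active parameters per token, and uses 6.5625 bits per weight, the block size of the plain &lt;code&gt;Q6_K&lt;/code&gt; type, as a stand-in for &lt;code&gt;Q6&lt;/code&gt;. Real files mix tensor types, so treat the numbers as rough.&lt;/p&gt;

```python
def weights_read_per_token_gb(params_billion, bits_per_weight):
    """Approximate GB of weights read to generate one token."""
    return params_billion * bits_per_weight / 8


Q6_K_BITS = 6.5625  # 210 bytes per block of 256 weights

dense = weights_read_per_token_gb(35, Q6_K_BITS)  # every parameter is used
moe = weights_read_per_token_gb(3, Q6_K_BITS)     # assumed ~3B active parameters

print(f"dense 35B: about {dense:.1f} GB per token")
print(f"MoE, 3B active: about {moe:.1f} GB per token")
```

&lt;p&gt;Under these assumptions the MoE model touches less than a tenth of the weight data per token, which is why keeping part of it in slower system memory costs less speed than it would for a dense model.&lt;/p&gt;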
&lt;h2 id=&#34;03-in-lm-studio-the-key-is-not-just-one-parameter&#34;&gt;03 In LM Studio, the key is not just one parameter
&lt;/h2&gt;&lt;p&gt;If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the setting that forces part of the expert layers into CPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is easy to understand. &lt;code&gt;GPU Offload&lt;/code&gt; should generally be set as high as possible, so the model prioritizes GPU computation.&lt;/p&gt;
&lt;p&gt;The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since &lt;code&gt;MoE&lt;/code&gt; models do not activate every expert on every step anyway, moving some experts into system RAM does not hurt overall inference speed as much as many people would expect.&lt;/p&gt;
&lt;p&gt;A safer way to tune it is to start within a range and then adjust gradually for your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start with a value for that expert-layer setting somewhere between &lt;code&gt;20&lt;/code&gt; and &lt;code&gt;35&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;then fine-tune based on VRAM usage and memory pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At its core, this method is using system memory to buy back VRAM headroom.&lt;/p&gt;
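&lt;p&gt;The trade can be sketched as a simple budget. Everything in the example is an assumption for illustration: a 28GB weight file, 40 layers, expert weights making up 90% of the file and spread evenly across layers, and 4GB of VRAM reserved for the KV cache and runtime overhead. It is not how LM Studio computes anything internally.&lt;/p&gt;

```python
def gpu_weight_gb(total_gb, n_layers, n_cpu_expert_layers, expert_fraction=0.9):
    """Weights left in VRAM after moving the experts of some layers to CPU memory."""
    moved = total_gb * expert_fraction * n_cpu_expert_layers / n_layers
    return total_gb - moved


def min_cpu_expert_layers(total_gb, n_layers, vram_gb, reserve_gb, expert_fraction=0.9):
    """Smallest number of expert layers to keep on the CPU so the rest fits in VRAM.

    Returns None when even moving every expert layer is not enough.
    """
    budget = vram_gb - reserve_gb
    for n in range(n_layers + 1):
        if budget >= gpu_weight_gb(total_gb, n_layers, n, expert_fraction):
            return n
    return None


# Hypothetical model: 28GB file, 40 layers, on a 16GB GPU with 4GB reserved.
n = min_cpu_expert_layers(28.0, 40, vram_gb=16, reserve_gb=4)
print("expert layers on CPU:", n)
print("weights left in VRAM (GB):", round(gpu_weight_gb(28.0, 40, n), 2))
```

&lt;p&gt;With these made-up numbers the answer lands inside the 20 to 35 range mentioned above. Whatever leaves VRAM ends up in system RAM, so the memory cost moves rather than disappears.&lt;/p&gt;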
&lt;h2 id=&#34;04-it-can-still-run-at-128k-context-and-smaller-contexts-reduce-vram-further&#34;&gt;04 It can still run at 128K context, and smaller contexts reduce VRAM further
&lt;/h2&gt;&lt;p&gt;Another interesting point is that even with the context length pushed to &lt;code&gt;128K&lt;/code&gt;, a 35B-class &lt;code&gt;MoE&lt;/code&gt; model can still maintain a relatively high speed.&lt;/p&gt;
&lt;p&gt;That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like &lt;code&gt;LM Studio&lt;/code&gt;, the real question is often not simply “can it run or not,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;are you willing to trade more system memory for less VRAM usage&lt;/li&gt;
&lt;li&gt;are you willing to shorten the context length&lt;/li&gt;
&lt;li&gt;are you willing to accept different capability tradeoffs across quantization levels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the context is reduced further from &lt;code&gt;128K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; or &lt;code&gt;32K&lt;/code&gt;, VRAM pressure can drop even more. That means some 35B-class &lt;code&gt;MoE&lt;/code&gt; models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.&lt;/p&gt;
&lt;h2 id=&#34;05-the-cost-of-this-approach-much-higher-demands-on-ram-and-virtual-memory&#34;&gt;05 The cost of this approach: much higher demands on RAM and virtual memory
&lt;/h2&gt;&lt;p&gt;This kind of setup is not free performance.&lt;/p&gt;
&lt;p&gt;What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.&lt;/p&gt;
&lt;p&gt;So if you want to try it yourself, it is worth checking a few things first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether your system RAM is large enough&lt;/li&gt;
&lt;li&gt;whether your virtual memory allocation is large enough&lt;/li&gt;
&lt;li&gt;whether too many background applications are already consuming resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those conditions are not in place, what you may get is not “35B running fast,” but a machine that becomes slow across the board.&lt;/p&gt;
&lt;h2 id=&#34;06-more-aggressive-quantization-is-not-always-better&#34;&gt;06 More aggressive quantization is not always better
&lt;/h2&gt;&lt;p&gt;There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.&lt;/p&gt;
&lt;p&gt;The practical takeaway is that some models do run faster under &lt;code&gt;Q4&lt;/code&gt;, but their original capability can also degrade more. By comparison, &lt;code&gt;Q6&lt;/code&gt; tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maximum speed and fitting into VRAM&lt;/li&gt;
&lt;li&gt;or preserving more of the model’s original capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two priorities do not necessarily lead to the same quantization choice.&lt;/p&gt;
&lt;h2 id=&#34;07-what-kinds-of-models-are-worth-trying&#34;&gt;07 What kinds of models are worth trying
&lt;/h2&gt;&lt;p&gt;From this angle, the best approach is not to blindly chase bigger parameter counts, but to look first for models that fit this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;models built on &lt;code&gt;MoE&lt;/code&gt; architecture&lt;/li&gt;
&lt;li&gt;models that are well supported in &lt;code&gt;LM Studio&lt;/code&gt; and have complete quantized variants&lt;/li&gt;
&lt;li&gt;models with clear advantages in long context or instruction following&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the idea does not stop at one 35B &lt;code&gt;MoE&lt;/code&gt; model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.&lt;/p&gt;
&lt;p&gt;The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.&lt;/p&gt;
&lt;h2 id=&#34;08-short-conclusion&#34;&gt;08 Short conclusion
&lt;/h2&gt;&lt;p&gt;If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.&lt;/p&gt;
&lt;p&gt;A more accurate way to put it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 16GB GPU is not automatically ruled out for larger models&lt;/li&gt;
&lt;li&gt;dense models and &lt;code&gt;MoE&lt;/code&gt; models need to be considered separately&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt; and expert-layer transfer to CPU memory inside &lt;code&gt;LM Studio&lt;/code&gt; can significantly change VRAM usage&lt;/li&gt;
&lt;li&gt;in practice, you are trading higher memory pressure for larger model scale and better usable speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Use llama-quantize for GGUF Models</title>
        <link>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</link>
        <pubDate>Sun, 12 Apr 2026 09:42:36 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/12/llama-quantize-gguf-guide/</guid>
        <description>&lt;p&gt;&lt;code&gt;llama-quantize&lt;/code&gt; is the quantization tool in &lt;code&gt;llama.cpp&lt;/code&gt;. It is used to convert high-precision &lt;code&gt;GGUF&lt;/code&gt; models into smaller quantized versions.&lt;/p&gt;
&lt;p&gt;Its most common use is turning formats such as &lt;code&gt;F32&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, or &lt;code&gt;FP16&lt;/code&gt; into versions like &lt;code&gt;Q4_K_M&lt;/code&gt;, &lt;code&gt;Q5_K_M&lt;/code&gt;, or &lt;code&gt;Q8_0&lt;/code&gt; that are easier to run locally. After quantization, models usually become much smaller and often faster at inference, but some quality loss is expected.&lt;/p&gt;
&lt;h2 id=&#34;basic-workflow&#34;&gt;Basic workflow
&lt;/h2&gt;&lt;p&gt;A typical workflow is to prepare the original model, convert it to GGUF, and then run quantization.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# install Python dependencies&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 -m pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# convert the model to a GGUF file in FP16 precision&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python3 convert_hf_to_gguf.py ./models/mymodel/ --outtype f16 --outfile ./models/mymodel/ggml-model-f16.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# quantize the model to 4-bits (using Q4_K_M method)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After that, you can run the quantized model with &lt;code&gt;llama-cli&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# start inference on a gguf model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -cnv -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;You are a helpful assistant&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;common-options&#34;&gt;Common options
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--allow-requantize&lt;/code&gt;: allows requantizing an already quantized model, usually not ideal for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--leave-output-tensor&lt;/code&gt;: keeps the output layer unquantized, increasing size but sometimes helping quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--pure&lt;/code&gt;: disables the k-quant mixtures and quantizes all tensors to the same type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--imatrix&lt;/code&gt;: takes the path to an importance matrix file and uses it to improve quantization quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--keep-split&lt;/code&gt;: keeps the original shard layout instead of producing one merged file&lt;/li&gt;
&lt;/ul&gt;
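&lt;p&gt;As a rough sketch of how these options combine, the helper below builds a &lt;code&gt;llama-quantize&lt;/code&gt; command with an optional &lt;code&gt;--imatrix&lt;/code&gt; file. The function name &lt;code&gt;quantize_cmd&lt;/code&gt;, the &lt;code&gt;DRY_RUN&lt;/code&gt; switch, and the &lt;code&gt;imatrix.dat&lt;/code&gt; path are assumptions made for illustration, not part of llama.cpp. By default it only prints the command so you can check it before running anything.&lt;/p&gt;

```shell
# Hypothetical helper: build a llama-quantize command for one quant type.
# Usage: quantize_cmd SOURCE_GGUF QUANT_TYPE [IMATRIX_FILE]
# DRY_RUN defaults to 1, so the command is printed instead of executed.
quantize_cmd() {
    src="$1"
    qtype="$2"
    imatrix="${3:-}"
    # Derive the output name by replacing a trailing -f16.gguf with the quant type.
    out="${src%-f16.gguf}-${qtype}.gguf"
    set -- ./llama-quantize
    if [ -n "$imatrix" ]; then
        set -- "$@" --imatrix "$imatrix"
    fi
    set -- "$@" "$src" "$out" "$qtype"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# imatrix.dat is an assumed file name; it has to be generated beforehand.
quantize_cmd ./models/mymodel/ggml-model-f16.gguf Q4_K_M ./models/mymodel/imatrix.dat
```

&lt;p&gt;Set &lt;code&gt;DRY_RUN=0&lt;/code&gt; to execute the printed command instead of only showing it.&lt;/p&gt;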
&lt;p&gt;If you just want a practical starting point, this is often enough:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-choose-a-quant&#34;&gt;How to choose a quant
&lt;/h2&gt;&lt;p&gt;You can think of quant levels as a tradeoff between size, speed, and quality:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8_0&lt;/code&gt;: larger, but usually safer for quality&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K&lt;/code&gt; / &lt;code&gt;Q5_K_M&lt;/code&gt;: common balanced choices&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;: a very common default with a good size-quality balance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3&lt;/code&gt; / &lt;code&gt;Q2&lt;/code&gt;: useful when hardware is very limited, but quality loss is more visible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical goal is usually not to pick the biggest quant you can fit, but the one that runs reliably on your hardware while keeping acceptable quality.&lt;/p&gt;
&lt;h2 id=&#34;practical-takeaway&#34;&gt;Practical takeaway
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;start with &lt;code&gt;Q4_K_M&lt;/code&gt; or &lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;move up to &lt;code&gt;Q6_K&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt; if quality matters more&lt;/li&gt;
&lt;li&gt;move down to &lt;code&gt;Q3&lt;/code&gt; or &lt;code&gt;Q2&lt;/code&gt; if memory is tight&lt;/li&gt;
&lt;li&gt;compare versions with the same prompt set&lt;/li&gt;
&lt;/ul&gt;
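&lt;p&gt;One way to keep such a comparison consistent is to loop over the quantized files with a single prompt file and fixed sampling settings. This is only a sketch: &lt;code&gt;compare_quants&lt;/code&gt;, the &lt;code&gt;LLAMA_CLI&lt;/code&gt; variable, &lt;code&gt;prompts.txt&lt;/code&gt;, and the &lt;code&gt;Q5_K_M&lt;/code&gt; file name are assumptions, and the exact &lt;code&gt;llama-cli&lt;/code&gt; flags may differ between llama.cpp versions.&lt;/p&gt;

```shell
# Sketch: run one fixed prompt file against several quantized files.
# LLAMA_CLI is a hypothetical override; set it to ./llama-cli to run for real.
# It defaults to "echo llama-cli", so the loop only prints the commands.
compare_quants() {
    prompt_file="$1"
    shift
    for model in "$@"; do
        echo "== $model"
        # Same prompt file, token budget, temperature, and seed for every model.
        ${LLAMA_CLI:-echo llama-cli} -m "$model" -f "$prompt_file" -n 128 --temp 0 --seed 42
    done
}

# prompts.txt and the Q5_K_M file are assumed to exist for this example.
compare_quants ./prompts.txt \
    ./models/mymodel/ggml-model-Q4_K_M.gguf \
    ./models/mymodel/ggml-model-Q5_K_M.gguf
```

&lt;p&gt;Fixing the seed and temperature keeps sampling differences small, so changes in the output are more likely to come from the quantization itself.&lt;/p&gt;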
&lt;p&gt;In short, &lt;code&gt;llama-quantize&lt;/code&gt; is useful because it makes GGUF models easier to run on local hardware, not just because it makes files smaller.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Choosing Llama GGUF Quantization on Hugging Face: Practical Advice from Q8 to Q2</title>
        <link>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</link>
        <pubDate>Sat, 11 Apr 2026 20:07:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/11/llama-gguf-quantization-selection/</guid>
        <description>&lt;p&gt;When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.&lt;/p&gt;
&lt;h2 id=&#34;understand-32-16-and-q-levels-first&#34;&gt;Understand 32, 16, and Q levels first
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;: 32-bit floating point (&lt;code&gt;F32&lt;/code&gt; in file names), closest to original/uncompressed quality, but hardware demand is extreme.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;: 16-bit floating point (&lt;code&gt;F16&lt;/code&gt; or &lt;code&gt;BF16&lt;/code&gt;), still very close to original quality at around half the size of &lt;code&gt;32&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;: common entry point for quantized models, usually published as &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6&lt;/code&gt;, &lt;code&gt;Q5&lt;/code&gt;, &lt;code&gt;Q4&lt;/code&gt;, &lt;code&gt;Q3&lt;/code&gt;, &lt;code&gt;Q2&lt;/code&gt;: a lower number means lower resource use and a higher risk of quality loss.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-k_m--k_s-means&#34;&gt;What &lt;code&gt;K_M&lt;/code&gt; / &lt;code&gt;K_S&lt;/code&gt; means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;K_M&lt;/code&gt; and &lt;code&gt;K_S&lt;/code&gt; are mixed-precision K-quant variants (&lt;code&gt;M&lt;/code&gt; for medium, &lt;code&gt;S&lt;/code&gt; for small):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;most weights stay at the target quantization level&lt;/li&gt;
&lt;li&gt;important parts keep higher precision&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So at the same level, &lt;code&gt;Qx_K_M&lt;/code&gt; or &lt;code&gt;Qx_K_S&lt;/code&gt; is usually slightly better than plain &lt;code&gt;Qx&lt;/code&gt; (for example, &lt;code&gt;Q4_K_M&lt;/code&gt; versus &lt;code&gt;Q4_0&lt;/code&gt;).&lt;/p&gt;
&lt;h2 id=&#34;practical-picking-strategy&#34;&gt;Practical picking strategy
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;If hardware allows, start with &lt;code&gt;Q8&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If memory is tight, step down through &lt;code&gt;Q6&lt;/code&gt; / &lt;code&gt;Q5&lt;/code&gt; / &lt;code&gt;Q4&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Try not to go below &lt;code&gt;Q4&lt;/code&gt;; &lt;code&gt;Q4_K_M&lt;/code&gt; is a common lower bound.&lt;/li&gt;
&lt;li&gt;Below &lt;code&gt;Q4&lt;/code&gt;, quality degradation becomes increasingly visible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quality-order-best-to-worst&#34;&gt;Quality order (best to worst)
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Above this point, quality is effectively the same, but hardware requirements are extreme &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q8&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q6_K&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; This is the typical sweet spot &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;ndash; Below this point, quality loss becomes visible &amp;ndash;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q3_K_M&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q3_K_S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q2_K_S&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want one short rule: start with &lt;code&gt;Q8&lt;/code&gt; or &lt;code&gt;Q6_K&lt;/code&gt;, then move down to &lt;code&gt;Q5&lt;/code&gt; or &lt;code&gt;Q4_K_M&lt;/code&gt; only when needed.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization means compressing model parameters from higher-precision formats (such as &lt;code&gt;FP16&lt;/code&gt;) into lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
          &lt;td&gt;Low-spec devices, run-first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter scale.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
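&lt;p&gt;The naming rule can be illustrated with a few shell parameter expansions. This is a minimal sketch that assumes the tag always ends in a hyphen followed by the quantization suffix, as in the example above; &lt;code&gt;parse_tag&lt;/code&gt; is a hypothetical helper, not an Ollama or llama.cpp command.&lt;/p&gt;

```shell
# Hypothetical helper: split a tag such as gemma-4:4b-q4_k_m into its parts.
# Assumes the quantization suffix follows the last hyphen in the tag.
parse_tag() {
    tag="$1"
    model="${tag%-*}"      # everything before the last hyphen
    quant="${tag##*-}"     # everything after the last hyphen
    bits="${quant%%_*}"    # leading part such as q4 or q8
    case "$quant" in
        *_k_s) variant="K-quant, small" ;;
        *_k_m) variant="K-quant, medium" ;;
        *_k_l) variant="K-quant, large" ;;
        *_k)   variant="K-quant" ;;
        *)     variant="non-K format" ;;
    esac
    echo "model=$model bits=${bits#q} variant=$variant"
}

parse_tag "gemma-4:4b-q4_k_m"
```

&lt;p&gt;Tags that do not follow this layout would need extra handling, so treat the output as a reading aid rather than a validator.&lt;/p&gt;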
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
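&lt;p&gt;Treat these pairings as rough guides, because the right choice also depends on the model&amp;rsquo;s parameter count and context length. Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;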
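&lt;p&gt;A back-of-the-envelope check can help when reading the table: the weights take roughly the parameter count times the bits per weight, divided by 8. The sketch below assumes a hypothetical 7B-parameter model and ignores the KV cache and runtime overhead, so real memory use will be higher. &lt;code&gt;estimate_gb&lt;/code&gt; is an illustrative helper name.&lt;/p&gt;

```shell
# Rough estimate: weights_GB = params_in_billions * bits_per_weight / 8.
# Ignores the KV cache and runtime overhead, so real usage is higher.
estimate_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

estimate_gb 7 16     # assumed 7B model at FP16
estimate_gb 7 8.5    # Q8_0: 8-bit values plus a per-block scale, 8.5 bits per weight
estimate_gb 7 4.5    # plain 4-bit block format such as Q4_0; Q4_K_M is somewhat higher
```

&lt;p&gt;If the estimate is already close to your available RAM or VRAM, the next smaller quantization is usually the safer starting point.&lt;/p&gt;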
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality is not enough, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo;, but &amp;ldquo;the most stable and usable result under your hardware limits.&amp;rdquo;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
