<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>MoE on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/moe/</link>
        <description>Recent content in MoE on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 17 May 2026 08:53:29 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/moe/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LLM Architecture Evolution from 2023 to 2026: Tokenizers, Positional Encoding, Attention, MoE, Normalization, and Activation Functions</title>
        <link>https://knightli.com/en/2026/05/17/llm-architecture-evolution-2023-2026/</link>
        <pubDate>Sun, 17 May 2026 08:53:29 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/17/llm-architecture-evolution-2023-2026/</guid>
        <description>&lt;p&gt;From 2023 to 2026, LLM architecture seemed to change in many directions: tokenizers got larger, positional encoding shifted toward RoPE, attention evolved from MHA to GQA, sliding windows, and MLA, MoE became mainstream again, and normalization and activation functions moved toward combinations like RMSNorm and SwiGLU.&lt;/p&gt;
&lt;p&gt;But the main story is not that Transformer was overturned. The main story is that the Transformer core stayed in place, while almost every component around it was optimized for longer context, lower inference cost, higher training efficiency, and stronger multilingual capability.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-big-picture&#34;&gt;Start with the Big Picture
&lt;/h2&gt;&lt;p&gt;An LLM can be roughly divided into several parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tokenizer: turns text into tokens the model can understand.&lt;/li&gt;
&lt;li&gt;Positional encoding: tells the model where each token is in the sequence.&lt;/li&gt;
&lt;li&gt;Attention mechanism: decides which context each token should look at.&lt;/li&gt;
&lt;li&gt;Feed-forward network: applies more complex nonlinear transformations at each position.&lt;/li&gt;
&lt;li&gt;Normalization: keeps training more stable.&lt;/li&gt;
&lt;li&gt;Activation function: gives the network nonlinear expressive power.&lt;/li&gt;
&lt;li&gt;MoE: splits part of the feed-forward network into multiple experts and activates only a few at a time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The 2023-2026 evolution is basically these components being optimized one by one.&lt;/p&gt;
&lt;h2 id=&#34;tokenizers-from-can-split-text-to-uses-fewer-tokens&#34;&gt;Tokenizers: From “Can Split Text” to “Uses Fewer Tokens”
&lt;/h2&gt;&lt;p&gt;The tokenizer turns natural language into token sequences. The model does not see text directly; it sees token IDs.&lt;/p&gt;
&lt;p&gt;Earlier tokenizers were often more efficient for English and less efficient for Chinese, code, and multilingual text. If the same sentence is split into too many small pieces, it consumes more context window and increases both training and inference cost.&lt;/p&gt;
&lt;p&gt;One clear trend in recent years is larger vocabularies and better multilingual support. Llama 3 uses a 128K-token vocabulary, and Meta explicitly says this encodes language more efficiently and improves model performance. Qwen, DeepSeek, and other models also pay close attention to token efficiency for Chinese, code, and multilingual scenarios.&lt;/p&gt;
&lt;p&gt;For beginners, think of it this way: the better the tokenizer, the less fragmented the same text becomes, and the more useful information the model can fit into the same context length.&lt;/p&gt;
&lt;h2 id=&#34;positional-encoding-rope-became-mainstream&#34;&gt;Positional Encoding: RoPE Became Mainstream
&lt;/h2&gt;&lt;p&gt;Language has order. “Dog bites man” and “man bites dog” contain similar words, but the order changes the meaning. Positional encoding injects that order information into the model.&lt;/p&gt;
&lt;p&gt;Early Transformers used absolute positional encodings, where each position index had its own vector added to the token embedding. Later LLMs more often used RoPE, or Rotary Position Embedding. RoPE rotates the query and key vectors by an angle that depends on position, so the attention score between two tokens depends on their relative distance. This also makes it friendlier to long-context extension.&lt;/p&gt;
&lt;p&gt;From the Llama family to many open models, RoPE has become one of the de facto standards. To support longer context, models may also adjust the RoPE base frequency, apply RoPE scaling, or combine it with sliding-window or chunked attention.&lt;/p&gt;
&lt;p&gt;Simply put, RoPE does not make a model “suddenly smarter,” but it helps the model handle relative position relationships better in longer text.&lt;/p&gt;
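&lt;p&gt;The relative-position property can be checked with a small sketch. This is a minimal pure-Python illustration, not any model&#39;s actual implementation: the function names, the toy vectors, and the base of 10000 are assumptions chosen for the example.&lt;/p&gt;

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles that grow with position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; earlier pairs rotate faster.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Toy query and key vectors (made up for illustration).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# The same offset (3 positions apart) at two different absolute positions.
near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# The two scores should agree up to floating-point error.
print(near, far)
```

&lt;p&gt;Because the score depends only on the offset between the two positions, the same attention pattern can be reused anywhere in the sequence.&lt;/p&gt;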
&lt;h2 id=&#34;attention-from-mha-to-gqa-sliding-windows-and-mla&#34;&gt;Attention: From MHA to GQA, Sliding Windows, and MLA
&lt;/h2&gt;&lt;p&gt;Attention is the core of Transformer. It lets each token look at the most relevant tokens in the context for the current task.&lt;/p&gt;
&lt;p&gt;The classic version is MHA, or Multi-Head Attention. It has multiple attention heads, each learning a different way to focus. The problem is that as models and contexts grow, KV cache becomes expensive and inference cost rises.&lt;/p&gt;
&lt;p&gt;After 2023, the main direction of attention optimization was reducing inference cost.&lt;/p&gt;
&lt;p&gt;GQA, or Grouped-Query Attention, is an important step. It lets multiple query heads share fewer key/value heads, reducing KV cache pressure. Meta explicitly adopted GQA in Llama 3 to improve inference efficiency.&lt;/p&gt;
&lt;p&gt;Mistral 7B represents another direction: sliding-window attention. Instead of letting every token attend to the entire history, each layer restricts attention to a fixed window of recent tokens, which reduces compute and cache pressure on long sequences. Information from further back can still propagate indirectly through stacked layers, and for many tasks local context already carries much of the useful information.&lt;/p&gt;
&lt;p&gt;DeepSeek-V2/V3 pushed attention optimization further with MLA, or Multi-head Latent Attention. Its focus is compressing KV cache to reduce inference memory pressure. The DeepSeek-V3 technical report lists MLA and DeepSeekMoE as core architectural features.&lt;/p&gt;
&lt;p&gt;You can understand these methods together:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MHA: the classic approach, strong but expensive.&lt;/li&gt;
&lt;li&gt;GQA: greatly reduces KV cache cost with little loss in expressiveness.&lt;/li&gt;
&lt;li&gt;Sliding-window attention: reduces full-attention cost in long context.&lt;/li&gt;
&lt;li&gt;MLA: further compresses attention cache for efficient inference.&lt;/li&gt;
&lt;/ul&gt;
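&lt;p&gt;To see why fewer key/value heads matter, here is a rough KV cache size estimate. The layer count, head counts, head dimension, context length, and 16-bit cache are illustrative assumptions, not figures taken from any model card, and the helper name is hypothetical.&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed shape: 32 layers, head_dim 128, 8K context, 16-bit cache values.
# MHA keeps one KV head per query head (32); GQA shares 8 KV heads among them.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.1f} GiB per sequence")  # MHA: 4.0 GiB per sequence
print(f"GQA: {gqa / 2**30:.1f} GiB per sequence")  # GQA: 1.0 GiB per sequence
print(f"Reduction: {mha // gqa}x")                 # Reduction: 4x
```

&lt;p&gt;The cache grows linearly with context length and with the number of concurrent sequences, which is why this term dominates memory in long-context serving.&lt;/p&gt;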
&lt;h2 id=&#34;moe-many-parameters-but-only-some-are-used-each-time&#34;&gt;MoE: Many Parameters, but Only Some Are Used Each Time
&lt;/h2&gt;&lt;p&gt;MoE means Mixture of Experts.&lt;/p&gt;
&lt;p&gt;A normal dense model activates all parameters for every token. MoE puts many experts inside the model, but routes each token to only a few experts. This allows the total parameter count to be large while the number of active parameters per inference step stays smaller.&lt;/p&gt;
&lt;p&gt;Mixtral 8x7B, released at the end of 2023, was an important moment that brought MoE back into broad attention. Mistral’s paper explains that Mixtral 8x7B largely follows the Mistral 7B architecture, but replaces each feed-forward block with 8 experts and uses a sparse router that selects 2 of them for each token at each layer.&lt;/p&gt;
&lt;p&gt;DeepSeek-V3 later made MoE a core route. Its technical report describes 671B total parameters with about 37B activated for each token, using DeepSeekMoE to reduce training and inference cost. Qwen3 and other model families also provide both dense and MoE variants, showing that MoE has moved from a research trick to a mainstream engineering option.&lt;/p&gt;
&lt;p&gt;For beginners, a dense model is like a company where everyone attends every meeting. MoE is like dividing the company into expert teams and calling only the most relevant teams for each problem.&lt;/p&gt;
&lt;p&gt;MoE also has clear difficulties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The router must learn to send tokens to suitable experts.&lt;/li&gt;
&lt;li&gt;Expert load must be balanced, so not all tokens crowd into a few experts.&lt;/li&gt;
&lt;li&gt;Distributed training and inference become more complex.&lt;/li&gt;
&lt;li&gt;Large total parameters do not automatically make deployment cheap.&lt;/li&gt;
&lt;/ul&gt;
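&lt;p&gt;The routing idea can be sketched in a few lines. This is a toy top-2 router in pure Python, loosely following the select-then-softmax pattern described for Mixtral; the function names, the logits, and the scalar experts are all made up for illustration, and load balancing is left out entirely.&lt;/p&gt;

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(x, router_logits, experts, k=2):
    # Only the selected experts run; the others cost no compute for this token.
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Eight toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]

print(route_top_k(logits))             # experts 1 and 4 are selected
print(moe_layer(10.0, logits, experts))
```

&lt;p&gt;All eight experts still have to exist in memory, even though only two of them do any work for this token. That gap between stored and active parameters is the central tradeoff of MoE.&lt;/p&gt;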
&lt;h2 id=&#34;normalization-rmsnorm-became-common&#34;&gt;Normalization: RMSNorm Became Common
&lt;/h2&gt;&lt;p&gt;Normalization stabilizes the distribution of intermediate values inside the neural network. When training large models, unstable values make convergence harder and training less reliable.&lt;/p&gt;
&lt;p&gt;Early Transformers commonly used LayerNorm. Many Llama-style models later switched to RMSNorm. RMSNorm is simpler than LayerNorm: it does not compute the mean and focuses on root-mean-square scaling. It is lighter and stable enough in practice.&lt;/p&gt;
&lt;p&gt;You do not need to memorize the formula. Just remember that RMSNorm is a lighter stabilizer. It does not determine model capability by itself, but it affects training stability, speed, and engineering implementation.&lt;/p&gt;
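&lt;p&gt;For readers who do want to see the difference, here is a minimal sketch of both normalizations on a plain list of numbers. The learned scale and bias parameters that real layers carry are omitted, and the function names are illustrative.&lt;/p&gt;

```python
import math

def layer_norm(x, eps=1e-6):
    # Subtract the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    # No mean subtraction: divide by the root mean square only.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [2.0, 4.0, 6.0, 8.0]
print(layer_norm(x))  # centered around zero
print(rms_norm(x))    # rescaled, but all values keep their sign
```

&lt;p&gt;RMSNorm skips one reduction (the mean) per call, which is a small saving that adds up across many layers and tokens.&lt;/p&gt;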
&lt;h2 id=&#34;activation-functions-from-relugelu-to-swiglu&#34;&gt;Activation Functions: From ReLU/GELU to SwiGLU
&lt;/h2&gt;&lt;p&gt;Activation functions add nonlinear capability to neural networks. Without them, a stack of linear layers would collapse into a single linear transformation, no matter how deep it is.&lt;/p&gt;
&lt;p&gt;The original Transformer used ReLU, and GPT- and BERT-style models popularized GELU. In modern LLMs such as Llama, Mistral, Qwen, and DeepSeek, SwiGLU or similar GLU variants are more common. SwiGLU usually appears inside the feed-forward network and controls information flow through a gating mechanism.&lt;/p&gt;
&lt;p&gt;A rough analogy: a normal activation function is like a fixed switch, while SwiGLU is more like a learnable valve. It does not just decide whether information passes through; it can learn which information should be amplified.&lt;/p&gt;
&lt;p&gt;SwiGLU makes the feed-forward layer slightly more complex, but in large-model practice it has become a common high-performance component.&lt;/p&gt;
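&lt;p&gt;The gating idea is easier to see in code. Below is a minimal pure-Python sketch of a SwiGLU feed-forward block with three weight matrices; the tiny weights and the names &lt;code&gt;w_gate&lt;/code&gt;, &lt;code&gt;w_up&lt;/code&gt;, and &lt;code&gt;w_down&lt;/code&gt; are assumptions for illustration, not values from any real model.&lt;/p&gt;

```python
import math

def silu(v):
    # SiLU (also called Swish): v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]  # learned, input-dependent gate
    up = matvec(w_up, x)                         # candidate features
    hidden = [g * u for g, u in zip(gate, up)]   # the gate scales each feature
    return matvec(w_down, hidden)                # project back to model width

# Tiny made-up weights: model width 2, hidden width 3.
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.8], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

print(swiglu_ffn(x, w_gate, w_up, w_down))
```

&lt;p&gt;Compared with a plain two-matrix feed-forward layer, the extra gate projection is what makes the block slightly more complex.&lt;/p&gt;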
&lt;h2 id=&#34;the-overall-trend-from-2023-to-2026&#34;&gt;The Overall Trend from 2023 to 2026
&lt;/h2&gt;&lt;p&gt;The timeline can be summarized like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2023: Llama, Mistral 7B, Mixtral, and other open models popularized combinations such as RoPE, RMSNorm, SwiGLU, GQA, sliding-window attention, and MoE.&lt;/li&gt;
&lt;li&gt;2024: Llama 3, Qwen2.5, DeepSeek-V2/V3, and others expanded vocabularies, improved long context, strengthened inference efficiency, and made MoE and efficient attention central topics.&lt;/li&gt;
&lt;li&gt;2025: DeepSeek-V3/R1 made more people pay attention to MLA, DeepSeekMoE, FP8, MTP, and the deep connection between architecture optimization and system engineering.&lt;/li&gt;
&lt;li&gt;2026: The trend remains efficiency and engineering maturity: dense models continue to pursue stable general capability, MoE models expand capacity, and efficient attention reduces long-context cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most important change was not one component replacing Transformer. It was the realization that adding parameters alone is not enough: architecture, data, training systems, and inference services must be optimized together.&lt;/p&gt;
&lt;h2 id=&#34;how-beginners-should-learn-this&#34;&gt;How Beginners Should Learn This
&lt;/h2&gt;&lt;p&gt;If you are starting from zero, do not begin by forcing yourself through every paper. A better order is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Understand the basic Transformer structure: tokens, embeddings, attention, and FFN.&lt;/li&gt;
&lt;li&gt;Learn why RoPE, RMSNorm, and SwiGLU became common.&lt;/li&gt;
&lt;li&gt;Study GQA and KV cache to understand why inference consumes so much memory.&lt;/li&gt;
&lt;li&gt;Learn MoE, focusing on the difference between total parameters and active parameters.&lt;/li&gt;
&lt;li&gt;Finally, read model reports such as DeepSeek-V3, Mixtral, and Llama 3 to place these components back into real models.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do not treat these terms as isolated facts. Most of them answer the same question: how can models become stronger while remaining trainable, deployable, and fast enough to serve?&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The 2023-2026 evolution of LLM architecture can be seen as the engineering maturation of Transformer. Tokenizers reduce token waste, RoPE represents position more effectively, GQA, sliding-window attention, and MLA reduce attention cost, MoE expands capacity while controlling active computation, and RMSNorm plus SwiGLU make training and representation more stable and efficient.&lt;/p&gt;
&lt;p&gt;For beginners, the key is not memorizing terms. The key is understanding the main tradeoff: almost every modern LLM architecture change is about cost, efficiency, context length, and scalability.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://ai.meta.com/blog/meta-llama-3/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Meta: Introducing Meta Llama 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://mistral.ai/en/news/mixtral-of-experts&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Mistral AI: Mixtral of experts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2401.04088&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;arXiv: Mixtral of Experts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2412.19437&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;arXiv: DeepSeek-V3 Technical Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V3&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face: DeepSeek-V3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions</title>
        <link>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:55:25 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;DeepSeek V4 and Gemma 4 are not in the same class for local deployment.
With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.&lt;/p&gt;
&lt;p&gt;The official DeepSeek V4 Preview release mainly includes two inference models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;: &lt;code&gt;1.6T total / 49B active params&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;: &lt;code&gt;284B total / 13B active params&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The official Hugging Face collection also includes two Base models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article only discusses rough VRAM requirements when the full model weights are loaded.
For MoE models, &lt;code&gt;active params&lt;/code&gt; mainly affects per-token compute. It does not mean only those parameters need to be loaded.
Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.&lt;/p&gt;
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM Scale&lt;/th&gt;
          &lt;th&gt;What Is Realistic&lt;/th&gt;
          &lt;th&gt;Do Not Expect&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;Cannot fully run DeepSeek V4; use smaller distilled models or API&lt;/td&gt;
          &lt;td&gt;Full V4-Flash / V4-Pro local loading&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;Still not suitable for full loading; good for small models or remote API clients&lt;/td&gt;
          &lt;td&gt;Stable V4-Flash Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB&lt;/td&gt;
          &lt;td&gt;Theoretically try V4-Flash Q2/Q3 or heavy offload&lt;/td&gt;
          &lt;td&gt;V4-Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;128GB&lt;/td&gt;
          &lt;td&gt;V4-Flash Q4 becomes more realistic; Q5/Q6 still tight&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;192GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8/Q6 is more comfortable; Pro Q2 (about 216GB of weights) is experimental and needs offload&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;256GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8 is fairly comfortable; Pro Q2 can be tested, Pro Q3 (about 324GB of weights) only with offload&lt;/td&gt;
          &lt;td&gt;V4-Pro Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;512GB&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4 starts to become feasible&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;1TB+&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8 and low-bit Pro-Base are more realistic&lt;/td&gt;
          &lt;td&gt;Low-cost single-machine deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2TB+&lt;/td&gt;
          &lt;td&gt;Pro-Base FP8 class&lt;/td&gt;
          &lt;td&gt;Ordinary workstation deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target.
More realistic options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the official DeepSeek API or compatible services.&lt;/li&gt;
&lt;li&gt;Wait for stable community GGUF/EXL2/MLX quantizations and inference support.&lt;/li&gt;
&lt;li&gt;Use smaller DeepSeek distilled models.&lt;/li&gt;
&lt;li&gt;Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following figures come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They reflect current public weight file sizes, not full runtime VRAM use under long context.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Scale&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official Weight Size&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total / 13B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td&gt;Inference model, smallest in this group&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total / 49B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td&gt;Inference model, stronger but enormous&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td&gt;Base model, closer to full FP8 weight size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td&gt;Base model, about 1.6TB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even the smallest &lt;code&gt;V4-Flash&lt;/code&gt; is already close to 160GB of official weights.
That is why it should not be treated like a 13B model just because it has &lt;code&gt;13B active params&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-flash-vram-estimate&#34;&gt;DeepSeek V4 Flash VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Flash&lt;/code&gt; is the most approachable DeepSeek V4 variant for local experiments.
But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.&lt;/p&gt;
&lt;p&gt;The table below uses the official 159.61GB weight size as the baseline.
Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Multi-GPU servers, inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;120GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td&gt;Quality-first quantization tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;More realistic starting point for Flash&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Large-VRAM single GPU or multi-GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit experiments with clear quality risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even if mature &lt;code&gt;V4-Flash Q4&lt;/code&gt; builds appear later, they probably will not fit on a 24GB GPU.
A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.&lt;/p&gt;
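&lt;p&gt;The estimated weight sizes in the table appear to follow a simple rule: scale the official file size by the target bit width over 8. The sketch below reproduces that arithmetic under this assumption; the function name is hypothetical, and it covers weights only, not KV cache or runtime overhead.&lt;/p&gt;

```python
FLASH_OFFICIAL_GB = 159.61  # official V4-Flash weight size quoted above

def estimated_weight_gb(official_gb, bits):
    # Assumption: the official file is treated as the 8-bit baseline.
    return official_gb * bits / 8

for bits in (6, 5, 4, 3, 2):
    size = estimated_weight_gb(FLASH_OFFICIAL_GB, bits)
    print(f"Q{bits}: about {size:.0f}GB")
```

&lt;p&gt;Real quantization formats usually store extra scale factors and keep some tensors at higher precision, so actual files tend to be somewhat larger than this linear estimate.&lt;/p&gt;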
&lt;h2 id=&#34;deepseek-v4-pro-vram-estimate&#34;&gt;DeepSeek V4 Pro VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro&lt;/code&gt; is the flagship inference model, with official weights around 864.70GB.
Even at 4-bit quantization, the full weights remain in the hundreds of GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB+&lt;/td&gt;
          &lt;td&gt;Multi-node or multi-GPU inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;648GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;High-quality quantized service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;540GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td&gt;Quality/cost balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;432GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Lowest practical quality line for Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;324GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;216GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments with high quality and stability risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For individual users, &lt;code&gt;V4-Pro&lt;/code&gt; is better consumed through an API.
If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-flash-base-vram-estimate&#34;&gt;DeepSeek V4 Flash-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment.
&lt;code&gt;V4-Flash-Base&lt;/code&gt; has official weights of about 294.67GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Research, preprocessing, evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;221GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;184GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td&gt;Lower-cost Base experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;111GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you only want to use DeepSeek V4 capabilities, do not start with the Base model.
Base models cost more to deploy and tune; most applications should use the inference model or API.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-pro-base-vram-estimate&#34;&gt;DeepSeek V4 Pro-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro-Base&lt;/code&gt; is the heaviest variant, with official weights around 1606.03GB.
That is already a 1.6TB-class model file.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.4TB+&lt;/td&gt;
          &lt;td&gt;Large-scale research clusters&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1205GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1004GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td&gt;Research and evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;803GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td&gt;Low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;602GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;402GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;“Can a home GPU run it?” is the wrong question for a model of this size.
Even Q4 needs about 1TB of VRAM, which is beyond the comfortable range of most single-machine workstations.&lt;/p&gt;
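&lt;p&gt;The quantized rows in the table appear to follow a simple bit-width scaling of the official FP8 file: size × bits / 8. Below is a minimal Python sketch of that arithmetic. It assumes the FP8 file stores roughly 8 bits per weight and ignores quantization metadata such as scales and zero points, so real quantized files may come out larger. The function name &lt;code&gt;estimate_weight_gb&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
# Official FP8 weight size for V4-Pro-Base, taken from the table above (GB).
FP8_WEIGHTS_GB = 1606.03


def estimate_weight_gb(fp8_size_gb, bits):
    """Scale an 8-bit weight file to a lower bit width.

    Assumption: size is linear in bits per weight, with no extra
    overhead for quantization scales or unquantized layers.
    """
    return fp8_size_gb * bits / 8


# Hypothetical helper output: one estimate per quantization level in the table.
estimates = {
    f"Q{bits}": estimate_weight_gb(FP8_WEIGHTS_GB, bits)
    for bits in (6, 5, 4, 3, 2)
}

for name, size_gb in estimates.items():
    print(f"{name}: about {round(size_gb)} GB")
```

&lt;p&gt;Rounded to the nearest GB, these values match the Estimated Weight Size column. The Minimum VRAM and Safer VRAM columns add headroom on top of the weights, which this sketch does not model.&lt;/p&gt;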
&lt;h2 id=&#34;why-active-params-are-not-enough&#34;&gt;Why Active Params Are Not Enough
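&lt;/h2&gt;&lt;p&gt;DeepSeek V4 is an MoE model.
In an MoE model, each token activates only a subset of the experts, so per-token compute is much lower than the total parameter count suggests.
But this does not mean VRAM only needs to hold the active parameters: any expert may be selected for the next token, so every expert's weights must be reachable quickly.&lt;/p&gt;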
&lt;p&gt;Full local inference also depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether all expert weights must stay resident on GPU.&lt;/li&gt;
&lt;li&gt;Whether on-demand expert loading is supported.&lt;/li&gt;
&lt;li&gt;CPU memory to GPU memory transfer costs.&lt;/li&gt;
&lt;li&gt;NVMe offload latency.&lt;/li&gt;
&lt;li&gt;KV cache growth under long context.&lt;/li&gt;
&lt;li&gt;Extra runtime overhead under 1M context.&lt;/li&gt;
&lt;li&gt;Multi-node and multi-GPU communication cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;V4-Pro&lt;/code&gt; with &lt;code&gt;49B active&lt;/code&gt; should not be deployed like a dense 49B model.
&lt;code&gt;V4-Flash&lt;/code&gt; with &lt;code&gt;13B active&lt;/code&gt; should not be treated like a small 13B model either.&lt;/p&gt;
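&lt;p&gt;To make the gap concrete, the rough sketch below compares the memory of the active weights alone with the full weight file. It assumes 8 bits per weight and decimal GB, and it pairs the &lt;code&gt;49B active&lt;/code&gt; figure with the 1606.03GB &lt;code&gt;V4-Pro-Base&lt;/code&gt; file, on the assumption that both describe the same architecture. The helper &lt;code&gt;weights_gb&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def weights_gb(params_billion, bits_per_param=8):
    """Approximate weight memory in decimal GB: params * bits / 8 bytes.

    Assumption: every parameter is stored at the same bit width.
    """
    return params_billion * bits_per_param / 8


# Active parameters per token, as quoted in the text.
pro_active_gb = weights_gb(49)    # V4-Pro: 49B active
flash_active_gb = weights_gb(13)  # V4-Flash: 13B active

# Full official weight file for V4-Pro-Base (GB), from the table above.
PRO_BASE_FULL_GB = 1606.03

# Share of the full file that a single token's active weights represent.
active_share = pro_active_gb / PRO_BASE_FULL_GB

print(f"V4-Pro active weights: about {pro_active_gb:.0f} GB")
print(f"V4-Flash active weights: about {flash_active_gb:.0f} GB")
print(f"Active share of the full file: about {active_share * 100:.1f}%")
```

&lt;p&gt;Under these assumptions, the active weights are only a few percent of the file. The rest still has to sit in VRAM, CPU memory, or on NVMe, which is why the items in the list above dominate deployment planning.&lt;/p&gt;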
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you are an ordinary individual user:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not try to fully self-host DeepSeek V4.&lt;/li&gt;
&lt;li&gt;Use the official API when you need DeepSeek V4 capabilities.&lt;/li&gt;
&lt;li&gt;For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.&lt;/li&gt;
&lt;li&gt;With only 24GB to 48GB of VRAM, quantized 7B, 14B, 32B, or 70B models are more practical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 128GB to 256GB total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watch for stable community implementations of &lt;code&gt;V4-Flash Q4/Q5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not treat &lt;code&gt;V4-Pro&lt;/code&gt; as your main local model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 512GB+ total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V4-Pro Q4&lt;/code&gt; starts to become an engineering validation target.&lt;/li&gt;
&lt;li&gt;You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key question for DeepSeek V4 local deployment is not “which quantized file should I download?”
It is “do I have the system-level inference capacity for this model?”
DeepSeek V4 is closer to a server model than a desktop model.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://api-docs.deepseek.com/news/news260424&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek V4 Preview Release - DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/deepseek-ai/deepseek-v4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-V4 collection - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
