<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>DeepSeek-V4 on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/deepseek-v4/</link>
        <description>Recent content in DeepSeek-V4 on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 18 May 2026 18:38:26 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/deepseek-v4/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>DeepSeek-V4 KV Cache Explained: Why 1M Context Uses Less VRAM</title>
        <link>https://knightli.com/en/2026/05/18/deepseek-v4-kv-cache-compressed-attention/</link>
        <pubDate>Mon, 18 May 2026 18:38:26 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/18/deepseek-v4-kv-cache-compressed-attention/</guid>
        <description>&lt;p&gt;The real cost of long-context models is often not whether they can accept one million tokens, but how much VRAM the KV Cache consumes during inference.&lt;/p&gt;
&lt;p&gt;During Transformer decoding, every newly generated token needs access to the Key and Value states of previous tokens. The longer the context, the larger the KV Cache. A larger KV Cache puts pressure on VRAM, memory bandwidth, time to first token, and throughput.&lt;/p&gt;
&lt;p&gt;DeepSeek-V4 is interesting because it does not only reduce cache along the attention-head dimension. It pushes compression into the sequence-length dimension. According to Hugging Face&amp;rsquo;s discussion of DeepSeek-V4, in a 1M-token setting, DeepSeek-V4-Pro&amp;rsquo;s KV Cache is about 10% the size of DeepSeek-V3.2&amp;rsquo;s, and about 2% the size of a common bf16 GQA architecture&amp;rsquo;s.&lt;/p&gt;
&lt;p&gt;That is the key difference: DeepSeek-V4 does not merely store each KV entry in a smaller format. It reduces the number of KV entries that must be kept and searched over long history.&lt;/p&gt;
&lt;h2 id=&#34;several-generations-of-kv-cache-optimization&#34;&gt;Several generations of KV Cache optimization
&lt;/h2&gt;&lt;p&gt;KV Cache optimization has evolved through several routes.&lt;/p&gt;
&lt;p&gt;The first is traditional MHA, or Multi-Head Attention. Each Query head has its own Key/Value head. The structure is direct, but the cache grows linearly with sequence length, so VRAM pressure becomes heavy under long context.&lt;/p&gt;
&lt;p&gt;The second is GQA, or Grouped Query Attention. Multiple Query heads share fewer Key/Value heads. Many modern models such as LLaMA, Mistral, and Qwen use similar ideas. It significantly reduces KV head count and is now a common long-context optimization.&lt;/p&gt;
&lt;p&gt;The third is MLA, or Multi-head Latent Attention. DeepSeek-V2 and DeepSeek-V3 use this route, compressing Key/Value into low-rank latent representations and further reducing cache along the attention-head dimension.&lt;/p&gt;
&lt;p&gt;The fourth is DeepSeek-V4&amp;rsquo;s hybrid compressed attention. It focuses on sequence length: instead of only reducing how much KV each token stores, it compresses multiple historical tokens into fewer KV entries and retrieves them through sparse or dense attention.&lt;/p&gt;
&lt;p&gt;Roughly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MHA: every head remembers separately.&lt;/li&gt;
&lt;li&gt;GQA: multiple Query heads share memory.&lt;/li&gt;
&lt;li&gt;MLA: each token&amp;rsquo;s KV representation is compressed into a latent vector.&lt;/li&gt;
&lt;li&gt;DeepSeek-V4: many historical tokens are aggregated into fewer compressed memory blocks.&lt;/li&gt;
&lt;/ul&gt;
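&lt;p&gt;To make the head-dimension savings concrete, here is a minimal sketch of the standard per-token KV Cache arithmetic for MHA and GQA. The model shape (32 layers, 32 Query heads, head dimension 128, 8 shared KV heads) is hypothetical and chosen only for illustration; it is not a DeepSeek configuration.&lt;/p&gt;

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every token, layer, and KV head (bf16 = 2 bytes)."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical model shape, used only to illustrate the scaling.
LAYERS, HEAD_DIM = 32, 128
SEQ_LEN = 1_000_000

mha = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads=8, head_dim=HEAD_DIM)

print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB")
```

&lt;p&gt;Sharing KV heads cuts the cache by the sharing factor, but both results still scale linearly with &lt;code&gt;seq_len&lt;/code&gt;, which is the term sequence compression targets.&lt;/p&gt;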
&lt;h2 id=&#34;key-change-from-head-compression-to-sequence-compression&#34;&gt;Key change: from head compression to sequence compression
&lt;/h2&gt;&lt;p&gt;GQA and MLA mainly optimize how much KV each token stores. That works well, but when context reaches 1M tokens, the token count itself becomes the problem.&lt;/p&gt;
&lt;p&gt;DeepSeek-V4 compresses old context into blocks. The model does not necessarily preserve full KV for every distant token. Instead, multiple tokens form compressed entries.&lt;/p&gt;
&lt;p&gt;It is a bit like reading a very long book: you remember recent pages in detail, while earlier chapters are stored more as summaries, themes, and key clues. DeepSeek-V4&amp;rsquo;s attention design follows a similar split: keep detail nearby, use compressed representation farther away.&lt;/p&gt;
&lt;h2 id=&#34;csa-4x-compression-plus-sparse-retrieval&#34;&gt;CSA: 4x compression plus sparse retrieval
&lt;/h2&gt;&lt;p&gt;CSA stands for Compressed Sparse Attention. It is the finer-grained long-context compression mechanism.&lt;/p&gt;
&lt;p&gt;In CSA, the model compresses neighboring tokens into fewer KV entries. The Hugging Face Transformers documentation gives a default compression ratio of &lt;code&gt;m=4&lt;/code&gt;, meaning roughly every four tokens become one compressed entry.&lt;/p&gt;
&lt;p&gt;But it is not simple averaging. CSA uses a learned compression pool and overlapping windows so the model can preserve more useful information. After compression, the query does not attend to all compressed blocks directly. It first uses a Lightning Indexer to score them, selects the most relevant top-k compressed blocks, and then performs the core attention computation.&lt;/p&gt;
&lt;p&gt;This gives two benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The number of historical KV entries becomes smaller.&lt;/li&gt;
&lt;li&gt;Each query only looks at a relevant subset of compressed blocks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;CSA is suitable for long-range context where details still matter, such as codebases, long documents, and tool-call histories.&lt;/p&gt;
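&lt;p&gt;The compress-then-retrieve flow can be sketched as a toy in plain Python. This is not the DeepSeek-V4 implementation: mean pooling stands in for the learned compression pool, a dot product stands in for the Lightning Indexer, overlapping windows are omitted, and every function name here is hypothetical.&lt;/p&gt;

```python
import math

def compress(vectors, ratio):
    """Mean-pool each run of `ratio` vectors into one entry (stand-in for a learned pool)."""
    blocks = []
    for start in range(0, len(vectors), ratio):
        group = vectors[start:start + ratio]
        dim = len(group[0])
        blocks.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return blocks

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(query, blocks, k):
    """Score every block against the query and keep the k best (stand-in for the indexer)."""
    ranked = sorted(range(len(blocks)), key=lambda i: dot(query, blocks[i]), reverse=True)
    return sorted(ranked[:k])

def attend(query, keys, values):
    """Plain softmax attention over the selected entries only."""
    scores = [dot(query, key) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

# 16 toy token vectors: tokens 0-3 point along axis 0, tokens 4-7 along axis 1, and so on.
tokens = [[1.0 if i == t // 4 else 0.0 for i in range(4)] for t in range(16)]

blocks = compress(tokens, ratio=4)          # 16 tokens become 4 compressed entries
query = [0.0, 0.2, 1.0, 0.0]
selected = top_k_indices(query, blocks, k=2)
chosen = [blocks[i] for i in selected]      # toy: the same entries act as keys and values
output = attend(query, chosen, chosen)

print(len(blocks), selected)
```

&lt;p&gt;The query touches only the two selected entries rather than all 16 original tokens, which is the combination of a shorter sequence and sparse selection described above.&lt;/p&gt;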
&lt;h2 id=&#34;hca-128x-compression-plus-dense-attention&#34;&gt;HCA: 128x compression plus dense attention
&lt;/h2&gt;&lt;p&gt;HCA stands for Heavily Compressed Attention, and it is more aggressive.&lt;/p&gt;
&lt;p&gt;The Transformers documentation gives a default compression ratio of &lt;code&gt;m&#39;=128&lt;/code&gt;. HCA compresses a much longer context span into one compressed entry. Because the compressed sequence becomes very short, it does not need sparse top-k retrieval like CSA. The query can simply perform dense attention over all HCA compressed entries.&lt;/p&gt;
&lt;p&gt;HCA acts more like a global summary. It does not try to preserve every detail. Instead, it covers very long history at extremely low cost, helping the model stay aware of global context, long-range topics, and far-away information.&lt;/p&gt;
&lt;p&gt;If CSA is &amp;ldquo;searchable compressed notes,&amp;rdquo; HCA is closer to a &amp;ldquo;global table of contents and summary.&amp;rdquo;&lt;/p&gt;
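&lt;p&gt;A quick back-of-the-envelope count shows why HCA can afford dense attention. The sketch below uses the default ratios quoted from the Transformers documentation and, as a simplifying assumption, ignores overlapping windows and the uncompressed sliding-window branch.&lt;/p&gt;

```python
import math

def compressed_entries(context_tokens, ratio):
    """Number of compressed KV entries when `ratio` tokens fold into one entry."""
    return math.ceil(context_tokens / ratio)

CONTEXT = 1_000_000   # 1M-token history
CSA_RATIO = 4         # default CSA compression ratio
HCA_RATIO = 128       # default HCA compression ratio

csa_entries = compressed_entries(CONTEXT, CSA_RATIO)
hca_entries = compressed_entries(CONTEXT, HCA_RATIO)

print(csa_entries, hca_entries)  # prints: 250000 7813
```

&lt;p&gt;Under these assumptions a CSA layer still faces a quarter of a million entries, which is why it selects a top-k subset, while an HCA layer faces fewer than eight thousand and can attend to all of them.&lt;/p&gt;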
&lt;h2 id=&#34;sliding-window-recent-context-keeps-details&#34;&gt;Sliding window: recent context keeps details
&lt;/h2&gt;&lt;p&gt;DeepSeek-V4 does not compress everything.&lt;/p&gt;
&lt;p&gt;In addition to CSA and HCA, it keeps a sliding-window branch for the most recent uncompressed context. The Transformers documentation notes that DeepSeek-V4 attention blocks concatenate long-range compressed branches with sliding-window K/V.&lt;/p&gt;
&lt;p&gt;This matters. When generating the next token, the nearest context is often the most important: variable names, function signatures, the current sentence, fresh tool outputs, or the user&amp;rsquo;s latest instruction. If recent context were over-compressed, output quality would suffer.&lt;/p&gt;
&lt;p&gt;So the design is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nearby context: preserve uncompressed details.&lt;/li&gt;
&lt;li&gt;Mid-to-long context: use CSA for searchable compression.&lt;/li&gt;
&lt;li&gt;Farther context: use HCA for heavily compressed global summary.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;hybrid-layer-stack-different-layers-use-different-attention&#34;&gt;Hybrid layer stack: different layers use different attention
&lt;/h2&gt;&lt;p&gt;DeepSeek-V4 does not use one attention mechanism in every layer.&lt;/p&gt;
&lt;p&gt;The Hugging Face DeepSeek-V4 article notes that V4-Pro&amp;rsquo;s 61-layer structure uses HCA in the first two layers, alternates CSA and HCA afterward, and uses a sliding-window MTP block at the end. The Transformers documentation also describes V4-Pro as using two HCA bootstrap layers followed by alternating CSA/HCA layers.&lt;/p&gt;
&lt;p&gt;This shows that DeepSeek-V4 treats attention as a layered system. Different layers handle different information roles: some favor global compression, some favor sparse retrieval, and some preserve local windows.&lt;/p&gt;
&lt;p&gt;Compared with using one attention type everywhere, this hybrid structure is more complex but better suited to 1M-token context.&lt;/p&gt;
&lt;h2 id=&#34;fp8-and-fp4-further-reduce-cache-cost&#34;&gt;FP8 and FP4 further reduce cache cost
&lt;/h2&gt;&lt;p&gt;DeepSeek-V4&amp;rsquo;s savings do not come only from compression ratio.&lt;/p&gt;
&lt;p&gt;The Hugging Face article notes that most KV entries in V4 use FP8 storage, RoPE-related dimensions remain BF16, and the Lightning Indexer in CSA uses FP4. Compression ratio, low-precision storage, and sparse retrieval together create very low KV Cache usage.&lt;/p&gt;
&lt;p&gt;This is a reminder: do not only look at the headline context length. Deployment feasibility is determined by VRAM usage, bandwidth pressure, latency, and implementation quality under long context.&lt;/p&gt;
&lt;h2 id=&#34;differences-from-other-models&#34;&gt;Differences from other models
&lt;/h2&gt;&lt;p&gt;Compared with traditional MHA, DeepSeek-V4 no longer keeps full attention memory for every token in long history, so cache pressure drops sharply.&lt;/p&gt;
&lt;p&gt;Compared with GQA, DeepSeek-V4 does not merely reduce the number of KV heads. It also reduces the number of KV entries for long history. GQA still accumulates cache linearly with sequence length; V4 compresses distant context into blocks.&lt;/p&gt;
&lt;p&gt;Compared with DeepSeek-V3&amp;rsquo;s MLA, V4 extends optimization from &amp;ldquo;making each token representation more compact&amp;rdquo; to &amp;ldquo;compressing the number of historical token entries.&amp;rdquo; MLA already lowers per-token KV cost significantly, but under million-token context, sequence length remains a bottleneck.&lt;/p&gt;
&lt;p&gt;Compared with ordinary sparse attention, CSA compresses first and then performs sparse retrieval over a shorter compressed sequence. HCA goes further, using 128x compression so dense attention becomes cheap.&lt;/p&gt;
&lt;h2 id=&#34;what-it-means-for-agents-and-long-tasks&#34;&gt;What it means for agents and long tasks
&lt;/h2&gt;&lt;p&gt;Agent workflows are especially hungry for long context. They read files, call tools, receive tool results, generate plans, revise plans, and call tools again. The longer the context, the more likely KV Cache becomes the bottleneck.&lt;/p&gt;
&lt;p&gt;DeepSeek-V4&amp;rsquo;s cache design may help in several ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Easier handling of long codebases, long documents, and multi-round tool histories.&lt;/li&gt;
&lt;li&gt;Less KV Cache pressure on time to first token and throughput.&lt;/li&gt;
&lt;li&gt;Longer context or more concurrent requests on the same hardware.&lt;/li&gt;
&lt;li&gt;Million-token context that is closer to practical deployment, not just a benchmark number.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But compressed attention is not free. Compressing historical tokens into blocks involves information trade-offs. The model must balance saving VRAM with preserving retrievable details. Real performance depends on the task: code navigation, legal documents, long-form QA, and agent toolchains all have different detail-recall needs.&lt;/p&gt;
&lt;h2 id=&#34;do-not-read-2-as-2-of-all-cost&#34;&gt;Do not read 2% as 2% of all cost
&lt;/h2&gt;&lt;p&gt;&amp;ldquo;KV Cache is about 2% of GQA&amp;rdquo; is easy to misread.&lt;/p&gt;
&lt;p&gt;It mainly refers to KV Cache memory size. It does not mean total inference cost drops to 2%, or that every scenario becomes 50x faster. Inference still includes model weight reads, MoE routing, feed-forward networks, attention computation, scheduling, and communication overhead.&lt;/p&gt;
&lt;p&gt;The Hugging Face article separates two numbers: in 1M-token context, DeepSeek-V4-Pro&amp;rsquo;s per-token inference FLOPs are 27% of DeepSeek-V3.2&amp;rsquo;s, while its KV Cache is 10%. Cache and compute are different dimensions.&lt;/p&gt;
&lt;p&gt;The safer statement is: DeepSeek-V4 greatly reduces KV Cache pressure for ultra-long context, improving deployment feasibility for million-token scenarios. Actual latency and throughput still depend on implementation, hardware, batching, quantization, and inference framework.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The biggest difference between DeepSeek-V4 and other large models is that it moves KV Cache optimization from the attention-head dimension into the sequence-length dimension.&lt;/p&gt;
&lt;p&gt;GQA stores fewer KV heads. MLA makes each token&amp;rsquo;s KV representation more compact. DeepSeek-V4 further aggregates distant tokens into compressed blocks and combines CSA, HCA, sliding windows, and low-precision storage so million-token context is not immediately blocked by KV Cache.&lt;/p&gt;
&lt;p&gt;This is not a single trick. It is a long-context inference architecture: preserve details nearby, compress distant context, retrieve details when needed, and summarize globally when possible.&lt;/p&gt;
&lt;p&gt;For developers and agent applications, the meaning is direct: long context is not just about accepting more input. It must be runnable, stable, and affordable. That is what DeepSeek-V4 changes.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/blog/deepseekv4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face: DeepSeek-V4: a million-token context that agents can actually use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/docs/transformers/model_doc/deepseek_v4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Hugging Face Transformers: DeepSeek-V4 model documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2412.19437&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-V3 Technical Report&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>DeepSeek V4 Flash for a Godot Game Demo: How Far Can a Few Cents Go?</title>
        <link>https://knightli.com/en/2026/05/06/deepseek-v4-flash-godot-game-demo/</link>
        <pubDate>Wed, 06 May 2026 09:22:18 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/06/deepseek-v4-flash-godot-game-demo/</guid>
        <description>&lt;p&gt;Can &lt;code&gt;DeepSeek V4 Flash&lt;/code&gt; handle Godot game demo development?&lt;/p&gt;
&lt;p&gt;The focus is simple: can it create a small Godot demo that runs, can be observed, and includes physics effects?&lt;/p&gt;
&lt;p&gt;The short answer is yes. The quality is not commercial-grade, but it is already enough for gameplay prototyping and physics interaction demos. More importantly, the cost is very low, which makes it suitable for quickly validating ideas.&lt;/p&gt;
&lt;h2 id=&#34;demo-performance&#34;&gt;Demo Performance
&lt;/h2&gt;&lt;p&gt;The focus of this demo is physics interaction.&lt;/p&gt;
&lt;p&gt;The visible effects include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The rope can be cut.&lt;/li&gt;
&lt;li&gt;The box falls to the ground.&lt;/li&gt;
&lt;li&gt;After increasing the mass, box collisions become more forceful.&lt;/li&gt;
&lt;li&gt;The rope shows noticeable elasticity.&lt;/li&gt;
&lt;li&gt;After adjusting friction and elasticity, the box shows clear sliding and bouncing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Judging by what it shows, this is no longer just &amp;ldquo;a few generated Godot scripts&amp;rdquo;. It is a small prototype that can run and show observable physics behavior.&lt;/p&gt;
&lt;h2 id=&#34;usability&#34;&gt;Usability
&lt;/h2&gt;&lt;p&gt;The value of this demo is that it can run, be viewed, and be modified. It is not a complete game, nor an engineering project ready for direct commercialization, but it already demonstrates several things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek V4 Flash&lt;/code&gt; can understand the basic goal of a Godot demo.&lt;/li&gt;
&lt;li&gt;An AI Agent can turn requirements into a runnable project.&lt;/li&gt;
&lt;li&gt;Non-web tasks such as Godot physics interaction are entering a low-cost prototyping stage.&lt;/li&gt;
&lt;li&gt;For individual developers, it can quickly turn an idea into something visible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the goal is to build a full production game, it is obviously not enough. But if the goal is to verify whether a gameplay idea is interesting or whether a rough physics effect is achievable, this demo is already usable.&lt;/p&gt;
&lt;h2 id=&#34;cost-significance&#34;&gt;Cost Significance
&lt;/h2&gt;&lt;p&gt;The most notable part is not how polished the visuals are, but the cost.&lt;/p&gt;
&lt;p&gt;If a runnable Godot physics demo can be produced for a few cents in model costs, the significance is not that it replaces professional game development, but that it sharply reduces the cost of prototype trial and error.&lt;/p&gt;
&lt;p&gt;In the past, validating a small game idea usually required knowing Godot, writing scripts, setting up scenes, and adjusting physics parameters. Now an AI Agent can first generate a runnable version, and humans can judge whether the direction makes sense.&lt;/p&gt;
&lt;p&gt;For indie developers, this kind of low-cost experimentation is useful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Quickly validate gameplay concepts.&lt;/li&gt;
&lt;li&gt;Generate temporary demos for others to see.&lt;/li&gt;
&lt;li&gt;Explore Godot APIs and the physics system.&lt;/li&gt;
&lt;li&gt;Turn ideas into an initial runnable project.&lt;/li&gt;
&lt;li&gt;Reduce handwritten code cost before the direction is clear.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deepseek-v4-flashs-performance&#34;&gt;DeepSeek V4 Flash&amp;rsquo;s Performance
&lt;/h2&gt;&lt;p&gt;What is worth noting is that the model used here is &lt;code&gt;DeepSeek V4 Flash&lt;/code&gt;, not a more expensive and heavier flagship model.&lt;/p&gt;
&lt;p&gt;It performs well in the role of a low-cost prototype model. It is not the strongest, most stable, or most suitable model for delivering production engineering, but it is attractive in budget-sensitive scenarios where the goal is to quickly test a direction.&lt;/p&gt;
&lt;h2 id=&#34;suitable-scenarios&#34;&gt;Suitable Scenarios
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;DeepSeek V4 Flash + Agent + Godot&lt;/code&gt; is better suited to these tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small gameplay prototypes.&lt;/li&gt;
&lt;li&gt;Physics effect demos.&lt;/li&gt;
&lt;li&gt;UI or interaction concept validation.&lt;/li&gt;
&lt;li&gt;Teaching examples.&lt;/li&gt;
&lt;li&gt;Helping understand Godot project structure.&lt;/li&gt;
&lt;li&gt;Generating a first runnable project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is less suitable for directly taking on these tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large game architecture.&lt;/li&gt;
&lt;li&gt;Complex character controllers.&lt;/li&gt;
&lt;li&gt;Network synchronization.&lt;/li&gt;
&lt;li&gt;Core code for commercial projects.&lt;/li&gt;
&lt;li&gt;High-precision physics simulation.&lt;/li&gt;
&lt;li&gt;Automated submission without human testing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, it is suitable as a first draft and testbed, not as the owner of production engineering.&lt;/p&gt;
&lt;h2 id=&#34;what-this-shows&#34;&gt;What This Shows
&lt;/h2&gt;&lt;p&gt;This shows that AI coding is continuing to expand from websites, scripts, and backend APIs into game development and interactive prototyping.&lt;/p&gt;
&lt;p&gt;Game development used to have a high barrier to entry, especially when engines, scripts, asset management, and physics systems were mixed together. Beginners could easily get stuck. Now models plus Agent tools can first set up the project, letting developers focus on gameplay judgment and effect tuning.&lt;/p&gt;
&lt;p&gt;This may bring three changes:&lt;/p&gt;
&lt;p&gt;First, game prototypes become cheaper. Many ideas no longer need to wait until full development to be validated; they can first become runnable demos.&lt;/p&gt;
&lt;p&gt;Second, indie developers may become more willing to experiment. People who do not know Godot can still use AI to touch the project structure and basic workflow.&lt;/p&gt;
&lt;p&gt;Third, model stability becomes more important. Game development is not just about getting code to run. The effect also needs to be reasonable, the feel needs to be right, and parameters need to be controllable. In the future, models that can better take actual visuals and runtime state into account will be more suitable for this kind of task.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;DeepSeek V4 Flash for a Godot demo can be summarized in one sentence: &lt;strong&gt;not perfect, but cheap enough, fast enough, and suitable enough for prototyping.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It is still far from commercial games, but if the goal is to validate a small game idea at extremely low cost, it is already valuable.&lt;/p&gt;
&lt;p&gt;For individual developers, the most realistic use is not handing the whole game to AI, but letting AI first produce a runnable project while humans handle judgment, trade-offs, and polishing. Used this way, low-cost models such as DeepSeek V4 Flash become genuinely appealing.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions</title>
        <link>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</link>
        <pubDate>Fri, 01 May 2026 11:55:25 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/01/deepseek-v4-local-vram-quantization-table/</guid>
        <description>&lt;p&gt;DeepSeek V4 and Gemma 4 are not in the same class for local deployment.
With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.&lt;/p&gt;
&lt;p&gt;The official DeepSeek V4 Preview release mainly includes two inference models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;: &lt;code&gt;1.6T total / 49B active params&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;: &lt;code&gt;284B total / 13B active params&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The official Hugging Face collection also includes two Base models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This article only discusses rough VRAM requirements when the full model weights are loaded.
For MoE models, &lt;code&gt;active params&lt;/code&gt; mainly affects per-token compute. It does not mean only those parameters need to be loaded.
Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.&lt;/p&gt;
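&lt;p&gt;The gap between active and total parameters is easy to see with rough arithmetic. The sketch below assumes a uniform 8 bits per parameter, which is a simplification; published checkpoints may mix precisions, so real file sizes can differ from these figures.&lt;/p&gt;

```python
def weight_gb(params_billion, bits_per_param):
    """Raw weight size in GB (10^9 bytes) for a parameter count and bit width."""
    return params_billion * bits_per_param / 8

# (total params, active params) in billions, as listed above.
MODELS = {
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
}

for name, (total_b, active_b) in MODELS.items():
    full = weight_gb(total_b, 8)          # what must be resident without offload
    active_only = weight_gb(active_b, 8)  # what a single token actually computes with
    print(f"{name}: full={full:.0f}GB, active-only={active_only:.0f}GB")
```

&lt;p&gt;The active-only figure describes per-token compute, not memory: any expert may be selected for the next token, so the full set has to be loaded or reachable.&lt;/p&gt;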
&lt;h2 id=&#34;quick-summary&#34;&gt;Quick Summary
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;VRAM Scale&lt;/th&gt;
          &lt;th&gt;What Is Realistic&lt;/th&gt;
          &lt;th&gt;Do Not Expect&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;24GB&lt;/td&gt;
          &lt;td&gt;Cannot fully run DeepSeek V4; use smaller distilled models or API&lt;/td&gt;
          &lt;td&gt;Full V4-Flash / V4-Pro local loading&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;48GB&lt;/td&gt;
          &lt;td&gt;Still not suitable for full loading; good for small models or remote API clients&lt;/td&gt;
          &lt;td&gt;Stable V4-Flash Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;80GB&lt;/td&gt;
          &lt;td&gt;Can theoretically try V4-Flash Q2/Q3 or heavy offload&lt;/td&gt;
          &lt;td&gt;V4-Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;128GB&lt;/td&gt;
          &lt;td&gt;V4-Flash Q4 becomes more realistic; Q5/Q6 still tight&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;192GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;256GB&lt;/td&gt;
          &lt;td&gt;V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested&lt;/td&gt;
          &lt;td&gt;V4-Pro Q5 and above&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;512GB&lt;/td&gt;
          &lt;td&gt;V4-Pro Q4 starts to become discussable&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;1TB+&lt;/td&gt;
          &lt;td&gt;V4-Pro FP8 and low-bit Pro-Base are more realistic&lt;/td&gt;
          &lt;td&gt;Low-cost single-machine deployment&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2TB+&lt;/td&gt;
          &lt;td&gt;Pro-Base FP8 class&lt;/td&gt;
          &lt;td&gt;Ordinary workstation deployment&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target.
More realistic options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the official DeepSeek API or compatible services.&lt;/li&gt;
&lt;li&gt;Wait for stable community GGUF/EXL2/MLX quantizations and inference support.&lt;/li&gt;
&lt;li&gt;Use smaller DeepSeek distilled models.&lt;/li&gt;
&lt;li&gt;Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-weight-sizes&#34;&gt;Official Weight Sizes
&lt;/h2&gt;&lt;p&gt;The following figures come from &lt;code&gt;model.safetensors.index.json&lt;/code&gt; in the official Hugging Face repositories.
They reflect current public weight file sizes, not full runtime VRAM use under long context.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Scale&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Official Weight Size&lt;/th&gt;
          &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total / 13B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td&gt;Inference model, smallest in this group&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total / 49B active&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td&gt;Inference model, stronger but enormous&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Flash-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;284B total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td&gt;Base model, closer to full FP8 weight size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;DeepSeek-V4-Pro-Base&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;1.6T total&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td&gt;Base model, about 1.6TB&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even the smallest &lt;code&gt;V4-Flash&lt;/code&gt; is already close to 160GB of official weights.
That is why it should not be treated like a 13B model just because it has &lt;code&gt;13B active params&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-flash-vram-estimate&#34;&gt;DeepSeek V4 Flash VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Flash&lt;/code&gt; is the most approachable DeepSeek V4 variant for local experiments.
But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.&lt;/p&gt;
&lt;p&gt;The table below uses the official 159.61GB weight size as the baseline.
The Q6 through Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;159.61GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Multi-GPU servers, inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;120GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td&gt;Quality-first quantization tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;More realistic starting point for Flash&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td&gt;Large-VRAM single GPU or multi-GPU tests&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;40GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64GB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit experiments with clear quality risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Even if mature &lt;code&gt;V4-Flash Q4&lt;/code&gt; builds appear later, they will probably still not fit on a 24GB GPU.
A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.&lt;/p&gt;
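&lt;p&gt;The estimated weight sizes in the table are consistent with scaling the official 159.61GB figure by bits/8, treating the official weights as the 8-bit point. That reading is an assumption about how the estimates were derived, and the helper below is a hypothetical sketch rather than a measurement of any real quantized build.&lt;/p&gt;

```python
def scaled_weight_gb(baseline_gb, bits, baseline_bits=8):
    """Scale a baseline weight size linearly with bit width."""
    return baseline_gb * bits / baseline_bits

FLASH_OFFICIAL_GB = 159.61  # official V4-Flash weight size from the table above

estimates = {f"Q{bits}": scaled_weight_gb(FLASH_OFFICIAL_GB, bits) for bits in (6, 5, 4, 3, 2)}

for label, size in estimates.items():
    print(f"{label}: {size:.0f}GB")
```

&lt;p&gt;Real quantization formats add scales and metadata and often keep some tensors at higher precision, so actual files usually come out somewhat larger than a pure bit-width estimate.&lt;/p&gt;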
&lt;h2 id=&#34;deepseek-v4-pro-vram-estimate&#34;&gt;DeepSeek V4 Pro VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro&lt;/code&gt; is the flagship inference model, with official weights around 864.70GB.
Even at 4-bit quantization, the full weights remain in the hundreds of GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;864.70GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB+&lt;/td&gt;
          &lt;td&gt;Multi-node or multi-GPU inference service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;648GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;High-quality quantized service&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;540GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td&gt;Quality/cost balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;432GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Lowest practical quality line for Pro&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;324GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;216GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments with high quality and stability risk&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For individual users, &lt;code&gt;V4-Pro&lt;/code&gt; is better used through the API.
If the goal is full local deployment, treat it as a multi-GPU server model, not something to run on a single 4090, 5090, or RTX PRO card.&lt;/p&gt;
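&lt;p&gt;The estimated weight sizes in the table above are consistent with scaling the official FP8 size by the target bit width over 8. The sketch below reproduces that arithmetic. It assumes the official weights average about 8 bits per parameter and uses nominal bit widths; real quantized files usually store extra scale metadata and can come out somewhat larger. The helper name is hypothetical.&lt;/p&gt;

```python
# Hypothetical helper: scale an FP8 checkpoint size to a nominal bit width.
# Assumes the official weights average about 8 bits per parameter.
def estimate_weight_gb(fp8_size_gb, bits):
    return fp8_size_gb * bits / 8


V4_PRO_FP8_GB = 864.70  # official V4-Pro weight size quoted above

for bits in (6, 5, 4, 3, 2):
    size = estimate_weight_gb(V4_PRO_FP8_GB, bits)
    print(f'Q{bits}: about {size:.1f}GB')
```

&lt;p&gt;Each result lands within about 1GB of the corresponding table row. The Minimum and Safer VRAM columns then add headroom for KV cache and runtime overhead on top of the weights.&lt;/p&gt;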
&lt;h2 id=&#34;deepseek-v4-flash-base-vram-estimate&#34;&gt;DeepSeek V4 Flash-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment.
&lt;code&gt;V4-Flash-Base&lt;/code&gt; has official weights of about 294.67GB.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;294.67GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td&gt;Research, preprocessing, evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;221GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320GB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;184GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;256GB&lt;/td&gt;
          &lt;td&gt;Quality/size balance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;224GB&lt;/td&gt;
          &lt;td&gt;Lower-cost Base experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;111GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160GB&lt;/td&gt;
          &lt;td&gt;Low-bit experiments&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you only want to use DeepSeek V4 capabilities, do not start with the Base model.
Base models cost more to deploy and tune; most applications should use the inference model or API.&lt;/p&gt;
&lt;h2 id=&#34;deepseek-v4-pro-base-vram-estimate&#34;&gt;DeepSeek V4 Pro-Base VRAM Estimate
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;V4-Pro-Base&lt;/code&gt; is the heaviest variant, with official weights around 1606.03GB.
That is already a 1.6TB-class model file.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Version / Quantization&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Estimated Weight Size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Minimum VRAM&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Safer VRAM&lt;/th&gt;
          &lt;th&gt;Best For&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;FP8 / official weights&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1606.03GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.4TB+&lt;/td&gt;
          &lt;td&gt;Large-scale research clusters&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q6&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1205GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2TB&lt;/td&gt;
          &lt;td&gt;High-quality quantization research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q5&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1004GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.5TB&lt;/td&gt;
          &lt;td&gt;Research and evaluation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q4&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;803GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1.2TB&lt;/td&gt;
          &lt;td&gt;Low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q3&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;602GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;768GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1TB&lt;/td&gt;
          &lt;td&gt;Extreme low-bit research&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;code&gt;Q2&lt;/code&gt;&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;402GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;512GB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;640GB&lt;/td&gt;
          &lt;td&gt;Extreme experiments&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This kind of model should not be framed as a “can a home GPU run it?” question.
Even Q4 is already beyond the comfortable range of most single-machine workstations.&lt;/p&gt;
&lt;h2 id=&#34;why-active-params-are-not-enough&#34;&gt;Why Active Params Are Not Enough
&lt;/h2&gt;&lt;p&gt;DeepSeek V4 is a Mixture-of-Experts (MoE) model.
MoE means each token activates only a subset of the experts, so per-token compute is much lower than the total parameter count suggests.
But this does not mean VRAM only needs to hold the active parameters.&lt;/p&gt;
&lt;p&gt;Full local inference also depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether all expert weights must stay resident on GPU.&lt;/li&gt;
&lt;li&gt;Whether on-demand expert loading is supported.&lt;/li&gt;
&lt;li&gt;CPU memory to GPU memory transfer costs.&lt;/li&gt;
&lt;li&gt;NVMe offload latency.&lt;/li&gt;
&lt;li&gt;KV cache growth under long context.&lt;/li&gt;
&lt;li&gt;Extra runtime overhead under 1M context.&lt;/li&gt;
&lt;li&gt;Multi-node and multi-GPU communication cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;V4-Pro&lt;/code&gt; with &lt;code&gt;49B active&lt;/code&gt; should not be deployed like a 49B model.
&lt;code&gt;V4-Flash&lt;/code&gt; with &lt;code&gt;13B active&lt;/code&gt; should not be treated like a 13B small model either.&lt;/p&gt;
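&lt;p&gt;A rough back-of-the-envelope comparison makes the gap concrete. The sketch below assumes about 1 byte per parameter (FP8) for the active weights, which is an assumption; real checkpoints may mix precisions. It compares the weights touched for one token against the full &lt;code&gt;V4-Pro&lt;/code&gt; weight size quoted earlier.&lt;/p&gt;

```python
# Rough sketch: weights touched per token vs weights that must be stored.
# Assumes about 1 byte per parameter (FP8); real checkpoints may mix precisions.
BYTES_PER_PARAM = 1


def active_weight_gb(active_params_billions):
    # 1e9 parameters at 1 byte each is 1GB (decimal), so billions map to GB.
    return active_params_billions * BYTES_PER_PARAM


V4_PRO_TOTAL_GB = 864.70  # official V4-Pro weights quoted in this article
V4_PRO_ACTIVE_B = 49      # active parameters per token, in billions

active_gb = active_weight_gb(V4_PRO_ACTIVE_B)
share = active_gb / V4_PRO_TOTAL_GB
print(f'active per token: about {active_gb}GB')
print(f'share of full weights: {share:.1%}')
```

&lt;p&gt;Under these assumptions a single token touches only a small fraction of the weights, but the selected experts change from token to token, so the full set still has to live somewhere reachable: GPU memory, CPU memory, or NVMe.&lt;/p&gt;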
&lt;h2 id=&#34;how-to-choose&#34;&gt;How to Choose
&lt;/h2&gt;&lt;p&gt;If you are an ordinary individual user:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Do not try to fully self-host DeepSeek V4.&lt;/li&gt;
&lt;li&gt;Use the official API when you need DeepSeek V4 capabilities.&lt;/li&gt;
&lt;li&gt;For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.&lt;/li&gt;
&lt;li&gt;With only 24GB to 48GB of VRAM, quantized 7B, 14B, 32B, or 70B models are more practical.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 128GB to 256GB total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watch for stable community implementations of &lt;code&gt;V4-Flash Q4/Q5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Do not treat &lt;code&gt;V4-Pro&lt;/code&gt; as your main local model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have 512GB+ total VRAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V4-Pro Q4&lt;/code&gt; starts to become an engineering validation target.&lt;/li&gt;
&lt;li&gt;You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key question for DeepSeek V4 local deployment is not “which quantized file should I download?”
It is “do I have the system-level inference capacity for this model?”
DeepSeek V4 is closer to a server model than a desktop model.&lt;/p&gt;
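&lt;p&gt;The tiers above can be written down as a small lookup, shown here as a sketch. The thresholds come from this section, but the article does not cover the 48GB to 128GB and 256GB to 512GB gaps, so extending the nearest lower tier into them is an assumption, as are the names used.&lt;/p&gt;

```python
from bisect import bisect_right

# Tier boundaries in GB of total VRAM, taken from the guidance above.
# Gaps between the stated ranges fall back to the nearest lower tier,
# which is an assumption of this sketch.
TIER_STARTS = [0, 128, 512]
TIER_ADVICE = [
    'Use the API; run smaller quantized models locally',
    'Watch for stable V4-Flash Q4/Q5 community builds',
    'V4-Pro Q4 becomes an engineering validation target',
]


def suggest(total_vram_gb):
    # bisect_right counts how many tier starts are at or below the value.
    index = max(bisect_right(TIER_STARTS, total_vram_gb) - 1, 0)
    return TIER_ADVICE[index]


for vram in (24, 192, 640):
    print(vram, '->', suggest(vram))
```

&lt;p&gt;A lookup like this only captures capacity. Framework support, expert scheduling, KV cache, throughput, and concurrency still decide whether a deployment is usable.&lt;/p&gt;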
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://api-docs.deepseek.com/news/news260424&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek V4 Preview Release - DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/collections/deepseek-ai/deepseek-v4&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DeepSeek-V4 collection - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Pro-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;deepseek-ai/DeepSeek-V4-Flash-Base - Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Choose Between GPT 5.5, Claude Opus 4.7, DeepSeek V4, and Qwen 3.6 Max</title>
        <link>https://knightli.com/en/2026/04/28/coding-ai-benchmark-gpt55-claude-opus47-deepseek-v4-qwen36max/</link>
        <pubDate>Tue, 28 Apr 2026 22:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/28/coding-ai-benchmark-gpt55-claude-opus47-deepseek-v4-qwen36max/</guid>
        <description>&lt;p&gt;If you only want the short answer, remember this version first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you want the most reliable option and the least wasted time, start with &lt;code&gt;GPT 5.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;If you care most about page presentation, creativity, and visual polish, &lt;code&gt;Claude Opus 4.7&lt;/code&gt; is still strong&lt;/li&gt;
&lt;li&gt;If you want to know which domestic (Chinese) model is closest to the top tier, &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; is highly competitive now&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeepSeek V4&lt;/code&gt; is not weak, but its output is more uneven than the others&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When people ask which coding AI is the strongest right now, they are usually not really asking about a leaderboard. They are asking something more practical:&lt;br&gt;
&lt;strong&gt;If I need to build a page, make a demo, generate a small tool, or add interaction, which model is most likely to give me something usable on the first try?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From that angle, the differences between these models are already pretty clear.&lt;/p&gt;
&lt;h2 id=&#34;the-overall-verdict&#34;&gt;The Overall Verdict
&lt;/h2&gt;&lt;p&gt;If you put &lt;code&gt;GPT 5.5&lt;/code&gt;, &lt;code&gt;Claude Opus 4.7&lt;/code&gt;, &lt;code&gt;DeepSeek V4&lt;/code&gt;, and &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; side by side, the most consistent all-around choice is still &lt;code&gt;GPT 5.5&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is not always the flashiest one, but it rarely disappoints. It is fast, its first draft usually comes out nearly complete, and it handles logic, interaction, motion, and small games with a steady hand.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Claude Opus 4.7&lt;/code&gt; feels different. Its biggest strength is not pure stability. It is page atmosphere, UI organization, and presentation. A lot of the time, you open what it made and your first reaction is simply that it looks polished. If visual presentation matters more to you, it is still well worth considering.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Qwen 3.6 Max&lt;/code&gt; is the one that most deserves a fresh look. It is no longer just &amp;ldquo;usable for a domestic model.&amp;rdquo; In some scenarios, it can genuinely go head-to-head with &lt;code&gt;GPT 5.5&lt;/code&gt; on output quality. In frontend pages, visual completeness, and realism, it has started to build real presence.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;DeepSeek V4&lt;/code&gt; is not failing because it cannot do the work. The issue is that it is less predictable. When it works, it can be perfectly solid, and sometimes surprisingly good. But the gap between its better and weaker outputs is still more obvious than it is with the others.&lt;/p&gt;
&lt;h2 id=&#34;where-gpt-55-is-strongest&#34;&gt;Where &lt;code&gt;GPT 5.5&lt;/code&gt; Is Strongest
&lt;/h2&gt;&lt;p&gt;If the things you do most often look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate a complete webpage&lt;/li&gt;
&lt;li&gt;Build a small demo with motion&lt;/li&gt;
&lt;li&gt;Create an interactive page with some logic&lt;/li&gt;
&lt;li&gt;Generate a small game or a multi-state interaction&lt;/li&gt;
&lt;li&gt;Keep rework to a minimum&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then &lt;code&gt;GPT 5.5&lt;/code&gt; is still the safest default answer.&lt;/p&gt;
&lt;p&gt;Its advantages are mostly these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast code generation&lt;/li&gt;
&lt;li&gt;High first-draft usability&lt;/li&gt;
&lt;li&gt;Fewer hard mistakes in logic and interaction&lt;/li&gt;
&lt;li&gt;Stable performance on mixed tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To put it more simply, &lt;code&gt;GPT 5.5&lt;/code&gt; feels like the model most likely to get the foundation right on the first pass.&lt;br&gt;
What many people actually need is not the most dazzling result in one category. They need the first version not to break. On that front, it is still the least stressful choice.&lt;/p&gt;
&lt;p&gt;Of course, it is not without weaknesses.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On highly visual pages, it is not always the most surprising&lt;/li&gt;
&lt;li&gt;Sometimes it is so stable that it leaves less of a design impression&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So if you want one default recommendation, it is still &lt;code&gt;GPT 5.5&lt;/code&gt;.&lt;br&gt;
That does not mean it is the only one worth looking at.&lt;/p&gt;
&lt;h2 id=&#34;who-claude-opus-47-fits-best&#34;&gt;Who &lt;code&gt;Claude Opus 4.7&lt;/code&gt; Fits Best
&lt;/h2&gt;&lt;p&gt;The appeal of &lt;code&gt;Claude Opus 4.7&lt;/code&gt; comes more from how the page feels.&lt;/p&gt;
&lt;p&gt;Its strengths are usually:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cleaner UI structure&lt;/li&gt;
&lt;li&gt;More complete visual presentation&lt;/li&gt;
&lt;li&gt;Stronger presentation quality on some pages&lt;/li&gt;
&lt;li&gt;More noticeable creativity in visualization and design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the model is helping you build things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Demo pages&lt;/li&gt;
&lt;li&gt;Data presentation pages&lt;/li&gt;
&lt;li&gt;Small pages where visual feel matters a lot&lt;/li&gt;
&lt;li&gt;Outputs that should look polished immediately&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then &lt;code&gt;Claude&lt;/code&gt; still deserves a place near the top.&lt;/p&gt;
&lt;p&gt;Its weaknesses are also fairly clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is not as stable as &lt;code&gt;GPT 5.5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Sometimes it looks good, but the detailed logic drifts&lt;/li&gt;
&lt;li&gt;In some cases the code runs, yet the core experience is not quite right&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;Claude&lt;/code&gt; feels more like a frontend-leaning model with extra aesthetic instinct.&lt;br&gt;
If your first priority is how the page looks, it has real advantages. If your biggest fear is a logic mistake in the first output, you need to be a bit more careful.&lt;/p&gt;
&lt;h2 id=&#34;why-qwen-36-max-deserves-serious-attention&#34;&gt;Why &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; Deserves Serious Attention
&lt;/h2&gt;&lt;p&gt;Among these models, &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; gives the strongest sense of momentum.&lt;/p&gt;
&lt;p&gt;Not long ago, many people looked at domestic coding AI mainly by asking whether it could keep up at all. With &lt;code&gt;Qwen 3.6 Max&lt;/code&gt;, the question is already different:&lt;br&gt;
&lt;strong&gt;In frontend-first output scenarios, can it directly compete with the top overseas models?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Its strongest areas right now include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Good-looking page output&lt;/li&gt;
&lt;li&gt;Solid motion and realistic visual effects in some cases&lt;/li&gt;
&lt;li&gt;Outputs that feel more complete&lt;/li&gt;
&lt;li&gt;Results that sometimes come close to &lt;code&gt;GPT 5.5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That says something important.&lt;br&gt;
If your use case leans toward webpages, frontend work, and presentation-heavy output, &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; is no longer just a backup option. It can be treated as a serious main candidate.&lt;/p&gt;
&lt;p&gt;It still has some weaknesses, though.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On interaction-heavy logic tasks, it can still lose a bit of completeness&lt;/li&gt;
&lt;li&gt;Some pages look very good, while some tasks fall flatter than expected&lt;/li&gt;
&lt;li&gt;Its variance is still higher than that of &lt;code&gt;GPT 5.5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even so, its current presence is already very strong.&lt;br&gt;
If you want to know which domestic model deserves the most attention right now, it is hard to look past &lt;code&gt;Qwen 3.6 Max&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;where-deepseek-v4-stands-right-now&#34;&gt;Where &lt;code&gt;DeepSeek V4&lt;/code&gt; Stands Right Now
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;DeepSeek V4&lt;/code&gt; is a little more complicated to place.&lt;/p&gt;
&lt;p&gt;The issue is not that it cannot do the work. The issue is that it is harder to predict where a given result will land.&lt;br&gt;
Sometimes it can finish the task with decent visuals and working functionality. Sometimes, once the task asks for animation, logic, and data presentation at the same time, it becomes more likely to stumble.&lt;/p&gt;
&lt;p&gt;Right now it feels more like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It has real ability&lt;/li&gt;
&lt;li&gt;It is not weak&lt;/li&gt;
&lt;li&gt;It can still hand in acceptable results on some tasks&lt;/li&gt;
&lt;li&gt;But its stability is not yet reassuring enough&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That shapes who it suits best.&lt;/p&gt;
&lt;p&gt;If you do not mind trying a few times, can tolerate an occasional restart, or already plan to check and edit the code yourself, &lt;code&gt;DeepSeek V4&lt;/code&gt; is still worth using.&lt;br&gt;
But if your top priority is reducing friction and maximizing first-pass success, it is not yet the safest option.&lt;/p&gt;
&lt;h2 id=&#34;so-what-should-an-ordinary-user-pick&#34;&gt;So What Should an Ordinary User Pick?
&lt;/h2&gt;&lt;p&gt;If you are not benchmarking models for fun and actually want to get work done, the easiest way is to choose by use case.&lt;/p&gt;
&lt;h3 id=&#34;1-you-want-less-hassle-and-a-higher-first-pass-success-rate&#34;&gt;1. You want less hassle and a higher first-pass success rate
&lt;/h3&gt;&lt;p&gt;Pick &lt;code&gt;GPT 5.5&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is best at this workflow: &amp;ldquo;Here is my requirement, give me a usable first version.&amp;rdquo;&lt;br&gt;
That matters even more when you do not have the time to keep iterating and fixing.&lt;/p&gt;
&lt;h3 id=&#34;2-you-care-more-about-presentation-and-visual-finish&#34;&gt;2. You care more about presentation and visual finish
&lt;/h3&gt;&lt;p&gt;Pick &lt;code&gt;Claude Opus 4.7&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If what you want is a page that already looks more like a finished product, or your work is more demo-oriented and presentation-oriented, &lt;code&gt;Claude&lt;/code&gt; shows its value more easily.&lt;/p&gt;
&lt;h3 id=&#34;3-you-want-the-strongest-domestic-model-for-frontend-first-output&#34;&gt;3. You want the strongest domestic model for frontend-first output
&lt;/h3&gt;&lt;p&gt;Start with &lt;code&gt;Qwen 3.6 Max&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is no longer something you use only as a compromise. It can now be compared directly and seriously.&lt;br&gt;
If your tasks lean toward webpages, motion, and presentation, its competitiveness is already very real.&lt;/p&gt;
&lt;h3 id=&#34;4-you-can-tolerate-some-variance-and-want-to-keep-watching-domestic-progress&#34;&gt;4. You can tolerate some variance and want to keep watching domestic progress
&lt;/h3&gt;&lt;p&gt;Keep an eye on &lt;code&gt;DeepSeek V4&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Its problem is not lack of ability. It is that the level of execution still varies too much.&lt;br&gt;
If the stability keeps improving, it could become much more important.&lt;/p&gt;
&lt;h2 id=&#34;one-last-line&#34;&gt;One Last Line
&lt;/h2&gt;&lt;p&gt;The difference between these mainstream coding AIs is no longer about who can code and who cannot. It is about who is steadier, who looks better, and who fits your kind of work.&lt;/p&gt;
&lt;p&gt;If you want the simplest answer, &lt;code&gt;GPT 5.5&lt;/code&gt; is still the first choice.&lt;br&gt;
If you want stronger presentation quality, &lt;code&gt;Claude Opus 4.7&lt;/code&gt; still has a real edge.&lt;br&gt;
If you care about which domestic model deserves the closest attention, &lt;code&gt;Qwen 3.6 Max&lt;/code&gt; is already near the front.&lt;br&gt;
&lt;code&gt;DeepSeek V4&lt;/code&gt; feels more like a strong contender that is still working on consistency.&lt;/p&gt;
&lt;p&gt;If you want the shortest possible conclusion:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For stability, pick &lt;code&gt;GPT 5.5&lt;/code&gt;. For presentation, pick &lt;code&gt;Claude&lt;/code&gt;. Among domestic models, the one most worth watching is &lt;code&gt;Qwen 3.6 Max&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
