DeepSeek-V4 on KnightLi Blog

DeepSeek-V4 KV Cache Explained: Why 1M Context Uses Less VRAM

Mon, 18 May 2026 18:38:26 +0800

The real question for long-context models is often not whether they can accept one million tokens, but how much VRAM the KV Cache consumes during inference.

During Transformer decoding, every newly generated token needs access to the Key and Value states of previous tokens. The longer the context, the larger the KV Cache. A larger KV Cache puts pressure on VRAM, memory bandwidth, time to first token, and throughput.

DeepSeek-V4 is interesting because it does not only reduce cache along the attention-head dimension. It pushes compression into the sequence-length dimension. According to Hugging Face’s discussion of DeepSeek-V4, in a 1M-token setting, DeepSeek-V4-Pro’s KV Cache is about 10% the size of DeepSeek-V3.2’s, and about 2% the size of a common bf16 GQA architecture’s.

That is the key difference: DeepSeek-V4 does not merely store each KV entry in a smaller format. It reduces the number of KV entries that must be kept and searched over long history.

Several generations of KV Cache optimization

KV Cache optimization has evolved through several routes.

The first is traditional MHA, or Multi-Head Attention. Each Query head has its own Key/Value head. The structure is direct, but the cache stores K/V for every head of every token and grows linearly with sequence length, making VRAM pressure heavy under long context.

The second is GQA, or Grouped Query Attention. Multiple Query heads share fewer Key/Value heads. Many modern models such as LLaMA, Mistral, and Qwen use similar ideas. It significantly reduces KV head count and is now a common long-context optimization.

The third is MLA, or Multi-head Latent Attention. DeepSeek-V2 and DeepSeek-V3 use this route, compressing Key/Value into low-rank latent representations and further reducing cache along the attention-head dimension.

The fourth is DeepSeek-V4’s hybrid compressed attention. It focuses on sequence length: instead of only reducing how much KV each token stores, it compresses multiple historical tokens into fewer KV entries and retrieves them through sparse or dense attention.

Roughly:

MHA: every head remembers separately.
GQA: multiple Query heads share memory.
MLA: each token’s KV representation is compressed into a latent vector.
DeepSeek-V4: many historical tokens are aggregated into fewer compressed memory blocks.
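
To make the first three routes concrete, here is a minimal sketch of per-token cache size. The model shape (32 layers, 32 heads, head dimension 128, 8 KV heads for GQA, a 512-wide latent for MLA) is hypothetical and does not describe any specific model; bf16 storage at 2 bytes per value is also an assumption.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V cached per token across all layers (MHA or GQA)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def latent_bytes_per_token(n_layers, latent_dim, bytes_per_value=2):
    """Bytes per token when each layer caches one latent vector (MLA-style)."""
    return n_layers * latent_dim * bytes_per_value


def cache_gib(bytes_per_token, seq_len):
    """Total cache size in GiB for a sequence of seq_len tokens."""
    return bytes_per_token * seq_len / 2**30


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)  # one KV head per query head
gqa = kv_bytes_per_token(N_LAYERS, 8, HEAD_DIM)        # 8 shared KV heads
mla = latent_bytes_per_token(N_LAYERS, 512)            # one 512-wide latent per layer

for name, per_token in (("MHA", mha), ("GQA", gqa), ("MLA", mla)):
    print(f"{name}: {per_token} B/token, {cache_gib(per_token, 1_000_000):.1f} GiB at 1M tokens")
```

All three still multiply by sequence length, which is the factor the fourth route targets.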

Key change: from head compression to sequence compression

GQA and MLA mainly optimize how much KV each token stores. That works well, but when context reaches 1M tokens, the token count itself becomes the problem.

DeepSeek-V4 compresses old context into blocks. The model does not necessarily preserve full KV for every distant token. Instead, multiple tokens form compressed entries.

It is a bit like reading a very long book: you remember recent pages in detail, while earlier chapters are stored more as summaries, themes, and key clues. DeepSeek-V4’s attention design follows a similar split: keep detail nearby, use compressed representation farther away.

CSA: 4x compression plus sparse retrieval

CSA stands for Compressed Sparse Attention. It is the finer-grained long-context compression mechanism.

In CSA, the model compresses neighboring tokens into fewer KV entries. The Hugging Face Transformers documentation gives a default compression ratio of m=4, meaning roughly every four tokens become one compressed entry.

But it is not simple averaging. CSA uses a learned compression pool and overlapping windows so the model can preserve more useful information. After compression, the query does not attend to all compressed blocks directly. It first uses a Lightning Indexer to score them, selects the most relevant top-k compressed blocks, and then performs the core attention computation.

This gives two benefits:

The number of historical KV entries becomes smaller.
Each query only looks at a relevant subset of compressed blocks.

CSA is suitable for long-range context where details still matter, such as codebases, long documents, and tool-call histories.
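
The compress-then-select flow can be sketched in a few lines. This is a toy illustration only: plain mean pooling stands in for the learned compression pool, a dot product stands in for the Lightning Indexer, overlapping windows are omitted, and the function names are hypothetical.

```python
def compress_blocks(keys, m=4):
    """Average every m consecutive key vectors into one compressed entry."""
    blocks = []
    for start in range(0, len(keys), m):
        chunk = keys[start:start + m]
        dim = len(chunk[0])
        blocks.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return blocks


def top_k_blocks(query, blocks, k):
    """Return the indices (in sequence order) of the k highest-scoring blocks."""
    scores = [sum(q * b for q, b in zip(query, block)) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


keys = [[float(i), 1.0] for i in range(16)]  # 16 toy 2-d key vectors
blocks = compress_blocks(keys, m=4)          # 16 tokens -> 4 compressed entries
selected = top_k_blocks([1.0, 0.0], blocks, k=2)
print(len(blocks), selected)
```

Attention would then run only over the selected blocks, so both the stored entries and the entries read per query shrink.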

HCA: 128x compression plus dense attention

HCA stands for Heavily Compressed Attention, and it is more aggressive.

The Transformers documentation gives a default compression ratio of m'=128. HCA compresses a much longer context span into one compressed entry. Because the compressed sequence becomes very short, it does not need sparse top-k retrieval like CSA. The query can simply perform dense attention over all HCA compressed entries.

HCA acts more like a global summary. It does not try to preserve every detail. Instead, it covers very long history at extremely low cost, helping the model stay aware of global context, long-range topics, and far-away information.

If CSA is “searchable compressed notes,” HCA is closer to a “global table of contents and summary.”

Sliding window: recent context keeps details

DeepSeek-V4 does not compress everything.

In addition to CSA and HCA, it keeps a sliding-window branch for the most recent uncompressed context. The Transformers documentation notes that DeepSeek-V4 attention blocks concatenate long-range compressed branches with sliding-window K/V.

This matters. When generating the next token, the nearest context is often the most important: variable names, function signatures, the current sentence, fresh tool outputs, or the user’s latest instruction. If recent context were over-compressed, output quality would suffer.

So the design is:

Nearby context: preserve uncompressed details.
Mid-to-long context: use CSA for searchable compression.
Farther context: use HCA for heavily compressed global summary.
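
A rough count shows how much each branch shrinks the sequence. The sketch below uses the documented ratios of 4 and 128; the sliding-window size of 128 tokens is a hypothetical placeholder, and the count treats compression as applying to the whole sequence, which is a simplification.

```python
import math


def entries_per_layer(seq_len, ratio, window):
    """Compressed entries plus uncompressed sliding-window entries for one layer."""
    compressed = math.ceil(seq_len / ratio)
    return compressed + min(window, seq_len)


SEQ_LEN = 1_000_000
WINDOW = 128  # hypothetical sliding-window size, not taken from the source

csa_entries = entries_per_layer(SEQ_LEN, 4, WINDOW)    # 250000 compressed + 128 recent
hca_entries = entries_per_layer(SEQ_LEN, 128, WINDOW)  # 7813 compressed + 128 recent

print(csa_entries, hca_entries)
```

Under these assumptions, a CSA layer keeps roughly a quarter of the entries an uncompressed layer would, and an HCA layer keeps under 1%.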

Hybrid layer stack: different layers use different attention

DeepSeek-V4 does not use one attention mechanism in every layer.

The Hugging Face DeepSeek-V4 article notes that V4-Pro’s 61-layer structure uses HCA in the first two layers, alternates CSA and HCA afterward, and uses a sliding-window MTP block at the end. The Transformers documentation also describes V4-Pro as using two HCA bootstrap layers followed by alternating CSA/HCA layers.
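
As an illustration of that layout, the following sketch builds a 61-entry schedule with two HCA layers followed by alternating layers. Which type starts the alternation is an assumption here (CSA first), the MTP block is left out, and `build_layer_schedule` is a hypothetical helper, not part of any library.

```python
def build_layer_schedule(n_layers=61, n_bootstrap=2, first_alternating="csa"):
    """Return a list of attention types: HCA bootstrap layers, then alternation."""
    other = "hca" if first_alternating == "csa" else "csa"
    schedule = ["hca"] * n_bootstrap
    for i in range(n_layers - n_bootstrap):
        schedule.append(first_alternating if i % 2 == 0 else other)
    return schedule


schedule = build_layer_schedule()
counts = {kind: schedule.count(kind) for kind in ("csa", "hca")}
print(schedule[:6], counts)
```

With these assumptions, about half the layers hold the larger CSA cache and half hold the much smaller HCA cache.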

This shows that DeepSeek-V4 treats attention as a layered system. Different layers handle different information roles: some favor global compression, some favor sparse retrieval, and some preserve local windows.

Compared with using one attention type everywhere, this hybrid structure is more complex but better suited to 1M-token context.

FP8 and FP4 further reduce cache cost

DeepSeek-V4’s savings do not come only from compression ratio.

The Hugging Face article notes that most KV entries in V4 use FP8 storage, RoPE-related dimensions remain BF16, and the Lightning Indexer in CSA uses FP4. Compression ratio, low-precision storage, and sparse retrieval together create very low KV Cache usage.
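
The effect of mixed precision on one cache entry can be estimated with simple arithmetic. The split below (512 dimensions in FP8, 64 RoPE dimensions in BF16) is hypothetical and chosen only to show the calculation; scaling factors and other quantization metadata are ignored.

```python
# Bytes per stored value for each format.
BYTES = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def entry_bytes(dims_by_format):
    """Sum the storage cost of one cache entry given dimensions per format."""
    return sum(n_dims * BYTES[fmt] for fmt, n_dims in dims_by_format.items())


mixed = entry_bytes({"fp8": 512, "bf16": 64})  # FP8 body plus BF16 RoPE part
baseline = entry_bytes({"bf16": 576})          # same width, all BF16

print(mixed, baseline, round(mixed / baseline, 3))
```

This saving multiplies with the reduction in entry count from compression rather than replacing it.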

This is a reminder: do not only look at the headline context length. Deployment feasibility is determined by VRAM usage, bandwidth pressure, latency, and implementation quality under long context.

Differences from other models

Compared with traditional MHA, DeepSeek-V4 no longer keeps full attention memory for every token in long history, so cache pressure drops sharply.

Compared with GQA, DeepSeek-V4 does not merely reduce the number of KV heads. It also reduces the number of KV entries for long history. GQA still accumulates cache linearly with sequence length; V4 compresses distant context into blocks.

Compared with DeepSeek-V3’s MLA, V4 extends optimization from “making each token representation more compact” to “compressing the number of historical token entries.” MLA already lowers per-token KV cost significantly, but under million-token context, sequence length remains a bottleneck.

Compared with ordinary sparse attention, CSA compresses first and then performs sparse retrieval over a shorter compressed sequence. HCA goes further, using 128x compression so dense attention becomes cheap.

What it means for agents and long tasks

Agent workflows are especially hungry for long context. They read files, call tools, receive tool results, generate plans, revise plans, and call tools again. The longer the context, the more likely KV Cache becomes the bottleneck.

DeepSeek-V4’s cache design may help in several ways:

Easier handling of long codebases, long documents, and multi-round tool histories.
Less KV Cache pressure on time to first token and throughput.
Longer context or more concurrent requests on the same hardware.
Million-token context becomes closer to practical deployment, not just a benchmark number.

But compressed attention is not free. Compressing historical tokens into blocks involves information trade-offs. The model must balance saving VRAM with preserving retrievable details. Real performance depends on the task: code navigation, legal documents, long-form QA, and agent toolchains all have different detail-recall needs.

Do not read 2% as 2% of all cost

“KV Cache is about 2% of GQA” is easy to misread.

It mainly refers to KV Cache memory size. It does not mean total inference cost drops to 2%, or that every scenario becomes 50x faster. Inference still includes model weight reads, MoE routing, feed-forward networks, attention computation, scheduling, and communication overhead.

The Hugging Face article separates two numbers: in 1M-token context, DeepSeek-V4-Pro’s per-token inference FLOPs are 27% of DeepSeek-V3.2, while KV Cache is 10%. Cache and compute are different dimensions.

The safer statement is: DeepSeek-V4 greatly reduces KV Cache pressure for ultra-long context, improving deployment feasibility for million-token scenarios. Actual latency and throughput still depend on implementation, hardware, batching, quantization, and inference framework.
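
A small Amdahl-style calculation shows why a 2% cache does not imply a 50x speedup. The fraction of total cost attributed to the KV Cache is a hypothetical input here, not a measured figure.

```python
def overall_speedup(kv_fraction, kv_ratio):
    """Overall speedup when only the KV-related share of cost shrinks to kv_ratio."""
    remaining = (1.0 - kv_fraction) + kv_fraction * kv_ratio
    return 1.0 / remaining


# kv_fraction values are illustrative assumptions.
for kv_fraction in (0.25, 0.5, 1.0):
    print(kv_fraction, round(overall_speedup(kv_fraction, 0.02), 2))
```

Only when the KV Cache accounts for all of the cost does the result reach 50x; at a 50% share it is just under 2x.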

Summary

The biggest difference between DeepSeek-V4 and other large models is that it moves KV Cache optimization from the attention-head dimension into the sequence-length dimension.

GQA stores fewer KV heads. MLA makes each token’s KV representation more compact. DeepSeek-V4 further aggregates distant tokens into compressed blocks and combines CSA, HCA, sliding windows, and low-precision storage so million-token context is not immediately blocked by KV Cache.

This is not a single trick. It is a long-context inference architecture: preserve details nearby, compress distant context, retrieve details when needed, and summarize globally when possible.

For developers and agent applications, the meaning is direct: long context is not just about accepting more input. It must be runnable, stable, and affordable. That is what DeepSeek-V4 changes.

DeepSeek V4 Flash for a Godot Game Demo: How Far Can a Few Cents Go?

Wed, 06 May 2026 09:22:18 +0800

Can DeepSeek V4 Flash handle Godot game demo development?

The focus is simple: can it create a small Godot demo that runs, can be observed, and includes physics effects?

The short answer is yes. The quality is not commercial-grade, but it is already enough for gameplay prototyping and physics interaction demos. More importantly, the cost is very low, which makes it suitable for quickly validating ideas.

Demo Performance

The focus of this demo is physics interaction.

Several visible effects include:

The rope can be cut.
The box falls to the ground.
After increasing the mass, box collisions become more forceful.
The rope shows noticeable elasticity.
After adjusting friction and elasticity, the box shows clear sliding and bouncing.

From what it presents, this is no longer just “a few generated Godot scripts”. It is a small prototype that can run and show observable physics behavior.

Usability

The value of this demo is that it can run, be viewed, and be modified. It is not a complete game, nor an engineering project ready for direct commercialization, but it already demonstrates several things:

DeepSeek V4 Flash can understand the basic goal of a Godot demo.
An AI Agent can turn requirements into a runnable project.
Non-web tasks such as Godot physics interaction are entering a low-cost prototyping stage.
For individual developers, it can quickly turn an idea into something visible.

If the goal is to build a formal game, it is obviously not enough. But if the goal is to verify whether a gameplay idea is interesting or whether a rough physics effect is achievable, this demo is already usable.

Cost Significance

The most notable part is not how polished the visuals are, but the cost.

If a runnable Godot physics demo can be produced for a few cents in model costs, the significance is not that it replaces professional game development. It is that it sharply reduces the cost of prototype trial and error.

In the past, validating a small game idea usually required knowing Godot, writing scripts, setting up scenes, and adjusting physics parameters. Now an AI Agent can first generate a runnable version, and humans can judge whether the direction makes sense.

For indie developers, this kind of low-cost experimentation is useful:

Quickly validate gameplay concepts.
Generate temporary demos for others to see.
Explore Godot APIs and the physics system.
Turn ideas into an initial runnable project.
Reduce handwritten code cost before the direction is clear.

DeepSeek V4 Flash’s Performance

What is worth noting is that the model used here is DeepSeek V4 Flash, not a more expensive and heavier flagship model.

It performs well in the role of a low-cost prototype model. It is not the strongest, most stable, or most suitable model for delivering production engineering, but it is attractive in budget-sensitive scenarios where the goal is to quickly test a direction.

Suitable Scenarios

DeepSeek V4 Flash + Agent + Godot is better suited to these tasks:

Small gameplay prototypes.
Physics effect demos.
UI or interaction concept validation.
Teaching examples.
Helping understand Godot project structure.
Generating a first runnable project.

It is less suitable for directly taking on these tasks:

Large game architecture.
Complex character controllers.
Network synchronization.
Core code for commercial projects.
High-precision physics simulation.
Automated submission without human testing.

In other words, it is suitable as a first draft and testbed, not as the owner of production engineering.

What This Shows

This shows that AI coding is continuing to expand from websites, scripts, and backend APIs into game development and interactive prototyping.

Game development used to have a high barrier to entry, especially when engines, scripts, asset management, and physics systems were mixed together. Beginners could easily get stuck. Now models plus Agent tools can first set up the project, letting developers focus on gameplay judgment and effect tuning.

This may bring three changes:

First, game prototypes become cheaper. Many ideas no longer need to wait until full development to be validated; they can first become runnable demos.

Second, indie developers may become more willing to experiment. People who do not know Godot can still use AI to get hands-on with the project structure and basic workflow.

Third, model stability becomes more important. Game development is not just about code running. The effect also needs to be reasonable, the feel needs to be right, and parameters need to be controllable. In the future, models that can better take actual visuals and runtime state into account will be more suitable for this kind of task.

Summary

DeepSeek V4 Flash for a Godot demo can be summarized in one sentence: not perfect, but cheap enough, fast enough, and suitable enough for prototyping.

It is still far from commercial games, but if the goal is to validate a small game idea at extremely low cost, it is already valuable.

For individual developers, the most realistic use is not handing the whole game to AI, but letting AI first produce a runnable project while humans handle judgment, trade-offs, and polishing. Used this way, low-cost models such as DeepSeek V4 Flash become genuinely appealing.

Running DeepSeek V4 Locally: VRAM Estimates for Pro, Flash, and Base Versions

Fri, 01 May 2026 11:55:25 +0800

DeepSeek V4 and Gemma 4 are not in the same class for local deployment. With Gemma 4, it still makes sense to discuss how to run 26B or 31B models on 24GB or 32GB GPUs. DeepSeek V4 is a huge MoE model, and full local deployment quickly moves into multi-GPU workstation or server territory.

The official DeepSeek V4 Preview release mainly includes two inference models:

DeepSeek-V4-Pro: 1.6T total / 49B active params
DeepSeek-V4-Flash: 284B total / 13B active params

The official Hugging Face collection also includes two Base models:

DeepSeek-V4-Pro-Base
DeepSeek-V4-Flash-Base

This article only discusses rough VRAM requirements when the full model weights are loaded. For MoE models, the active parameter count mainly affects per-token compute. It does not mean only those parameters need to be loaded. Without expert-on-demand loading, CPU/NVMe offload, distributed inference, or specialized runtime optimizations, VRAM should still be estimated from the full weight size.

Quick Summary

VRAM Scale	What Is Realistic	Do Not Expect
24GB	Cannot fully run DeepSeek V4; use smaller distilled models or API	Full V4-Flash / V4-Pro local loading
48GB	Still not suitable for full loading; good for small models or remote API clients	Stable V4-Flash Q4
80GB	Theoretically try V4-Flash Q2/Q3 or heavy offload	V4-Pro
128GB	V4-Flash Q4 becomes more realistic; Q5/Q6 still tight	V4-Pro Q4
192GB	V4-Flash FP8/Q6 is more comfortable; Pro Q2 enters experimental range	V4-Pro Q4
256GB	V4-Flash FP8 is fairly comfortable; Pro Q2/Q3 can be tested	V4-Pro Q5 and above
512GB	V4-Pro Q4 starts to become discussable	V4-Pro FP8
1TB+	V4-Pro FP8 and low-bit Pro-Base are more realistic	Low-cost single-machine deployment
2TB+	Pro-Base FP8 class	Ordinary workstation deployment

If your goal is to run a model on a personal computer, DeepSeek V4 is not the right target. More realistic options are:

Use the official DeepSeek API or compatible services.
Wait for stable community GGUF/EXL2/MLX quantizations and inference support.
Use smaller DeepSeek distilled models.
Use local models in the 7B to 70B range from Qwen, Gemma, Llama, and similar families.

Official Weight Sizes

The following figures come from model.safetensors.index.json in the official Hugging Face repositories. They reflect current public weight file sizes, not full runtime VRAM use under long context.

Model	Parameter Scale	Official Weight Size	Notes
`DeepSeek-V4-Flash`	284B total / 13B active	159.61GB	Inference model, smallest in this group
`DeepSeek-V4-Pro`	1.6T total / 49B active	864.70GB	Inference model, stronger but enormous
`DeepSeek-V4-Flash-Base`	284B total	294.67GB	Base model, closer to full FP8 weight size
`DeepSeek-V4-Pro-Base`	1.6T total	1606.03GB	Base model, about 1.6TB

Even the smallest V4-Flash is already close to 160GB of official weights. That is why it should not be treated like a 13B model just because it has 13B active params.

DeepSeek V4 Flash VRAM Estimate

V4-Flash is the most approachable DeepSeek V4 variant for local experiments. But that only means “more approachable than Pro”; it is still not a consumer single-GPU model.

The table below uses the official 159.61GB weight size as the baseline. Q4/Q3/Q2 rows are bit-width estimates and do not imply that stable official GGUF versions currently exist.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	159.61GB	192GB	256GB	Multi-GPU servers, inference service
`Q6`	120GB	160GB	192GB	Quality-first quantization tests
`Q5`	100GB	128GB	160GB	Quality/size balance
`Q4`	80GB	96GB	128GB	More realistic starting point for Flash
`Q3`	60GB	80GB	96GB	Large-VRAM single GPU or multi-GPU tests
`Q2`	40GB	48GB	64GB	Extreme low-bit experiments with clear quality risk

If mature V4-Flash Q4 builds appear later, it still probably will not be a 24GB GPU model. A more realistic starting point is 96GB to 128GB total VRAM, or CPU/offload setups that trade speed for capacity.
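
The estimated weight sizes in the table are consistent with scaling the official size by bits/8. That rule is inferred from the numbers rather than stated as the method, and real quantization formats add overhead for scales and metadata, so actual files would differ. A minimal sketch of the scaling:

```python
def estimate_weight_gb(baseline_gb, bits, baseline_bits=8):
    """Scale a baseline weight size linearly by target bit width."""
    return baseline_gb * bits / baseline_bits


FLASH_OFFICIAL_GB = 159.61  # official V4-Flash weight size quoted above

flash = {f"Q{bits}": round(estimate_weight_gb(FLASH_OFFICIAL_GB, bits)) for bits in (6, 5, 4, 3, 2)}
print(flash)
```

The same function applied to the other official weight sizes gives the corresponding rows for the Pro and Base variants.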

DeepSeek V4 Pro VRAM Estimate

V4-Pro is the flagship inference model, with official weights around 864.70GB. Even at 4-bit quantization, the full weights remain in the hundreds of GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	864.70GB	1TB	1.2TB+	Multi-node or multi-GPU inference service
`Q6`	648GB	768GB	1TB	High-quality quantized service
`Q5`	540GB	640GB	768GB	Quality/cost balance
`Q4`	432GB	512GB	640GB	Lowest practical quality line for Pro
`Q3`	324GB	384GB	512GB	Low-bit experiments
`Q2`	216GB	256GB	320GB	Extreme experiments with high risk to quality and stability

For individual users, V4-Pro is better consumed through an API. If the goal is full local deployment, treat it as a multi-GPU server model, not a 4090, 5090, or RTX PRO single-GPU model.

DeepSeek V4 Flash-Base VRAM Estimate

Base models are usually for research, fine-tuning, or continued training, not ordinary chat deployment. V4-Flash-Base has official weights of about 294.67GB.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	294.67GB	384GB	512GB	Research, preprocessing, evaluation
`Q6`	221GB	256GB	320GB	High-quality quantization research
`Q5`	184GB	224GB	256GB	Quality/size balance
`Q4`	147GB	192GB	224GB	Lower-cost Base experiments
`Q3`	111GB	128GB	160GB	Low-bit experiments
`Q2`	74GB	96GB	128GB	Extreme experiments

If you only want to use DeepSeek V4 capabilities, do not start with the Base model. Base models cost more to deploy and tune; most applications should use the inference model or API.

DeepSeek V4 Pro-Base VRAM Estimate

V4-Pro-Base is the heaviest variant, with official weights around 1606.03GB. That is already a 1.6TB-class model file.

Version / Quantization	Estimated Weight Size	Minimum VRAM	Safer VRAM	Best For
`FP8 / official weights`	1606.03GB	2TB	2.4TB+	Large-scale research clusters
`Q6`	1205GB	1.5TB	2TB	High-quality quantization research
`Q5`	1004GB	1.2TB	1.5TB	Research and evaluation
`Q4`	803GB	1TB	1.2TB	Low-bit research
`Q3`	602GB	768GB	1TB	Extreme low-bit research
`Q2`	402GB	512GB	640GB	Extreme experiments

This kind of model should not be discussed in the framework of “can a home GPU run it?” Even Q4 is already beyond the comfortable range of most single-machine workstations.

Why Active Params Are Not Enough

DeepSeek V4 is an MoE model. MoE means each token activates only part of the experts, so compute is much lower than the total parameter count. But this does not mean VRAM only needs to hold the active parameters.

Full local inference also depends on:

Whether all expert weights must stay resident on GPU.
Whether on-demand expert loading is supported.
CPU memory to GPU memory transfer costs.
NVMe offload latency.
KV cache growth under long context.
Extra runtime overhead under 1M context.
Multi-node and multi-GPU communication cost.

So V4-Pro with 49B active should not be deployed like a 49B model. V4-Flash with 13B active should not be treated like a 13B small model either.
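
The gap between active and resident weights can be put in numbers using the official sizes quoted earlier. This sketch assumes 1GB = 10^9 bytes and takes the rounded parameter counts at face value, so the results are approximate.

```python
def bits_per_param(weight_gb, total_params_billion):
    """Average stored bits per parameter implied by a weight file size."""
    return weight_gb * 8 / total_params_billion


def active_only_gb(weight_gb, total_params_billion, active_params_billion):
    """Size the weights would have if only the active parameters were stored."""
    return weight_gb * active_params_billion / total_params_billion


flash_bits = bits_per_param(159.61, 284)        # V4-Flash: 284B total
flash_active = active_only_gb(159.61, 284, 13)  # 13B active
pro_active = active_only_gb(864.70, 1600, 49)   # V4-Pro: 1.6T total, 49B active

print(round(flash_bits, 2), round(flash_active, 1), round(pro_active, 1))
```

Under these assumptions the active slice of V4-Flash is only a few GB, while the weights that must be resident or reachable are about 160GB.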

How to Choose

If you are an ordinary individual user:

Do not try to fully self-host DeepSeek V4.
Use the official API when you need DeepSeek V4 capabilities.
For private local deployment, first check whether you have mature inference infrastructure or internal multi-GPU servers.
With only 24GB to 48GB VRAM, 7B, 14B, 32B, or 70B quantized models are more practical.

If you have 128GB to 256GB total VRAM:

Watch for stable community implementations of V4-Flash Q4/Q5.
Do not treat V4-Pro as your main local model.

If you have 512GB+ total VRAM:

V4-Pro Q4 starts to become an engineering validation target.
You still need to care about inference framework support, expert scheduling, KV cache, throughput, and concurrency.

The key question for DeepSeek V4 local deployment is not “which quantized file should I download?” It is “do I have the system-level inference capacity for this model?” It is closer to a server model than a desktop model.
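
The guidance above can be condensed into a small lookup. The thresholds come from the ranges in this section; assigning the gaps between stated ranges to the lower tier is an assumption, and `suggest` is a hypothetical helper.

```python
# (minimum total VRAM in GB, advice), checked from largest to smallest.
TIERS = [
    (512, "V4-Pro Q4 as an engineering validation target"),
    (128, "watch for stable community V4-Flash Q4/Q5 builds"),
    (0, "use the API or a smaller quantized local model"),
]


def suggest(total_vram_gb):
    """Return the advice for the highest tier the given VRAM reaches."""
    for threshold, advice in TIERS:
        if total_vram_gb >= threshold:
            return advice
    raise ValueError("total_vram_gb must be non-negative")


for vram in (24, 192, 640):
    print(vram, "->", suggest(vram))
```

A lookup like this only covers capacity; framework support, expert scheduling, and KV cache headroom still need separate checks.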

How to Choose Between GPT 5.5, Claude Opus 4.7, DeepSeek V4, and Qwen 3.6 Max

Tue, 28 Apr 2026 22:18:00 +0800

If you only want the short answer, remember this version first:

If you want the most reliable option and the least wasted time, start with GPT 5.5
If you care most about page presentation, creativity, and visual polish, Claude Opus 4.7 is still strong
If you want to know which domestic model is closest to the top tier, Qwen 3.6 Max is highly competitive now
DeepSeek V4 is not weak, but its output is more uneven than the others

When people ask which coding AI is the strongest right now, they are usually not really asking about a leaderboard. They are asking something more practical:
If I need to build a page, make a demo, generate a small tool, or add interaction, which model is most likely to give me something usable on the first try?

From that angle, the differences between these models are already pretty clear.

The Overall Verdict

If you put GPT 5.5, Claude Opus 4.7, DeepSeek V4, and Qwen 3.6 Max side by side, the most consistent all-around choice is still GPT 5.5.

It is not always the flashiest one, but it rarely leaves you clearly disappointed. It is fast, the first draft usually comes out with high completion, and it handles logic, interaction, motion, and small games with a steady hand.

Claude Opus 4.7 feels different. Its biggest strength is not pure stability. It is page atmosphere, UI organization, and presentation. A lot of the time, you open what it made and your first reaction is simply that it looks polished. If visual presentation matters more to you, it is still very worth considering.

Qwen 3.6 Max is the one that most deserves a fresh look. It is no longer just “usable for a domestic model.” In some scenarios, it can genuinely go head-to-head with GPT 5.5 on output quality. In frontend pages, visual completeness, and realism, it has started to build real presence.

DeepSeek V4’s problem is not that it cannot do the work. The issue is that it is less predictable. When it works, it can be perfectly solid, and sometimes surprisingly good. But the gap between its better and weaker outputs is still more obvious than it is with the others.

Where `GPT 5.5` Is Strongest

If the things you do most often look like this:

Generate a complete webpage
Build a small demo with motion
Create an interactive page with some logic
Generate a small game or a multi-state interaction
Keep rework to a minimum

Then GPT 5.5 is still the safest default answer.

Its advantages are mostly these:

Fast code generation
High first-draft usability
Fewer hard mistakes in logic and interaction
Stable performance on mixed tasks

To put it more simply, GPT 5.5 feels like the model most likely to get the foundation right on the first pass.
What many people actually need is not the most dazzling result in one category. They need the first version not to break. On that front, it is still the least stressful choice.

Of course, it is not without weaknesses.

On highly visual pages, it is not always the most surprising
Sometimes it is so stable that it leaves less of a design impression

So if you want one default recommendation, it is still GPT 5.5.
That does not mean it is the only one worth looking at.

Who `Claude Opus 4.7` Fits Best

The appeal of Claude Opus 4.7 comes more from how the page feels.

Its strengths are usually:

Cleaner UI structure
More complete visual presentation
Stronger presentation quality on some pages
More noticeable creativity in visualization and design

If the model is helping you build things like:

Demo pages
Data presentation pages
Small pages where visual feel matters a lot
Outputs that should look polished immediately

Then Claude still deserves a place near the top.

Its weaknesses are also fairly clear:

It is not as stable as GPT 5.5
Sometimes it looks good, but the detailed logic drifts
In some cases the code runs, yet the core experience is not quite right

So Claude feels more like a frontend-leaning model with extra aesthetic instinct.
If your first priority is how the page looks, it has real advantages. If your biggest fear is a logic mistake in the first output, you need to be a bit more careful.

Why `Qwen 3.6 Max` Deserves Serious Attention

Among these models, Qwen 3.6 Max gives the strongest sense of momentum.

Not long ago, many people looked at domestic coding AI mainly by asking whether it could keep up at all. With Qwen 3.6 Max, the question is already different:
In frontend-first output scenarios, can it directly compete with the top overseas models?

Its strongest areas right now include:

Good-looking page output
Solid motion and realistic visual effects in some cases
Outputs that feel more complete
Results that can sometimes approach or stay close to GPT 5.5

That says something important.
If your use case leans toward webpages, frontend work, and presentation-heavy output, Qwen 3.6 Max is no longer just a backup option. It can be treated as a serious main candidate.

It still has some weaknesses, though.

On interaction-heavy logic tasks, it can still lose a bit of completeness
Some pages look very good, while some tasks fall flatter than expected
Its variance is still higher than GPT 5.5

Even so, its current presence is already very strong.
If you want to know which domestic model deserves the most attention right now, it is hard to look past Qwen 3.6 Max.

Where `DeepSeek V4` Stands Right Now

DeepSeek V4 is a little more complicated to place.

The issue is not that it cannot do the work. The issue is that it is harder to predict where a given result will land.
Sometimes it can finish the task with decent visuals and working functionality. Sometimes, once the task asks for animation, logic, and data presentation at the same time, it becomes more likely to stumble.

Right now it feels more like this:

It has real ability
It is not weak
It can still hand in acceptable results on some tasks
But its stability is not yet reassuring enough

That shapes who it suits best.

If you do not mind trying a few times, can tolerate an occasional restart, or already plan to check and edit the code yourself, DeepSeek V4 is still worth using.
But if your top priority is reducing friction and maximizing first-pass success, it is not yet the safest option.

So What Should an Ordinary User Pick?

If you are not benchmarking models for fun and actually want to get work done, the easiest way is to choose by use case.

1. You want less hassle and a higher first-pass success rate

Pick GPT 5.5.

It is best at this workflow: “Here is my requirement, give me a usable first version.”
That matters even more when you do not have the time to keep iterating and fixing.

2. You care more about presentation and visual finish

Pick Claude Opus 4.7.

If what you want is a page that already looks more like a finished product, or your work is more demo-oriented and presentation-oriented, Claude shows its value more easily.

3. You want the strongest domestic model for frontend-first output

Start with Qwen 3.6 Max.

It is no longer something you use only as a compromise. It can now be compared directly and seriously.
If your tasks lean toward webpages, motion, and presentation, its competitiveness is already very real.

4. You can tolerate some variance and want to keep watching domestic progress

Keep an eye on DeepSeek V4.

Its problem is not lack of ability. It is that the level of execution still varies too much.
If the stability keeps improving, it could become much more important.

One Last Line

The difference between these mainstream coding AIs is no longer about who can code and who cannot. It is about who is steadier, who looks better, and who fits your kind of work.

If you want the simplest answer, GPT 5.5 is still the first choice.
If you want stronger presentation quality, Claude Opus 4.7 still has a distinct edge.
If you care about which domestic model deserves the closest attention, Qwen 3.6 Max is already near the front.
DeepSeek V4 feels more like a strong contender that is still working on consistency.

If you want the shortest possible conclusion:

For stability, pick GPT 5.5. For presentation, pick Claude. Among domestic models, the one most worth watching is Qwen 3.6 Max.
