KV Cache on KnightLi Blog

DeepSeek-V4 KV Cache Explained: Why 1M Context Uses Less VRAM

Mon, 18 May 2026 18:38:26 +0800

The real cost of long-context models is often not whether they can accept one million tokens, but how much VRAM the KV Cache consumes during inference.

During Transformer decoding, every newly generated token needs access to the Key and Value states of previous tokens. The longer the context, the larger the KV Cache. A larger KV Cache puts pressure on VRAM, memory bandwidth, time to first token, and throughput.
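To make that growth concrete, here is a rough sizing sketch for a conventional, uncompressed KV Cache. The layer count, KV head count, and head dimension are hypothetical values picked for illustration, not the configuration of any model discussed here.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    # Factor 2: one Key tensor and one Value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (32_768, 131_072, 1_048_576):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9} tokens -> {gib:.1f} GiB")
```

With these assumed dimensions the cache costs 128 KiB per token, so it reaches about 4 GiB at 32K tokens and about 128 GiB at 1M tokens. The growth is strictly linear in sequence length.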

DeepSeek-V4 is interesting because it does not only reduce cache along the attention-head dimension. It also pushes compression into the sequence-length dimension. According to Hugging Face’s discussion of DeepSeek-V4, in a 1M-token setting, DeepSeek-V4-Pro’s KV Cache is about 10% the size of DeepSeek-V3.2’s, and about 2% the size of a common bf16 GQA architecture’s.

That is the key difference: DeepSeek-V4 does not merely store each KV entry in a smaller format. It reduces the number of KV entries that must be kept and searched over long history.

Several generations of KV Cache optimization

KV Cache optimization has evolved through several routes.

The first is traditional MHA, or Multi-Head Attention. Each Query head has its own Key/Value head. The structure is direct, but the cache grows linearly with both head count and sequence length, so VRAM pressure becomes heavy under long context.

The second is GQA, or Grouped Query Attention. Multiple Query heads share a smaller number of Key/Value heads. Many modern models in the LLaMA, Mistral, and Qwen families use this approach. It significantly reduces KV head count and is now a common long-context optimization.

The third is MLA, or Multi-head Latent Attention. DeepSeek-V2 and DeepSeek-V3 use this route, compressing Key/Value into low-rank latent representations and further reducing cache along the attention-head dimension.

The fourth is DeepSeek-V4’s hybrid compressed attention. It focuses on sequence length: instead of only reducing how much KV each token stores, it compresses multiple historical tokens into fewer KV entries and retrieves them through sparse or dense attention.

Roughly:

MHA: every head remembers separately.
GQA: multiple Query heads share memory.
MLA: each token’s KV representation is compressed into a latent vector.
DeepSeek-V4: many historical tokens are aggregated into fewer compressed memory blocks.
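The four routes can be compared with a little arithmetic. The sketch below counts cached elements per token per layer for MHA, GQA, and MLA, then shows that sequence compression changes a different quantity: the number of entries. All dimensions are hypothetical and chosen only for illustration.

```python
def mha_elems(n_heads, head_dim):
    # Keys and Values for every attention head.
    return 2 * n_heads * head_dim


def gqa_elems(n_kv_heads, head_dim):
    # Keys and Values only for the shared KV heads.
    return 2 * n_kv_heads * head_dim


def mla_elems(latent_dim, rope_dim):
    # One compressed latent shared by all heads, plus a small positional key.
    return latent_dim + rope_dim


def cached_entries(seq_len, tokens_per_entry=1):
    # Sequence compression changes how many entries exist, not their width.
    return seq_len // tokens_per_entry


# Hypothetical dimensions, for illustration only.
print("MHA elems/token/layer:", mha_elems(n_heads=32, head_dim=128))
print("GQA elems/token/layer:", gqa_elems(n_kv_heads=8, head_dim=128))
print("MLA elems/token/layer:", mla_elems(latent_dim=512, rope_dim=64))
print("Entries at 1M tokens, no sequence compression:", cached_entries(1_048_576))
print("Entries at 1M tokens, 4 tokens per entry:", cached_entries(1_048_576, 4))
```

The first three functions shrink the width of each entry. The last one shrinks the count, which is the axis DeepSeek-V4 works on.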

Key change: from head compression to sequence compression

GQA and MLA mainly optimize how much KV each token stores. That works well, but when context reaches 1M tokens, the token count itself becomes the problem.

DeepSeek-V4 compresses old context into blocks. The model does not necessarily preserve full KV for every distant token. Instead, multiple tokens form compressed entries.

It is a bit like reading a very long book: you remember recent pages in detail, while earlier chapters are stored more as summaries, themes, and key clues. DeepSeek-V4’s attention design follows a similar split: keep detail nearby, use compressed representation farther away.

CSA: 4x compression plus sparse retrieval

CSA stands for Compressed Sparse Attention. It is the finer-grained long-context compression mechanism.

In CSA, the model compresses neighboring tokens into fewer KV entries. The Hugging Face Transformers documentation gives a default compression ratio of m=4, meaning roughly every four tokens become one compressed entry.

But it is not simple averaging. CSA uses a learned compression pool and overlapping windows so the model can preserve more useful information. After compression, the query does not attend to all compressed blocks directly. It first uses a Lightning Indexer to score them, selects the most relevant top-k compressed blocks, and then performs the core attention computation.

This gives two benefits:

The number of historical KV entries becomes smaller.
Each query only looks at a relevant subset of compressed blocks.
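The compress-then-retrieve flow can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not the actual CSA implementation: mean pooling stands in for the learned compression pool, overlapping windows are omitted, a dot product stands in for the Lightning Indexer, and each compressed entry serves as both key and value.

```python
import math


def compress_blocks(vectors, ratio=4):
    """Pool each run of `ratio` token vectors into one entry (mean pooling)."""
    blocks = []
    for start in range(0, len(vectors) - ratio + 1, ratio):
        group = vectors[start:start + ratio]
        blocks.append([sum(column) / ratio for column in zip(*group)])
    return blocks


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_blocks(query, blocks, k):
    """Return the indices of the k blocks with the highest relevance score."""
    ranked = sorted(range(len(blocks)), key=lambda i: dot(query, blocks[i]), reverse=True)
    return sorted(ranked[:k])


def attend(query, entries):
    """Softmax attention where each entry serves as both key and value."""
    scale = math.sqrt(len(query))
    logits = [dot(query, entry) / scale for entry in entries]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    width = len(entries[0])
    return [sum(w * e[d] for w, e in zip(weights, entries)) / total for d in range(width)]


tokens = [[float(i), 1.0] for i in range(16)]   # 16 toy token vectors
blocks = compress_blocks(tokens, ratio=4)        # 4 compressed entries
query = [1.0, 0.0]
chosen = top_k_blocks(query, blocks, k=2)        # indices of the 2 best blocks
output = attend(query, [blocks[i] for i in chosen])
print(len(blocks), chosen, output)
```

Sixteen tokens become four entries, and the query attends to only two of them. Both savings listed above show up: fewer entries stored, and fewer entries read per query.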

CSA is suitable for long-range context where details still matter, such as codebases, long documents, and tool-call histories.

HCA: 128x compression plus dense attention

HCA stands for Heavily Compressed Attention, and it is more aggressive.

The Transformers documentation gives a default compression ratio of m'=128. HCA compresses a much longer context span into one compressed entry. Because the compressed sequence becomes very short, it does not need sparse top-k retrieval like CSA. The query can simply perform dense attention over all HCA compressed entries.

HCA acts more like a global summary. It does not try to preserve every detail. Instead, it covers very long history at extremely low cost, helping the model stay aware of global context, long-range topics, and far-away information.

If CSA is “searchable compressed notes,” HCA is closer to a “global table of contents and summary.”
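A quick count shows why HCA can afford dense attention. The sketch uses the two compression ratios quoted above (4 and 128). The top-k budget is a hypothetical number for illustration, not a documented default.

```python
def entries_attended(seq_len, ratio, top_k=None):
    """Compressed entries a query attends to for one long-range branch."""
    total = seq_len // ratio
    # Sparse retrieval caps the attended set; dense attention uses all of it.
    return total if top_k is None else min(top_k, total)


SEQ_LEN = 1_048_576
TOP_K = 1024  # hypothetical selection budget, not a documented default

print("CSA entries stored:  ", SEQ_LEN // 4)
print("CSA entries attended:", entries_attended(SEQ_LEN, 4, TOP_K))
print("HCA entries stored:  ", SEQ_LEN // 128)
print("HCA entries attended:", entries_attended(SEQ_LEN, 128))
```

At 1M tokens, 4x compression still leaves 262,144 entries, which is why CSA needs a selection step. At 128x only 8,192 remain, a size that ordinary dense attention handles comfortably.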

Sliding window: recent context keeps details

DeepSeek-V4 does not compress everything.

In addition to CSA and HCA, it keeps a sliding-window branch for the most recent uncompressed context. The Transformers documentation notes that DeepSeek-V4 attention blocks concatenate long-range compressed branches with sliding-window K/V.

This matters. When generating the next token, the nearest context is often the most important: variable names, function signatures, the current sentence, fresh tool outputs, or the user’s latest instruction. If recent context were over-compressed, output quality would suffer.

So the design is:

Nearby context: preserve uncompressed details.
Mid-to-long context: use CSA for searchable compression.
Farther context: use HCA for heavily compressed global summary.

Hybrid layer stack: different layers use different attention

DeepSeek-V4 does not use one attention mechanism in every layer.

The Hugging Face DeepSeek-V4 article notes that V4-Pro’s 61-layer structure uses HCA in the first two layers, alternates CSA and HCA afterward, and uses a sliding-window MTP block at the end. The Transformers documentation also describes V4-Pro as using two HCA bootstrap layers followed by alternating CSA/HCA layers.
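The layer pattern described above can be written down as a small schedule builder. Two details are assumptions on my part: that the alternation starts with CSA, and that the trailing sliding-window MTP block sits outside the list built here.

```python
def layer_schedule(n_layers=61, bootstrap_hca=2):
    """Attention type per layer: HCA bootstrap layers, then alternating CSA/HCA."""
    kinds = ["HCA"] * bootstrap_hca
    for i in range(n_layers - bootstrap_hca):
        # Assumption: the alternation starts with CSA.
        kinds.append("CSA" if i % 2 == 0 else "HCA")
    return kinds


schedule = layer_schedule()
print(schedule[:6])
print("CSA layers:", schedule.count("CSA"), "HCA layers:", schedule.count("HCA"))
```

Under those assumptions the stack splits roughly in half between searchable compression and global summary layers.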

This shows that DeepSeek-V4 treats attention as a layered system. Different layers handle different information roles: some favor global compression, some favor sparse retrieval, and some preserve local windows.

Compared with using one attention type everywhere, this hybrid structure is more complex but better suited to 1M-token context.

FP8 and FP4 further reduce cache cost

DeepSeek-V4’s savings do not come only from compression ratio.

The Hugging Face article notes that most KV entries in V4 use FP8 storage, RoPE-related dimensions remain BF16, and the Lightning Indexer in CSA uses FP4. Compression ratio, low-precision storage, and sparse retrieval together create very low KV Cache usage.
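The effect of mixed precision on a single cached entry is easy to estimate. The sketch below keeps the split described above (FP8 for most dimensions, BF16 for RoPE dimensions, FP4 for the indexer), but the dimension counts are hypothetical and only there to make the arithmetic visible.

```python
BYTES = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def entry_bytes(content_dims, rope_dims, content_fmt="fp8", rope_fmt="bf16"):
    """Storage for one cached entry with separate formats for content and RoPE dims."""
    return content_dims * BYTES[content_fmt] + rope_dims * BYTES[rope_fmt]


# Hypothetical widths, for illustration only.
mixed = entry_bytes(content_dims=512, rope_dims=64)
all_bf16 = entry_bytes(512, 64, content_fmt="bf16")
indexer_key = 128 * BYTES["fp4"]  # hypothetical 128-dim indexer key in FP4

print("mixed FP8/BF16 entry:", mixed, "bytes")
print("all-BF16 entry:      ", all_bf16, "bytes")
print("FP4 indexer key:     ", indexer_key, "bytes")
```

With these assumed widths, the mixed-precision entry is a bit more than half the size of an all-BF16 one, and that saving multiplies with the reduction in entry count from compression.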

This is a reminder: do not only look at the headline context length. Deployment feasibility is determined by VRAM usage, bandwidth pressure, latency, and implementation quality under long context.

Differences from other models

Compared with traditional MHA, DeepSeek-V4 no longer keeps full attention memory for every token in long history, so cache pressure drops sharply.

Compared with GQA, DeepSeek-V4 does not merely reduce the number of KV heads. It also reduces the number of KV entries for long history. GQA still accumulates cache linearly with sequence length; V4 compresses distant context into blocks.

Compared with DeepSeek-V3’s MLA, V4 extends optimization from “making each token representation more compact” to “compressing the number of historical token entries.” MLA already lowers per-token KV cost significantly, but under million-token context, sequence length remains a bottleneck.

Compared with ordinary sparse attention, CSA compresses first and then performs sparse retrieval over a shorter compressed sequence. HCA goes further, using 128x compression so dense attention becomes cheap.

What it means for agents and long tasks

Agent workflows are especially hungry for long context. They read files, call tools, receive tool results, generate plans, revise plans, and call tools again. The longer the context, the more likely KV Cache becomes the bottleneck.

DeepSeek-V4’s cache design may help in several ways:

Easier handling of long codebases, long documents, and multi-round tool histories.
Less KV Cache pressure on time to first token and throughput.
Longer context or more concurrent requests on the same hardware.
Million-token context that is closer to practical deployment, not just a benchmark number.

But compressed attention is not free. Compressing historical tokens into blocks involves information trade-offs. The model must balance saving VRAM with preserving retrievable details. Real performance depends on the task: code navigation, legal documents, long-form QA, and agent toolchains all have different detail-recall needs.

Do not read 2% as 2% of all cost

“KV Cache is about 2% of GQA” is easy to misread.

It mainly refers to KV Cache memory size. It does not mean total inference cost drops to 2%, or that every scenario becomes 50x faster. Inference still includes model weight reads, MoE routing, feed-forward networks, attention computation, scheduling, and communication overhead.

The Hugging Face article separates two numbers: in 1M-token context, DeepSeek-V4-Pro’s per-token inference FLOPs are 27% of DeepSeek-V3.2’s, while its KV Cache is 10% of DeepSeek-V3.2’s. Cache and compute are different dimensions.
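A simplified model shows why a 2% cache does not mean a 50x speedup. It treats each decoding step as a cache-dependent share plus everything else, and shrinks only the first part. The shares are hypothetical; real systems overlap work in ways this ignores.

```python
def step_speedup(cache_fraction, cache_scale):
    """Overall speedup when only the cache-bound share of a step shrinks."""
    remaining = (1.0 - cache_fraction) + cache_fraction * cache_scale
    return 1.0 / remaining


# Hypothetical shares of step time that depend on KV Cache traffic.
for fraction in (0.2, 0.5, 0.8):
    print(f"cache share {fraction:.0%}: {step_speedup(fraction, 0.02):.2f}x")
```

Even if 80% of a step depended on the cache, shrinking the cache to 2% would give less than a 5x speedup in this model. The 50x figure would require the cache to account for the entire step.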

The safer statement is: DeepSeek-V4 greatly reduces KV Cache pressure for ultra-long context, improving deployment feasibility for million-token scenarios. Actual latency and throughput still depend on implementation, hardware, batching, quantization, and inference framework.

Summary

The biggest difference between DeepSeek-V4 and other large models is that it moves KV Cache optimization from the attention-head dimension into the sequence-length dimension.

GQA stores fewer KV heads. MLA makes each token’s KV representation more compact. DeepSeek-V4 further aggregates distant tokens into compressed blocks and combines CSA, HCA, sliding windows, and low-precision storage so million-token context is not immediately blocked by KV Cache.

This is not a single trick. It is a long-context inference architecture: preserve details nearby, compress distant context, retrieve details when needed, and summarize globally when possible.

For developers and agent applications, the meaning is direct: long context is not just about accepting more input. It must be runnable, stable, and affordable. That is what DeepSeek-V4 changes.

How to Tune llama.cpp on 8GB VRAM: Why 32K Is Safer and 64K Needs KV Cache Quantization

Thu, 23 Apr 2026 12:13:04 +0800

Whether 8GB of VRAM is enough to run local LLMs smoothly, especially under long-context workloads, is one of the most common questions people run into when using llama.cpp.

There are three key takeaways worth remembering first:

On 8GB VRAM, 32K context is usually the safer balance point
If you really want to run 64K, KV Cache quantization is often essential
In full-GPU inference, blindly increasing CPU thread count can actually make performance worse

1. First, what do 32K, 64K, and KV Cache actually mean?

For many readers, these are the three terms that cause the most confusion.

32K and 64K refer to context length, meaning how many tokens the model can process at one time. Here, K means thousand, so 32K is about 32,000 tokens and 64K is about 64,000 tokens (in practice the exact values are usually 32,768 and 65,536). The longer the context, the more prior content the model can see at once, which is useful for long-document QA, long conversations, and multi-step analysis.

KV Cache is an intermediate-result cache that the model keeps in order to speed up autoregressive generation. You can think of it like this: once the model has already read and computed part of the context, it does not need to recompute everything from scratch every time. Instead, it stores key intermediate information and reuses it. The K and V come from Key and Value in the Transformer architecture.

Why do these three terms always appear together? Because:

32K and 64K define how much content you want the model to remember at once
KV Cache determines how much extra VRAM is needed to maintain that memory
The longer the context, the larger the KV Cache usually becomes, and the higher the VRAM pressure gets

So when long-context inference slows down, the root problem is often not that the model is “bad at computing”, but that the cache has grown large enough to push VRAM to its limit.

2. Why does 32K perform so differently from 64K?

With roughly 30,000 Chinese characters from The Three-Body Problem as a stress-test input, the gap between 32K and 64K context can look dramatic: on the same document, the 64K run can be much slower and total runtime can rise significantly.

The reason is not that the model suddenly becomes worse. The real issue is hitting the VRAM boundary.

At 32K, model weights plus cache may still fit within 8GB VRAM, so most data traffic stays inside the GPU’s own memory. But once you move to 64K, the cache grows further, total memory use approaches or exceeds the VRAM ceiling, and part of the data gets pushed into shared or system memory.
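A back-of-the-envelope budget check makes the boundary visible. The model shape, weight size, and overhead below are hypothetical numbers for illustration; substitute the values for your own model and quantization.

```python
GIB = 2**30


def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """KV Cache size in GiB for one sequence (Keys plus Values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / GIB


def fits(vram_gib, weights_gib, kv_gib, overhead_gib=0.5):
    """True when weights, cache, and working buffers stay inside VRAM."""
    return weights_gib + kv_gib + overhead_gib <= vram_gib


# Hypothetical model: 28 layers, 4 KV heads, head dim 128, 4.5 GiB of weights.
for ctx in (32_768, 65_536):
    kv = kv_cache_gib(ctx, n_layers=28, n_kv_heads=4, head_dim=128)
    print(f"{ctx} tokens: cache {kv:.2f} GiB, fits in 8 GiB: {fits(8.0, 4.5, kv)}")
```

In this example the 32K cache is 1.75 GiB and the total stays under 8 GiB, while the 64K cache is 3.5 GiB and pushes the total past the limit. Nothing about the model changed except where its data has to live.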

At that point, what collapses is not raw compute, but bandwidth.

In other words, what looks like “context doubled and performance crashed” is often really a case of the data path falling out of VRAM and into much slower memory.

3. If you want 64K, KV Cache quantization matters a lot

One of the most important conclusions for 8GB VRAM users is that KV Cache quantization matters a great deal.

Without changing the model itself, quantizing only the cache can directly reduce cache memory usage under long context. That means some of the data that previously spilled out of VRAM can move back into VRAM. As a result, 64K is still heavier than 32K, but it is less likely to fall into the slowest performance zone.

Put simply:

32K is the more practical default range for 8GB VRAM
64K is not impossible
But without cache quantization, performance can drop from “usable” to “hard to use”
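The saving from cache quantization can be estimated directly. The sketch assumes ggml's block layout for q8_0 and q4_0 (32 values plus one 2-byte scale per block), and the model shape is hypothetical.

```python
# Bytes per cached element. The q8_0 and q4_0 figures assume ggml's block
# layout: 32 values plus one 2-byte scale per block.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(ctx_tokens, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """KV Cache size in GiB for a given cache element type."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# Hypothetical model shape: 28 layers, 4 KV heads, head dim 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(65_536, 28, 4, 128, cache_type)
    print(f"64K context, {cache_type}: {size:.2f} GiB")
```

For this shape, q8_0 brings a 64K cache from 3.5 GiB down to about 1.86 GiB, close to what 32K costs at f16. That is the mechanism by which quantization can pull a 64K run back inside VRAM.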

If your goal is stable long-context inference, the usual priority should be:

Check whether VRAM is already near its ceiling
Decide whether to enable KV Cache quantization
Only then continue experimenting with more aggressive throughput settings

4. Low GPU utilization does not mean the GPU is idle

This is a point that often breaks intuition.

When people see only 20% or 30% GPU usage in Task Manager, they often assume:

the parameters must be wrong
the model is not really running on the GPU
the GPU is not being used fully

But the more likely explanation in llama.cpp inference is that the bottleneck is not core compute, but memory reads and writes.

That means GPU cores may finish a batch of computation quickly, then spend the rest of the time waiting for the next batch of weights or cached data to arrive.

So what you see becomes:

core utilization is not especially high
but end-to-end speed still fails to improve

This is not the GPU being lazy. It is the data path being too narrow.

That is why you should not look only at GPU Usage when judging local LLM performance. VRAM capacity, memory bandwidth, and cache spillover often matter more.
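A memory-bound estimate shows how little spillover it takes to hurt. The sketch assumes every decode step reads the weights and cache once, and that compute time is negligible by comparison. The byte counts and bandwidth figures are hypothetical.

```python
def seconds_per_token(bytes_in_vram, bytes_spilled, vram_gbps, slow_gbps):
    """Lower bound on decode time per token if every byte is read once."""
    return bytes_in_vram / (vram_gbps * 1e9) + bytes_spilled / (slow_gbps * 1e9)


# Hypothetical figures: 7 GB read per token, 250 GB/s VRAM, 25 GB/s spill path.
all_in_vram = seconds_per_token(7e9, 0.0, 250, 25)
one_gb_spilled = seconds_per_token(6e9, 1e9, 250, 25)
print(f"all in VRAM:   {1 / all_in_vram:.1f} tokens/s")
print(f"1 GB spilled:  {1 / one_gb_spilled:.1f} tokens/s")
```

In this model, moving one seventh of the data onto a path that is ten times slower more than doubles the time per token. GPU core utilization would look low in both cases, because the cores are waiting on memory either way.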

5. Increasing throughput parameters can help, but only if VRAM can handle it

Another useful idea is this: if GPU cores are not fully saturated, maybe you can increase throughput-related parameters so the GPU processes more data at once and uses its parallelism more effectively.

This can indeed improve speed.

But there is an important condition: VRAM must still have headroom.

Because once you increase throughput-related settings, you often also increase VRAM usage. If you are already in a 64K scenario with large cache and VRAM near exhaustion, pushing those parameters further can lead to two outcomes:

a crash
or a fallback into much slower shared-memory behavior

So the safer sequence is usually not “max out the knobs first”, but:

protect the VRAM boundary first
then try throughput optimization
after every change, check both speed and stability again

6. More CPU threads are not always better

This is one of the easiest traps to fall into.

It is very natural to assume that more threads should mean better speed. But in practice, once the model is already running mostly on the GPU, forcing CPU thread count higher can make performance noticeably worse.

The reason is straightforward.

In full-GPU inference, the CPU is more of a scheduler and preprocessing helper than the main compute engine. If you open too many threads, CPU-side thread contention, scheduling overhead, and context-switching costs all become heavier, which can disrupt the data flow that should have stayed smooth.

The result is:

the CPU looks busier
but overall speed gets slower

So in this kind of setup, default settings or lower thread counts are often more reliable than simply maxing everything out.

7. A more practical approach for 8GB VRAM users

If we compress the conclusions above into a practical workflow, it looks roughly like this:

1. Treat 32K as the default goal

If you only have an 8GB GPU, do not rush to chase 64K. 32K is usually the more realistic balance between speed, stability, and memory usage.

2. If you want 64K, deal with the cache first

Do not start by asking whether you can squeeze out a little more speed. First confirm whether KV Cache is quantized and whether VRAM is already near the limit.

3. Do not judge everything by GPU utilization

Low utilization does not necessarily mean the settings are wrong. It may simply mean memory bandwidth is the real bottleneck.

4. Throughput optimization is valid, but do not cross the VRAM boundary

These parameters can help, but only if there is still enough VRAM headroom.

5. Be conservative with CPU threads first

If the model is already running mostly on the GPU, higher CPU thread counts are not automatically better. Start with defaults or lower thread counts, then test gradually.
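The workflow above can be captured in a small helper that builds a llama.cpp argument list. The flags shown (-m, -c, -ngl, -t, --cache-type-k, --cache-type-v) reflect my understanding of the llama.cpp command line, so verify them against --help for your build. The file name model.gguf is a placeholder, and 99 GPU layers is simply a large number meant to offload everything. Some builds also require flash attention to be enabled before the V cache can be quantized.

```python
def llama_args(model_path, ctx_tokens=32_768, kv_type=None, threads=None, gpu_layers=99):
    """Build an argument list for a llama.cpp binary such as llama-server."""
    args = ["-m", model_path, "-c", str(ctx_tokens), "-ngl", str(gpu_layers)]
    if kv_type is not None:
        # Quantize both halves of the KV Cache with the same type.
        args += ["--cache-type-k", kv_type, "--cache-type-v", kv_type]
    if threads is not None:
        # Leave unset to keep the default thread count.
        args += ["-t", str(threads)]
    return args


# Default goal: 32K context, unquantized cache, default threads.
print(llama_args("model.gguf"))
# 64K attempt: deal with the cache first.
print(llama_args("model.gguf", ctx_tokens=65_536, kv_type="q8_0"))
```

The defaults encode the priorities from the list: 32K first, cache quantization only when 64K is the goal, and no thread override unless testing shows it helps.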

Conclusion

The most valuable part of this discussion is not the benchmark numbers themselves, but one easily overlooked truth that they make clearer:

Local LLM tuning is often not about pushing every setting to the maximum. It is about understanding whether your real bottleneck is compute, VRAM capacity, memory bandwidth, or CPU scheduling.

For 8GB VRAM users, the safer strategy is usually not to force the longest possible context, but to protect the VRAM boundary first and only then decide how far to push further.

If you only remember one sentence, make it this:

32K is often the more stable working range for 8GB VRAM; 64K is possible, but only if you have already brought KV Cache and VRAM usage under control.