What is Token Efficiency? DeepSeek V4, big-model planning, and small-model execution

Fri, 15 May 2026 08:59:33 +0800

The next important metric for AI coding may not be who has the strongest model, but who can complete more verifiable work with fewer tokens, lower cost, and a more stable process.

That is the value of Token Efficiency.

Many people hear Token Efficiency and think only about cheaper models, longer context, or cheaper cache hits. Those are base conditions. Real productivity comes from model division of labor, task orchestration, context budgeting, and evaluation.

In other words, Token Efficiency is not a cost-saving trick. It is an engineering method for turning tokens into output.

DeepSeek V4: productizing the split between planner and executor

What is often missing from this discussion is how DeepSeek V4 is positioned.

DeepSeek V4 is not just another stronger model. It splits the two capabilities needed for Token Efficiency into V4 Pro and V4 Flash: V4 Pro is better suited for planning, reasoning, architecture judgment, and critical review, while V4 Flash fits high-frequency execution, batch rewriting, code completion, data organization, and ordinary agent-loop nodes.

That maps directly to two roles in AI coding:

V4 Pro: planner / consultant for requirement breakdown, technical design, complex bug analysis, architecture review, and final acceptance.
V4 Flash: executor for file scanning, simple implementation, test completion, documentation, candidate generation, and repetitive work.

DeepSeek’s API documentation shows that both V4 Flash and V4 Pro support 1M context, JSON Output, Tool Calls, Chat Prefix Completion, and FIM Completion. The pricing page also prices cache-hit input separately and notes that input cache-hit prices have been reduced to one tenth of the launch price.

Together, these features explain why DeepSeek V4 matters for Token Efficiency. A 1M context reduces the need for compression in complex agent tasks. Low cache-hit pricing lowers the cost of repeatedly loading prompts, project docs, code, and history. The Flash / Pro split avoids two bad extremes: using a flagship model for every step, or using an unstable small model for every step.
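
To make the cache-hit effect concrete, here is a minimal cost sketch. The prices are placeholders chosen only so that the cache-hit price is one tenth of the cache-miss price; they are not DeepSeek's actual rates, and the names `Pricing` and `request_cost` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per one million tokens. All values used below are placeholders."""
    input_miss: float
    input_hit: float
    output: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cache_hit_ratio: float) -> float:
    """Cost of one request when a share of the input tokens is served from cache."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (hit_tokens * pricing.input_hit
             + miss_tokens * pricing.input_miss
             + output_tokens * pricing.output)
    return total / 1_000_000


# Hypothetical price list, not taken from any real pricing page.
EXAMPLE = Pricing(input_miss=1.0, input_hit=0.1, output=2.0)

cold = request_cost(EXAMPLE, input_tokens=200_000, output_tokens=2_000, cache_hit_ratio=0.0)
warm = request_cost(EXAMPLE, input_tokens=200_000, output_tokens=2_000, cache_hit_ratio=0.9)
print(f"cold: ${cold:.3f}  warm: ${warm:.3f}")
```

With these placeholder numbers, an agent loop that keeps re-sending the same project context pays roughly a fifth as much per request once most of the prompt is cached. Swap in the provider's real price list before drawing conclusions.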

DeepSeek V4 should therefore be understood in three ways:

Cheap execution layer: many agent nodes can run on V4 Flash.
Usable judgment layer: key steps can still call V4 Pro.
Long-chain friendly: 1M context and cache pricing make codebases, docs, and tool history easier to keep in the usable window.

Its significance for AI coding is not that it adds one more model option. It offers a realistic cost structure for the “consultant model + executor model + harness orchestration” pattern.

Do not let the strongest model do everything

The old approach was to pick the smartest model and let it handle requirement analysis, code, tests, and summaries end to end.

That is simple but not always efficient. Many tasks do not need frontier reasoning. Expensive models should behave more like consultants, architects, or planners that appear only at key decision points.

A better structure is:

Big models break down problems and make key decisions.
Small models execute, batch-process, and repeat edits.
Tools and harnesses manage process, state, context, and validation.
Humans define product goals, accept results, and make tradeoffs.

This prevents frontier reasoning from being wasted on mechanical execution.
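
The division of labor above can be sketched as a small routing table. This is an illustration under assumptions: the model labels `big-model` and `small-model` are placeholders rather than real API model IDs, and the task kinds are simply taken from the planner and executor role descriptions earlier in the post.

```python
from enum import Enum


class Role(Enum):
    PLANNER = "planner"
    EXECUTOR = "executor"


# Placeholder labels; replace with whatever model IDs your provider exposes.
MODEL_FOR_ROLE = {
    Role.PLANNER: "big-model",
    Role.EXECUTOR: "small-model",
}

# Task kinds that deserve the expensive model, following the planner role above.
PLANNER_TASKS = {
    "requirement_breakdown",
    "technical_design",
    "complex_bug_analysis",
    "architecture_review",
    "final_acceptance",
}


def pick_role(task_kind: str) -> Role:
    """Send key decision points to the planner and everything else to the executor."""
    return Role.PLANNER if task_kind in PLANNER_TASKS else Role.EXECUTOR


def pick_model(task_kind: str) -> str:
    return MODEL_FOR_ROLE[pick_role(task_kind)]


print(pick_model("technical_design"))  # routed to the planner label
print(pick_model("file_scanning"))     # routed to the executor label
```

Unknown task kinds default to the cheap executor here. A more cautious harness might default to the planner instead and demote task kinds only after evaluation data supports it.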

Bigger context is not always better

Long context matters for coding agents because code, docs, chat history, test output, and logs all consume the window. When the window fills up, compression, forgetting, and misjudgment appear.

But long context does not mean dumping everything into the model.

Token Efficiency means each task should fit inside a clear, controlled context window:

Bring only necessary files.
Include only decision-relevant documents.
Keep only the current state from history.
Give each node clear input and output.
Compress completed work into structured summaries for the next node.
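
One way to enforce the rules above is to pack context against an explicit token budget. The sketch below assumes each candidate item already carries a relevance score from some earlier ranking step, and it uses a crude four-characters-per-token estimate; both are assumptions, and a real harness would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    name: str
    text: str
    relevance: float  # assumed to come from an earlier retrieval or ranking step


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about four characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def pack_context(items: list[ContextItem], budget_tokens: int) -> tuple[list[ContextItem], int]:
    """Greedily keep the most relevant items that still fit inside the budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = estimate_tokens(item.text)
        if used + cost <= budget_tokens:
            chosen.append(item)
            used += cost
    return chosen, used


candidates = [
    ContextItem("handler.py", "x" * 400, relevance=0.9),
    ContextItem("old_chat_log.txt", "y" * 400, relevance=0.5),
    ContextItem("state_summary.md", "z" * 40, relevance=0.7),
]
selected, used_tokens = pack_context(candidates, budget_tokens=120)
print([item.name for item in selected], used_tokens)
```

The greedy choice is deliberately simple. Its point is that anything left out is left out on purpose, rather than by silent truncation when the window fills.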

Cheap context can tempt people to include noise. Noise does not make a model smarter.

Harness matters more than a single model

Connecting Claude Code, Codex, or another coding agent to a cheap model is not enough. Small models drift in long-chain tasks unless a stronger process controls them.

A harness is a scheduling system. It decides how to split tasks, run nodes, choose models, validate results, retry failures, and pass context.

A useful orchestration system should answer:

Which tasks need planning?
Which tasks can execute directly?
Which nodes can run in parallel?
Which nodes must be serial?
Which nodes use big models or small models?
What is the context budget for each node?
What structured output does each node produce?
Who reviews and decides whether to continue?

Without this software layer, small models are merely cheap. With it, they can become leverage.
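
A minimal sketch of one harness responsibility, validating results and retrying failures, might look like this. The `small` and `big` callables stand in for real model clients, and the JSON check is just one example of a validator; none of these names come from an existing harness.

```python
import json
from typing import Callable

ModelCall = Callable[[str], str]   # prompt -> raw output; stands in for a real API client
Validator = Callable[[str], bool]


def is_json_object(text: str) -> bool:
    """Example validator: the node must return a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def run_node(prompt: str, small: ModelCall, big: ModelCall, validate: Validator,
             max_small_attempts: int = 2) -> tuple[str, str, int]:
    """Try the small model first and escalate to the big model if validation keeps failing.

    Returns (output, which_model, total_attempts).
    """
    for attempt in range(1, max_small_attempts + 1):
        output = small(prompt)
        if validate(output):
            return output, "small", attempt
    output = big(prompt)
    if validate(output):
        return output, "big", max_small_attempts + 1
    raise RuntimeError("node failed validation on both models; hand over to a human")


if __name__ == "__main__":
    # Stub models for demonstration only.
    flaky_small = lambda prompt: "sorry, here is some prose instead"
    careful_big = lambda prompt: '{"status": "done"}'
    print(run_node("summarize the diff as JSON", flaky_small, careful_big, is_json_object))
```

The small model stays the default, the big model becomes a fallback, and a human is the last resort. Every escalation is also a data point worth recording for later evaluation.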

Split tasks with DAGs

A good approach is to split complex work into a directed acyclic graph.

A feature task might become:

Requirement clarification
Technical design
Task decomposition
Implementation
Test completion
Code Review
Fixes
PR submission

Each node can be an independent agent with its own role, prompt, tools, permissions, and output format. Nodes should pass structured results, not long chat transcripts.

This makes each node shorter, easier for small models, and easier to measure.
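
The feature task above can be written down as a DAG with Python's standard `graphlib` module. The dependency edges are an assumption for illustration: the list in the post is linear, while this sketch lets implementation and test completion run in parallel after task decomposition.

```python
from graphlib import TopologicalSorter

# Maps each node to the set of nodes it depends on (assumed edges, see lead-in).
FEATURE_DAG = {
    "requirement_clarification": set(),
    "technical_design": {"requirement_clarification"},
    "task_decomposition": {"technical_design"},
    "implementation": {"task_decomposition"},
    "test_completion": {"task_decomposition"},
    "code_review": {"implementation", "test_completion"},
    "fixes": {"code_review"},
    "pr_submission": {"fixes"},
}


def execution_waves(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; nodes in the same wave have no pending dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for index, wave in enumerate(execution_waves(FEATURE_DAG), start=1):
    print(index, wave)
```

Each wave is a set of nodes the harness could dispatch concurrently, which answers the earlier questions about which nodes can run in parallel and which must be serial.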

Run multiple task replicas

When tokens are cheap enough, the same task does not have to run only once.

You can run the same task with different models, prompts, or orchestrations, then pick the best result or merge useful parts. This is suitable for design proposals, copy, test cases, bug hypotheses, refactor options, and code review.

It is not suitable for tasks with external side effects, shared mutable state, or unclear acceptance criteria.

The goal is not gambling. It is collecting comparable samples that can improve orchestration, model selection, and node skills.
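
As a sketch of the replica idea, the helper below runs several side-effect-free variants in parallel and keeps every scored sample, not just the winner. The variants and the scoring function are stand-ins; in practice each variant would call a model with a different prompt or configuration, and the score would come from tests or a reviewer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def best_of(variants: Sequence[Callable[[], str]],
            score: Callable[[str], float]) -> tuple[str, list[tuple[float, str]]]:
    """Run each variant once in parallel and return (best_output, all_scored_outputs)."""
    if not variants:
        raise ValueError("need at least one variant")
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        outputs = list(pool.map(lambda run: run(), variants))
    scored = [(score(output), output) for output in outputs]
    best_output = max(scored, key=lambda pair: pair[0])[1]
    return best_output, scored


if __name__ == "__main__":
    # Stub variants; the length-based score is a toy stand-in for a real evaluation.
    drafts = [lambda: "short plan", lambda: "a somewhat longer plan", lambda: "plan"]
    winner, samples = best_of(drafts, score=len)
    print(winner)
    print(samples)
```

Returning all scored samples is the important part: the losing outputs are the comparable data that later improves prompts, model selection, and orchestration.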

Build an evaluation system

Token Efficiency cannot be judged only by price. A cheap model with a high failure rate can consume more human time and become more expensive.

Start recording:

Completion rate
Human interventions
Tool-call failure rate
Test pass rate
Review findings
Token cost per task
Time per task
Rework count
Differences between model combinations

With this data, you can decide which tasks fit small models, which require big models, and which should stay human-led.
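
A lightweight way to start is one record per task run, grouped by model combination. The field names below are illustrative and cover only part of the list above; the key derived number is cost per completed task, which is where a cheap model with a high failure rate shows its real price.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    model_combo: str          # free-form label, e.g. "small-only" or "big+small"
    completed: bool
    human_interventions: int
    cost_usd: float
    seconds: float
    rework_count: int


def summarize(runs: list[TaskRun]) -> dict[str, dict[str, float]]:
    """Aggregate per model combination; failed runs still count toward cost."""
    groups: dict[str, list[TaskRun]] = defaultdict(list)
    for run in runs:
        groups[run.model_combo].append(run)

    summary = {}
    for combo, rows in groups.items():
        total = len(rows)
        done = sum(1 for row in rows if row.completed)
        total_cost = sum(row.cost_usd for row in rows)
        summary[combo] = {
            "runs": total,
            "completion_rate": done / total,
            "avg_interventions": sum(row.human_interventions for row in rows) / total,
            "avg_rework": sum(row.rework_count for row in rows) / total,
            "cost_per_completed_task": total_cost / done if done else float("inf"),
        }
    return summary
```

Human time is not priced in this sketch. If interventions dominate, converting them to an hourly cost and adding it to `cost_usd` gives a fairer comparison.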

Make business workflows atomic

Most users do not need to build a full harness today. But they can start decomposing their business workflow into atomic nodes.

Content production can become topic selection, research, outline, draft, fact check, style rewrite, SEO title, translation, and publishing check.

Software development can become requirement confirmation, technical design, data structure, API change, unit tests, implementation, migration script, documentation, and review.

Each node should have clear input, output, acceptance, and context limits. When harness tools mature, these workflows can plug in directly.
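
Even without a harness, the node contract can be written down as data and checked for completeness. The `NodeSpec` shape below is a hypothetical format, not one defined by any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    acceptance: str            # how a reviewer or a test decides the node is done
    max_context_tokens: int


def missing_fields(spec: NodeSpec) -> list[str]:
    """List the parts of the contract that are still undefined."""
    problems = []
    if not spec.inputs:
        problems.append("inputs")
    if not spec.outputs:
        problems.append("outputs")
    if not spec.acceptance.strip():
        problems.append("acceptance")
    if spec.max_context_tokens <= 0:
        problems.append("max_context_tokens")
    return problems


fact_check = NodeSpec(
    name="fact_check",
    inputs=("draft", "research_notes"),
    outputs=("list_of_unsupported_claims",),
    acceptance="every claim in the draft is marked supported or unsupported",
    max_context_tokens=8_000,
)
vague_node = NodeSpec("style_rewrite", inputs=("draft",), outputs=(),
                      acceptance="", max_context_tokens=0)

print(missing_fields(fact_check))
print(missing_fields(vague_node))
```

A node that cannot pass this check is usually not atomic yet and needs to be split or specified further before any model is assigned to it.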

Hardware is not the first priority

Many discussions of Token Efficiency jump to local deployment and GPUs. For most people, an API should still be the first choice.

Before the economic model works, local hardware is only prepaid cost. A safer sequence is:

Use an API to validate the workflow.
Record task evaluation and cost.
Find stable high-frequency execution nodes.
Consider which nodes should be localized.
Then calculate hardware, power, maintenance, and depreciation.

For personal productivity, API is often enough. For startups exploring inference frameworks and model boundaries, local CUDA platforms can be useful. For production workloads with clear unit economics, multi-GPU deployment becomes worth discussing.
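
The last step of that sequence is plain arithmetic. The sketch below is a simple break-even model with made-up numbers. It treats power and maintenance as a monthly running cost and leaves depreciation to a comparison against the hardware's expected useful life.

```python
from typing import Optional


def breakeven_months(hardware_cost: float, monthly_running_cost: float,
                     monthly_api_cost: float) -> Optional[float]:
    """Months until local hardware pays for itself versus the API, or None if it never does."""
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical figures in USD, for illustration only.
print(breakeven_months(hardware_cost=6_000, monthly_running_cost=100, monthly_api_cost=600))
print(breakeven_months(hardware_cost=6_000, monthly_running_cost=100, monthly_api_cost=80))
```

If the break-even point lies beyond the period over which the hardware will realistically stay useful, the API remains the cheaper option. The monthly API figure should come from the recorded evaluation data, not from a guess.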

Summary

Token Efficiency is not about replacing expensive models with cheap ones. It is about redesigning the AI workflow.

Big models make key judgments, small models execute in bulk, the harness schedules and validates, and humans define goals and acceptance. Only when these layers work together can tokens reliably become productivity.

Models will get cheaper, context windows will grow, and small models will improve. The future gap may not be who calls the strongest model, but who can use the same tokens to produce more real output.

Drop MCP? Why CLI Is Becoming the Default Tool Layer for Agents

Fri, 10 Apr 2026 21:55:12 +0800

Over the last year, debates about agent toolchains have increasingly centered on one question:

Does MCP (Model Context Protocol) make tool calling simpler, or does it make simple tasks more complex?

For most day-to-day engineering tasks, CLI is becoming the more practical default.

The cost gap is not a UX issue but an order-of-magnitude issue

The biggest practical pressure in MCP is token overhead.

In common scenarios, MCP often has to load large tool schemas before actual execution. Using a GitHub MCP Server as an example, initialization alone can consume tens of thousands of tokens. For long tasks, this directly squeezes context budget.

Community benchmarks keep pointing to the same conclusion:

A single MCP call commonly costs several to dozens of times more than the equivalent CLI call.
Retry recovery is also more expensive (reconnect plus context reload).

This is not just “a little slower.” It scales into API cost, latency, and reliability issues.
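
A back-of-the-envelope model shows how a fixed schema load squeezes the budget. Every number below is a placeholder, not a benchmark result; measure your own setup before relying on any of them.

```python
def overhead_share(fixed_overhead_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed before any real work starts."""
    return fixed_overhead_tokens / context_window


def task_tokens(fixed_overhead_tokens: int, tokens_per_call: int, calls: int) -> int:
    """Total tokens for a task: one-time overhead plus per-call traffic."""
    return fixed_overhead_tokens + tokens_per_call * calls


# Placeholder figures for one ten-call task.
with_schema = task_tokens(fixed_overhead_tokens=40_000, tokens_per_call=1_500, calls=10)
without_schema = task_tokens(fixed_overhead_tokens=0, tokens_per_call=300, calls=10)

print(overhead_share(40_000, 200_000))
print(with_schema, without_schema)
```

The fixed term dominates short tasks, so the gap is widest exactly where agents make many small, frequent calls.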

Why models are naturally better at CLI

A frequently overlooked fact is training distribution.

LLMs have seen massive amounts of terminal text during training: commands, outputs, errors, scripts, and man pages. In other words, CLI interaction is already close to the model’s native input pattern.

By contrast, MCP’s JSON-RPC and tool schema style became widespread only in recent years. Models can learn it, but familiarity and compression efficiency are often still weaker than long-established CLI patterns.

That also explains why, in many cases:

for the same goal, CLI commands are shorter
outputs are easier to continue reasoning over
error recovery paths are more stable

Security and isolation: MCP still has catching up to do

MCP is not incapable of security, but its ecosystem is still early.

Common concerns today include:

Tool Poisoning in descriptions
behavior drift (Rug Pull)
same-name tool override (Shadowing)

CLI also has security risks (injection, privilege misuse, path risks), but its process model, permission boundaries, and audit chain have been validated through decades of engineering practice. In production, that predictability matters.
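
On the CLI side, a few standard-library habits cover the risks named above. The wrapper below is a minimal sketch, not a complete sandbox. It passes arguments as a list so that no shell is involved, checks the executable against an allowlist supplied by the caller, enforces a timeout, and truncates output so that one noisy command cannot flood the context. The function name and the limits are illustrative.

```python
import subprocess
import sys
from typing import AbstractSet, Sequence


def run_cli(argv: Sequence[str], allowed: AbstractSet[str],
            timeout_s: float = 30.0, max_chars: int = 4_000) -> tuple[int, str]:
    """Run an allowlisted command without a shell and return (exit_code, output)."""
    if not argv or argv[0] not in allowed:
        raise PermissionError(f"command not allowed: {list(argv[:1])}")
    proc = subprocess.run(
        list(argv),
        capture_output=True,   # collect stdout and stderr instead of inheriting them
        text=True,
        timeout=timeout_s,     # raises subprocess.TimeoutExpired on a hung command
        check=False,           # return the exit code instead of raising on failure
    )
    output = proc.stdout + proc.stderr
    if len(output) > max_chars:
        output = output[:max_chars] + "\n[truncated]"
    return proc.returncode, output


if __name__ == "__main__":
    # The Python interpreter itself serves as a portable demo command.
    code, out = run_cli([sys.executable, "-c", "print('hello')"], allowed={sys.executable})
    print(code, out.strip())
```

Because the command is a list and no shell is involved, model-generated text placed in an argument is not interpreted as shell syntax. Arguments can still be dangerous to the program that receives them, so privilege and path restrictions remain a matter for the operating system's permission model.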

This does not mean MCP has no value

I do not think MCP should be abandoned.

A more reasonable positioning is:

CLI handles the execution layer (local, low-latency, high-frequency calls)
MCP handles the connection layer (remote service discovery, unified auth, audit, and multitenancy)

That is the commonly discussed hybrid architecture: CLI + MCP Gateway.

When you integrate many remote systems and must enforce unified governance and compliance, MCP still has clear value. But for helping agents complete engineering work quickly, a CLI-first approach usually fits better with what current models do well.

In today’s engineering reality, CLI is closer to an agent’s native working language; MCP is better positioned as a connection protocol rather than the only execution protocol.

Token Efficiency on KnightLi Blog