Token on KnightLi Blog

How Much Extra Token Usage Do subagents Cost? Multi-Agent Costs and Usage Strategy

Sun, 31 May 2026 14:17:42 +0800

Using subagents or a multi-agent workflow usually increases token usage. The question is not whether it costs more, but how much more it costs, and whether the parallel speed or extra stability is worth it.

For small tasks, it is usually cheaper to let the main agent handle the work directly. Subagents become more useful when the task can be clearly split, or when an independent review is valuable.

A subagent is not a cheaper parallel thread

When people first see subagents, it is easy to think of them as parallel threads: the main agent handles one part, the subagent handles another part, and the task finishes faster, so it must be more efficient.

That is not how it works. A subagent is still a separate model call. It needs to read the task, understand the context, inspect files, reason through the problem, and produce an output. It is not a free copy of the main agent; it is an additional reasoning path.

So the key question is not “can this run in parallel?” The real question is: “Is the time saved or quality gained worth the extra token cost?”

Why token usage increases

A subagent call usually adds token usage from several places:

the task description written by the main agent;
the context passed to the subagent;
the files and details the subagent reads;
the subagent’s own reasoning and output;
the main agent’s follow-up review, integration, and verification.

If multiple agents read the same large files, the waste becomes more obvious. This is especially true for codebase analysis, long-document translation, and batch content cleanup. If the task is split poorly, many tokens are spent on repeatedly understanding the same context.

Re-reading context is the biggest token waste

The biggest waste is often not “opening one more agent.” It is having multiple agents read the same material again and again.

For example, suppose a task needs to process 6 posts. If 4 agents all begin by reading the full site structure, the full skill instructions, and the full article list before handling a small slice, the parallelism becomes expensive. A better approach is for the main agent to define the boundaries first, then let each subagent read only the article directory it owns.

The cheaper split usually looks like this:

each agent owns one clear directory;
the context passed to each subagent is as short as possible;
multiple agents do not repeat the same exploration;
the main agent performs one final review instead of asking every agent to run a full review;
checks that can be scripted are handled once by scripts, not repeated by several agents.

In other words, controlling subagent cost is mostly about boundaries, not just the number of agents.

Rough cost multipliers

The following is a rough estimate. Actual usage depends on context length, file size, task complexity, and the number of agents.

Scenario	Token increase
One subagent handles a small task	Around `1.2x - 2x`
2-4 agents handle a clearly split task in parallel	Around `2x - 5x`
Multiple agents each read many files and do long analysis	Possibly `5x+`
Main agent and subagents read the same large files repeatedly	The most obvious waste

This is not an exact billing formula. It is only a practical range. Real usage also depends on whether each agent needs to read full files, perform long reasoning, or repeatedly wait for more context.

How to write a more token-efficient subagent task

The broader the instruction, the more likely the subagent is to explore on its own, which increases token usage. A better prompt defines the boundaries clearly.

A good subagent task should include:

which files or directories it may handle;
which files are read-only and which files may be written;
whether existing files may be overwritten;
which fields must be preserved, such as date, slug, and aliases;
what the final report should include;
what should not be done, such as running a full build or editing unrelated files.

For translation, do not just say “translate this post into multiple languages.” A more efficient instruction is: “Only process content/post/2026/05/240; read index.zh-cn.md; only create missing index.en.md, index.zh-tw.md, index.ja.md, and index.es.md; skip files that already exist; preserve date and slug.”

That instruction is a little longer, but it reduces guessing and repeated exploration. It is often cheaper overall.

Splitting by file or directory is cheaper than splitting by language or step

For batch post translation, splitting by article directory is usually better than splitting by language.

Suppose 6 posts each need English, Traditional Chinese, Japanese, and Spanish versions. It is usually better to let one agent handle all languages inside one article directory, rather than assigning one agent to all English files and another agent to all Japanese files.

The reason is simple: front matter, code blocks, links, tables, and semantic context only need to be read once for a single post. If you split by language, several agents read the same source post repeatedly, increasing token usage.

The same logic applies to code tasks. Prefer splitting by module, directory, or component rather than by steps such as “analyze first, implement second, test third.” Step-based splitting often forces every agent to reread the same context.

When it is worth using subagents

The value of subagents mainly comes from two things: parallelism and an independent perspective.

Good use cases include:

translating multiple posts in batches;
editing several independent directories;
splitting frontend, backend, and test work cleanly;
one agent implements while another reviews risk;
high-risk changes that need a second perspective.

In these cases, token usage increases, but total elapsed time may drop noticeably. Each agent can also focus on one slice of the work.

When one review agent is worth it

A review agent is not always worth the cost. It is most useful when the task is risky, broad in impact, or easy for the main agent to miss edge cases.

Cases where a review agent is worth considering include:

changes involving login, payment, permissions, or data deletion;
multilingual content that affects categories, URLs, or internal links;
broad refactors that need independent regression review;
user requests for code review or risk review;
the main agent has implemented a change and needs a second view on edge cases.

Cases where a review agent is not worth it are also clear: single-file edits, title tweaks, simple front matter fixes, or running one command. The main agent can usually self-check those.

When it is not worth using subagents

Subagents are often not worth it for:

small single-file edits;
simple Q&A;
running one command;
very small changes;
tasks that cannot be split clearly;
tasks where the subagent must repeatedly wait for the main agent to provide context.

In these cases, using a subagent mostly adds overhead. The main agent is faster and cheaper.

My default strategy: prioritize token savings and add review only for risk

If the goal is to save tokens, a conservative default strategy works well:

Small tasks: do not use subagents.
Medium tasks: do not use subagents.
Large batch tasks: still avoid subagents by default unless the user explicitly wants parallel speed.
High-risk tasks: consider one extra agent for review, trading tokens for stability.

This strategy gives up some parallel speed, but it reduces repeated context reading and repeated reasoning.

If a task is large but not high risk, I would first look for scripts, batch checks, and structured local processing. Multiple agents make more sense when the split is very clear, or when the user explicitly wants parallel speed.

A more balanced strategy

If you want to control cost without completely giving up parallelism, a balanced strategy is:

default to the main agent doing the work directly;
consider subagents only when the task can be clearly split by file or directory;
each subagent reads only the files it owns;
do not let multiple agents read the same large files;
the main agent performs the final review of key fields, test results, and Git diff;
add one independent review agent only for high-risk tasks.

This avoids parallelism for its own sake. Subagents should serve a clear speed or quality goal, not become the default action.

Summary

Subagents and multi-agent workflows always increase token usage. One subagent may add only a little, but several agents running in parallel can multiply the cost.

Whether it is worth it depends on the task. If the work can be clearly split, or if the risk is high enough to need independent review, the extra tokens may be justified. For small single-file edits, simple Q&A, or routine checks, it is cheaper to let the main agent handle the task directly.

In one sentence: save tokens on small tasks, split only when the work has clear boundaries, and use extra agents for stability only when risk justifies it.

Why LLM APIs Charge by Tokens: A Clear Guide to Input, Output, and Context Costs

Sat, 25 Apr 2026 08:44:32 +0800

One of the easiest things to find confusing about LLM API billing is why almost every platform eventually comes down to one unit: token. The real question is simple: why do LLMs charge by token, and why can different tokens have different prices?

For many people who are just starting to use model APIs, the most confusing part is not model capability but the bill. Why does the cost rise so quickly even when you only ask a few questions? Why is input cheaper than output? Why does the bill start growing much faster once context becomes long?

A simple way to think about it is this: you are not paying for “one answer.” You are paying for the compute and bandwidth consumed throughout the whole inference process.

1. What is a token

In LLM billing, a token is neither a character count nor a word count. It is the unit a model uses when processing text.

A token might be:

A single Chinese character
Part of an English word
A punctuation mark
A short chunk of frequently seen text

That is why API platforms usually do not charge per sentence or per request. They charge according to how many tokens the model actually reads and generates.
This is much more reasonable than charging by request count, because one request might contain 20 characters, while another might include 200,000 tokens of context. The resource consumption is nowhere near the same.

2. Why input and output are priced separately

Most model APIs today split pricing into two parts:

Input token price
Output token price

And in many cases, output tokens cost more than input tokens.

The reason is not hard to understand.

When a model processes input, it is mainly reading and encoding existing content. But when it generates output, it has to predict the next token, then the next one, then the next one. This is not just reading. It is an ongoing process of inference and sampling, which usually costs more compute.

You can think of it roughly like this:

Input: handing materials to the model
Output: asking the model to write the answer on the spot

Writing on the spot usually costs more than reading the materials once, so it is very common for output pricing to be higher.

3. Why long context makes costs easier to lose control of

Many people think they are only adding a bit more background information, but from the model billing perspective, the impact is often much bigger than expected.

The reason is that each model call usually has to process the full context included in that request again.

That means if your request currently contains:

A system prompt
Conversation history
Tool return values
Long document chunks
Source code files

all of that goes into input token billing.

So what really makes bills grow is often not the final question itself, but the long chain of context attached before it.
As the number of conversation turns increases, tool calls accumulate, and prior messages keep getting fed back in, token cost grows round after round.

4. Why tool calls are especially likely to inflate token usage

In scenarios like agents, coding assistants, and workflow automation, token usage is often much higher than in ordinary chat.

The issue is not just that the model wrote a paragraph. It is that the workflow keeps producing content like:

Reading files
Inspecting logs
Calling APIs
Returning JSON
Feeding tool results back into the model

As long as the result of each tool call gets inserted into the next round of context, it becomes a new source of input tokens.

That is why many developers eventually realize:
the model’s unit price is not always the real problem. The workflow itself may be stacking token cost layer by layer.

For example, imagine a coding agent doing the following:

Read the project structure
Open several source files
Run a test suite
Feed the error logs back into the model
Read more related files

Each step can make later requests carry even more context. Even if the unit price does not change, the total bill can rise quickly.

5. Why the same kind of model can have very different prices

Differences in token pricing between models are not only about vendors wanting to charge more. They are usually tied directly to several factors:

Model size
Inference efficiency
Context length
Deployment cost
Target market

The larger the model, the more active parameters it uses, and the more complex its inference path is, the higher the cost of generating one token usually becomes.
If the model also supports ultra-long context, more complex reasoning, or better tool use, the infrastructure pressure increases even more.

So pricing is really covering several kinds of cost:

GPU or accelerator resources
VRAM usage
Inference latency
Network and service stability
Peak concurrency capacity

A cheaper model is not necessarily bad, and a more expensive model is not necessarily the right choice for every task. In many cases, the price gap reflects how much infrastructure cost a certain level of capability requires.

6. Why cached input is cheaper

Many model platforms now offer features such as:

cached input
prompt caching
prefix caching

The shared idea behind them is simple: if a large chunk of input has already been processed once, do not keep recomputing it from scratch at full price.

For example, if you repeatedly send the same system prompt, the same tool instructions, or the same long document prefix, the platform may be able to cache part of that computation. Then even though it is still input token usage, the cached portion can be billed at a lower rate.

This also explains why many API pricing pages show three or more price tiers:

Standard input
Cached input
Output

The difference is not that the text means different things. It is that the underlying computation may or may not be reusable.

7. Why “cheap tokens” do not automatically mean lower total cost

When people see a model advertised as “very cheap per million tokens,” the first instinct is often that total cost must also be lower. In reality, not always.

That is because total cost is roughly:

token unit price × actual token volume

And actual token volume can be amplified by many things:

Prompts that are too long
Conversation history that is never trimmed
Too much tool output fed back in
Overly verbose model output
Repeated retries for the same task

So the real bill is not determined by price alone. It is usually determined by:

Model unit price
Input length per round
Output length per round
Number of calls
Workflow design

That is also why a “low-cost model” can still end up expensive in some agent workflows. It may need more rounds, more supplemental context, and more retry cycles.

8. How developers should estimate token cost

If you want better budget control in a real project, a simple way to estimate cost is:

Measure average input tokens per request
Measure average output tokens per request
Estimate how many rounds one complete task requires
Multiply by the model’s pricing

For example:

8k tokens of input per round
1k tokens of output per round
10 rounds for one task

Then what you are really consuming is not “one Q&A exchange,” but:

About 80k tokens of input
About 10k tokens of output

And if logs, tool results, and file contents keep being added along the way, the total grows even further.

That is why budget planning should not only look at a single round. It should look at how many tokens a full task loop will consume from start to finish.

9. How to control the bill in practice

If you are already using APIs or agents, the following methods are usually the most effective:

Shorten the system prompt and cut repeated wording
Trim old conversation history regularly
Keep only necessary fields from tool outputs
Retrieve first, then send only relevant parts of long documents
Limit output length and avoid unbounded expansion
Use expensive models for high-value tasks and cheaper ones for lower-value tasks

In many cases, the best way to save money is not to switch blindly to a cheaper model. It is to remove unnecessary token consumption from the workflow first.

10. How to think about all of this

At the end of the day, token pricing is a way of charging for how much the model had to read, infer, and write.

It is not like traditional software pricing, where per-account, per-request, or monthly billing is enough to describe resource use. A model call is a dynamic computation process. The amount of context you send, the tools you invoke, and the output length you request all directly affect cost.

So the most important thing is not memorizing price tables. It is building the right intuition:

Long context increases input cost
Long output increases generation cost
Tool chains amplify total token usage
Caching and workflow design can change the bill significantly

Once those points are clear, the pricing structure of most LLM APIs becomes much easier to understand.

AI Terms Explained: Agent, MCP, RAG, and Token in Plain Language

Thu, 23 Apr 2026 13:13:40 +0800

When people first get into AI, what pushes them away is often not the models themselves, but the long list of terms that keeps showing up in every discussion. Agent, MCP, RAG, AIGC, and Token all look familiar, but without a simple explanation, many people only recognize the words without really understanding them.

This article follows a common beginner-friendly line of explanation and condenses 10 high-frequency AI terms into a set of meanings that is easier to remember. The goal is not to sound academic. It is to help you build a basic mental model that lets you follow everyday AI conversations.

10 common AI terms and what they mean

1. Agent: an AI that does more than chat

Agent can be understood as an AI assistant that actually gets work done.

A normal chatbot usually works in a simple question-and-answer pattern. An Agent goes a step further. It can break a task into steps, arrange a process, call tools, and return a finished result. If you ask it to organize materials, look something up, or generate a document, it may do more than give advice. It may actually chain those actions together and complete them.

That is why the key point of an Agent is not whether it can talk, but whether it can act.

2. OpenClaw: an AI assistant that stays on your computer

Here, OpenClaw is described as a kind of AI assistant that lives on your computer.

You can think of this type of tool as a more desktop-oriented AI helper. It does not only receive text. It may also observe the interface, call local tools, and execute tasks step by step. Compared with a normal web chat interface, this kind of tool emphasizes operational ability much more.

If Agent is the abstract idea of an execution-oriented AI, this kind of desktop assistant is a more concrete personal-computer version of that idea.

3. Skills: capability packs added to an Agent

Skills can be understood as functional modules or operating instructions for an Agent.

The same Agent can behave very differently depending on which Skills it has. Some may focus on copywriting, some on data organization, and some on code-related work. They are a bit like apps on a phone, and a bit like reusable workflows.

So in many cases, it is not that the model suddenly became smarter. It is that a clearer set of rules, tools, and steps was added behind it.

4. MCP: a unified way for AI to connect to tools

MCP stands for Model Context Protocol.

In everyday terms, it is a bit like a Type-C connector for the AI world. In the past, connecting a model to different tools often meant building separate integrations one by one. With a unified protocol, the way those tools connect becomes more standardized and easier to reuse.

For most users, the most important thing to remember is this: MCP is not about whether a model can answer a question. It is about how a model can connect to external tools and resources in a safe and stable way.

5. Gacha: AI output is inherently random

The term “gacha” often appears in AI image generation, video generation, and creative work.

The idea is simple. Even with the same prompt and the same general direction, the result can still be different each time. Sometimes the output is great. Sometimes it falls apart. That is why people compare repeated generation attempts to pulling gacha in a game.

What this really reminds us is that AI generation is not a fixed formula. It is a probabilistic process with variation.

6. API: the connection between an app and a model

API stands for Application Programming Interface.

You can think of it as the standard entry point through which programs communicate. When you call a model service from your own app, script, or editor, you are essentially using an API to send a request and receive a result.

If you compare a model service to a restaurant, then:

the menu is like the API documentation
placing an order is like making an API request
the kitchen sending back the dish is like the model returning a result

That is why many tools may look different on the surface while still calling some form of API underneath.

7. Multimodality: AI handles more than text

Multimodality means AI no longer only reads and writes text. It can process multiple kinds of input and output.

For example, it may be able to read images, understand voice, interpret video, generate pictures, or even support real-time voice and video interaction. Compared with early text-only models, multimodal models are much closer to having the combined abilities to see, hear, speak, and write.

That is also why many AI products are no longer centered around a single text box.

8. RAG: retrieve information first, then generate an answer

RAG stands for Retrieval-Augmented Generation.

It is useful for solving a practical problem: a model’s training data has a time boundary, and it does not automatically know your company’s newest documents, customer-service records, or business rules. The idea behind RAG is to retrieve relevant material from specified sources first, and then generate an answer based on that material.

Its value usually shows up in three ways:

answers are more likely to stay close to real source material
you can trace where the answer came from
new documents can be added and reflected quickly

That is why many enterprise knowledge bases, AI customer-service systems, and internal Q&A tools rely on RAG.

9. AIGC: the general term for AI-generated content

AIGC stands for AI Generated Content.

It is not a single tool. It is a broad label for content produced by AI, including text, images, audio, video, and more. AI writing, AI illustration, AI short-form video generation, and AI voice synthesis all fit under the umbrella of AIGC.

What matters most about this term is that it describes a way of producing content, not one specific model.

10. Token: the unit used to measure model processing

Token can be understood as the basic unit a model uses to process text.

It is not exactly the same as one character or one word, but in practice, you can treat it as the common unit used for model computation and billing. Your input consumes Token, the model’s output consumes Token, and the context kept in memory also takes up Token.

That is why model services keep talking about context length, cost control, and prompt compression. At the core, all of those topics are tied to Token.