<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Token on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/token/</link>
        <description>Recent content in Token on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 31 May 2026 14:17:42 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/token/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>How Much Extra Token Usage Do Subagents Cost? Multi-Agent Costs and Usage Strategy</title>
        <link>https://knightli.com/en/2026/05/31/subagent-multi-agent-token-cost/</link>
        <pubDate>Sun, 31 May 2026 14:17:42 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/31/subagent-multi-agent-token-cost/</guid>
        <description>&lt;p&gt;Using subagents or a multi-agent workflow usually increases token usage. The question is not whether it costs more, but how much more it costs, and whether the parallel speed or extra stability is worth it.&lt;/p&gt;
&lt;p&gt;For small tasks, it is usually cheaper to let the main agent handle the work directly. Subagents become more useful when the task can be clearly split, or when an independent review is valuable.&lt;/p&gt;
&lt;h2 id=&#34;a-subagent-is-not-a-cheaper-parallel-thread&#34;&gt;A subagent is not a cheaper parallel thread
&lt;/h2&gt;&lt;p&gt;When people first see subagents, it is easy to think of them as parallel threads: the main agent handles one part, the subagent handles another part, and the task finishes faster, so it must be more efficient.&lt;/p&gt;
&lt;p&gt;That is not how it works. A subagent is still a separate model call. It needs to read the task, understand the context, inspect files, reason through the problem, and produce an output. It is not a free copy of the main agent; it is an additional reasoning path.&lt;/p&gt;
&lt;p&gt;So the key question is not &amp;ldquo;can this run in parallel?&amp;rdquo; The real question is: &amp;ldquo;Is the time saved or quality gained worth the extra token cost?&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;why-token-usage-increases&#34;&gt;Why token usage increases
&lt;/h2&gt;&lt;p&gt;A subagent call usually adds token usage from several places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the task description written by the main agent;&lt;/li&gt;
&lt;li&gt;the context passed to the subagent;&lt;/li&gt;
&lt;li&gt;the files and details the subagent reads;&lt;/li&gt;
&lt;li&gt;the subagent&amp;rsquo;s own reasoning and output;&lt;/li&gt;
&lt;li&gt;the main agent&amp;rsquo;s follow-up review, integration, and verification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If multiple agents read the same large files, the waste becomes more obvious. This is especially true for codebase analysis, long-document translation, and batch content cleanup. If the task is split poorly, many tokens are spent on repeatedly understanding the same context.&lt;/p&gt;
&lt;h2 id=&#34;re-reading-context-is-the-biggest-token-waste&#34;&gt;Re-reading context is the biggest token waste
&lt;/h2&gt;&lt;p&gt;The biggest waste is often not &amp;ldquo;opening one more agent.&amp;rdquo; It is having multiple agents read the same material again and again.&lt;/p&gt;
&lt;p&gt;For example, suppose a task needs to process 6 posts. If 4 agents all begin by reading the full site structure, the full skill instructions, and the full article list before handling a small slice, the parallelism becomes expensive. A better approach is for the main agent to define the boundaries first, then let each subagent read only the article directory it owns.&lt;/p&gt;
&lt;p&gt;The cheaper split usually looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;each agent owns one clear directory;&lt;/li&gt;
&lt;li&gt;the context passed to each subagent is as short as possible;&lt;/li&gt;
&lt;li&gt;multiple agents do not repeat the same exploration;&lt;/li&gt;
&lt;li&gt;the main agent performs one final review instead of asking every agent to run a full review;&lt;/li&gt;
&lt;li&gt;checks that can be scripted are handled once by scripts, not repeated by several agents.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, controlling subagent cost is mostly about boundaries, not just the number of agents.&lt;/p&gt;
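&lt;p&gt;To put rough numbers on the 6-post example above, here is a minimal sketch. The token counts are made-up placeholders, not measurements; only the 6 posts and 4 agents come from the example. It compares every subagent re-reading the shared context with the main agent reading it once.&lt;/p&gt;

```python
# Hypothetical token counts; none of these figures come from a real run.
SHARED_CONTEXT_TOKENS = 12_000  # site structure + skill instructions + article list
POST_TOKENS = 3_000             # one post directory
NUM_POSTS = 6
NUM_AGENTS = 4


def total_input_tokens(shared_reads, posts):
    """Tokens read when the shared context is loaded `shared_reads` times."""
    return shared_reads * SHARED_CONTEXT_TOKENS + posts * POST_TOKENS


# Every subagent re-reads the shared context before touching its slice.
naive = total_input_tokens(shared_reads=NUM_AGENTS, posts=NUM_POSTS)

# The main agent reads the shared context once and hands out directories.
bounded = total_input_tokens(shared_reads=1, posts=NUM_POSTS)

print(naive, bounded, naive - bounded)
```

&lt;p&gt;With these assumed sizes, the shared context accounts for the whole gap, because the post content is read once in both cases.&lt;/p&gt;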
&lt;h2 id=&#34;rough-cost-multipliers&#34;&gt;Rough cost multipliers
&lt;/h2&gt;&lt;p&gt;The following is a rough estimate. Actual usage depends on context length, file size, task complexity, and the number of agents.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Scenario&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Token increase&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;One subagent handles a small task&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;Around &lt;code&gt;1.2x - 2x&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2-4 agents handle a clearly split task in parallel&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;Around &lt;code&gt;2x - 5x&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Multiple agents each read many files and do long analysis&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;Possibly &lt;code&gt;5x+&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Main agent and subagents read the same large files repeatedly&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;The most obvious waste&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is not an exact billing formula. It is only a practical range. Real usage also depends on whether each agent needs to read full files, perform long reasoning, or repeatedly wait for more context.&lt;/p&gt;
&lt;h2 id=&#34;how-to-write-a-more-token-efficient-subagent-task&#34;&gt;How to write a more token-efficient subagent task
&lt;/h2&gt;&lt;p&gt;The broader the instruction, the more likely the subagent is to explore on its own, which increases token usage. A better prompt defines the boundaries clearly.&lt;/p&gt;
&lt;p&gt;A good subagent task should include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which files or directories it may handle;&lt;/li&gt;
&lt;li&gt;which files are read-only and which files may be written;&lt;/li&gt;
&lt;li&gt;whether existing files may be overwritten;&lt;/li&gt;
&lt;li&gt;which fields must be preserved, such as &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;slug&lt;/code&gt;, and &lt;code&gt;aliases&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;what the final report should include;&lt;/li&gt;
&lt;li&gt;what should not be done, such as running a full build or editing unrelated files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For translation, do not just say &amp;ldquo;translate this post into multiple languages.&amp;rdquo; A more efficient instruction is: &amp;ldquo;Only process &lt;code&gt;content/post/2026/05/240&lt;/code&gt;; read &lt;code&gt;index.zh-cn.md&lt;/code&gt;; only create missing &lt;code&gt;index.en.md&lt;/code&gt;, &lt;code&gt;index.zh-tw.md&lt;/code&gt;, &lt;code&gt;index.ja.md&lt;/code&gt;, and &lt;code&gt;index.es.md&lt;/code&gt;; skip files that already exist; preserve &lt;code&gt;date&lt;/code&gt; and &lt;code&gt;slug&lt;/code&gt;.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That instruction is a little longer, but it reduces guessing and repeated exploration. It is often cheaper overall.&lt;/p&gt;
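&lt;p&gt;One way to keep such instructions consistent is to build them from a small structure. The sketch below is only an illustration: the &lt;code&gt;SubagentTask&lt;/code&gt; class and its field names are hypothetical and not part of any agent framework. The directory and file names are taken from the translation example above.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class SubagentTask:
    """Hypothetical container for the boundaries of one subagent task."""
    directory: str
    read_only: list
    may_create: list
    preserve_fields: list
    forbidden: list = field(default_factory=list)

    def to_prompt(self):
        # Each line states one boundary so the subagent does not have to guess.
        lines = [
            f"Only process {self.directory}.",
            f"Read (do not modify): {', '.join(self.read_only)}.",
            f"Create only if missing: {', '.join(self.may_create)}.",
            f"Preserve these front matter fields: {', '.join(self.preserve_fields)}.",
        ]
        if self.forbidden:
            lines.append(f"Do not: {'; '.join(self.forbidden)}.")
        return "\n".join(lines)


task = SubagentTask(
    directory="content/post/2026/05/240",
    read_only=["index.zh-cn.md"],
    may_create=["index.en.md", "index.zh-tw.md", "index.ja.md", "index.es.md"],
    preserve_fields=["date", "slug"],
    forbidden=["run a full build", "edit unrelated files"],
)
print(task.to_prompt())
```

&lt;p&gt;The point of the structure is only that every boundary gets written down each time. The wording of the prompt itself can be whatever works for your agent.&lt;/p&gt;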
&lt;h2 id=&#34;splitting-by-file-or-directory-is-cheaper-than-splitting-by-language-or-step&#34;&gt;Splitting by file or directory is cheaper than splitting by language or step
&lt;/h2&gt;&lt;p&gt;For batch post translation, splitting by article directory is usually better than splitting by language.&lt;/p&gt;
&lt;p&gt;Suppose 6 posts each need English, Traditional Chinese, Japanese, and Spanish versions. It is usually better to let one agent handle all languages inside one article directory, rather than assigning one agent to all English files and another agent to all Japanese files.&lt;/p&gt;
&lt;p&gt;The reason is simple: front matter, code blocks, links, tables, and semantic context only need to be read once for a single post. If you split by language, several agents read the same source post repeatedly, increasing token usage.&lt;/p&gt;
&lt;p&gt;The same logic applies to code tasks. Prefer splitting by module, directory, or component rather than by steps such as &amp;ldquo;analyze first, implement second, test third.&amp;rdquo; Step-based splitting often forces every agent to reread the same context.&lt;/p&gt;
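&lt;p&gt;The difference between the two splits can be counted directly. This is a simplified sketch that only counts how many times a source post is read, using the 6 posts and 4 target languages from the example above. The post names are placeholders.&lt;/p&gt;

```python
POSTS = ["post-1", "post-2", "post-3", "post-4", "post-5", "post-6"]  # placeholder names
LANGUAGES = ["en", "zh-tw", "ja", "es"]


def source_reads_by_directory(posts):
    # One agent per post: it reads the source once, then writes every language.
    return len(posts)


def source_reads_by_language(posts, languages):
    # One agent per language: each agent has to read every source post.
    return len(posts) * len(languages)


print(source_reads_by_directory(POSTS))            # 6
print(source_reads_by_language(POSTS, LANGUAGES))  # 24
```

&lt;p&gt;The output written is the same in both cases (24 translated files). Only the number of times the source material is read changes.&lt;/p&gt;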
&lt;h2 id=&#34;when-it-is-worth-using-subagents&#34;&gt;When it is worth using subagents
&lt;/h2&gt;&lt;p&gt;The value of subagents mainly comes from two things: parallelism and an independent perspective.&lt;/p&gt;
&lt;p&gt;Good use cases include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;translating multiple posts in batches;&lt;/li&gt;
&lt;li&gt;editing several independent directories;&lt;/li&gt;
&lt;li&gt;splitting frontend, backend, and test work cleanly;&lt;/li&gt;
&lt;li&gt;one agent implements while another reviews risk;&lt;/li&gt;
&lt;li&gt;high-risk changes that need a second perspective.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In these cases, token usage increases, but total elapsed time may drop noticeably. Each agent can also focus on one slice of the work.&lt;/p&gt;
&lt;h2 id=&#34;when-one-review-agent-is-worth-it&#34;&gt;When one review agent is worth it
&lt;/h2&gt;&lt;p&gt;A review agent is not always worth the cost. It is most useful when the task is risky, broad in impact, or full of edge cases that the main agent could easily miss.&lt;/p&gt;
&lt;p&gt;Cases where a review agent is worth considering include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;changes involving login, payment, permissions, or data deletion;&lt;/li&gt;
&lt;li&gt;multilingual content that affects categories, URLs, or internal links;&lt;/li&gt;
&lt;li&gt;broad refactors that need independent regression review;&lt;/li&gt;
&lt;li&gt;user requests for code review or risk review;&lt;/li&gt;
&lt;li&gt;the main agent has implemented a change and needs a second view on edge cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cases where a review agent is not worth it are also clear: single-file edits, title tweaks, simple front matter fixes, or running one command. The main agent can usually self-check those.&lt;/p&gt;
&lt;h2 id=&#34;when-it-is-not-worth-using-subagents&#34;&gt;When it is not worth using subagents
&lt;/h2&gt;&lt;p&gt;Subagents are often not worth it for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;small single-file edits;&lt;/li&gt;
&lt;li&gt;simple Q&amp;amp;A;&lt;/li&gt;
&lt;li&gt;running one command;&lt;/li&gt;
&lt;li&gt;very small changes;&lt;/li&gt;
&lt;li&gt;tasks that cannot be split clearly;&lt;/li&gt;
&lt;li&gt;tasks where the subagent must repeatedly wait for the main agent to provide context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In these cases, using a subagent mostly adds overhead. The main agent is faster and cheaper.&lt;/p&gt;
&lt;h2 id=&#34;my-default-strategy-prioritize-token-savings-and-add-review-only-for-risk&#34;&gt;My default strategy: prioritize token savings and add review only for risk
&lt;/h2&gt;&lt;p&gt;If the goal is to save tokens, a conservative default strategy works well:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small tasks: do not use subagents.&lt;/li&gt;
&lt;li&gt;Medium tasks: do not use subagents.&lt;/li&gt;
&lt;li&gt;Large batch tasks: still avoid subagents by default unless the user explicitly wants parallel speed.&lt;/li&gt;
&lt;li&gt;High-risk tasks: consider one extra agent for review, trading tokens for stability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This strategy gives up some parallel speed, but it reduces repeated context reading and repeated reasoning.&lt;/p&gt;
&lt;p&gt;If a task is large but not high risk, I would first look for scripts, batch checks, and structured local processing. Multiple agents make more sense when the split is very clear, or when the user explicitly wants parallel speed.&lt;/p&gt;
&lt;h2 id=&#34;a-more-balanced-strategy&#34;&gt;A more balanced strategy
&lt;/h2&gt;&lt;p&gt;If you want to control cost without completely giving up parallelism, a balanced strategy is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;default to the main agent doing the work directly;&lt;/li&gt;
&lt;li&gt;consider subagents only when the task can be clearly split by file or directory;&lt;/li&gt;
&lt;li&gt;each subagent reads only the files it owns;&lt;/li&gt;
&lt;li&gt;do not let multiple agents read the same large files;&lt;/li&gt;
&lt;li&gt;the main agent performs the final review of key fields, test results, and Git diff;&lt;/li&gt;
&lt;li&gt;add one independent review agent only for high-risk tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This avoids parallelism for its own sake. Subagents should serve a clear speed or quality goal, not become the default action.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Subagents and multi-agent workflows usually increase token usage. One subagent may add only a little, but several agents running in parallel can multiply the cost.&lt;/p&gt;
&lt;p&gt;Whether it is worth it depends on the task. If the work can be clearly split, or if the risk is high enough to need independent review, the extra tokens may be justified. For small single-file edits, simple Q&amp;amp;A, or routine checks, it is cheaper to let the main agent handle the task directly.&lt;/p&gt;
&lt;p&gt;In one sentence: &lt;strong&gt;save tokens on small tasks, split only when the work has clear boundaries, and use extra agents for stability only when risk justifies it.&lt;/strong&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Why LLM APIs Charge by Tokens: A Clear Guide to Input, Output, and Context Costs</title>
        <link>https://knightli.com/en/2026/04/25/llm-token-pricing-principles/</link>
        <pubDate>Sat, 25 Apr 2026 08:44:32 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/25/llm-token-pricing-principles/</guid>
        <description>&lt;p&gt;One of the most confusing things about LLM API billing is that almost every platform eventually comes down to one unit: &lt;code&gt;token&lt;/code&gt;. The real question is simple: &lt;strong&gt;why do LLMs charge by token, and why can different tokens have different prices?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For many people who are just starting to use model APIs, the most confusing part is not model capability but the bill. Why does the cost rise so quickly even when you only ask a few questions? Why is input cheaper than output? Why does the bill start growing much faster once context becomes long?&lt;/p&gt;
&lt;p&gt;A simple way to think about it is this: &lt;strong&gt;you are not paying for &amp;ldquo;one answer.&amp;rdquo; You are paying for the compute and bandwidth consumed throughout the whole inference process.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-what-is-a-token&#34;&gt;1. What is a token
&lt;/h2&gt;&lt;p&gt;In LLM billing, a &lt;code&gt;token&lt;/code&gt; is neither a character count nor a word count. It is the unit a model uses when processing text.&lt;/p&gt;
&lt;p&gt;A token might be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A single Chinese character&lt;/li&gt;
&lt;li&gt;Part of an English word&lt;/li&gt;
&lt;li&gt;A punctuation mark&lt;/li&gt;
&lt;li&gt;A short chunk of frequently seen text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why API platforms usually do not charge per sentence or per request. They charge according to how many tokens the model actually reads and generates.&lt;br&gt;
This is much more reasonable than charging by request count, because one request might contain 20 characters, while another might include 200,000 tokens of context. The resource consumption is nowhere near the same.&lt;/p&gt;
&lt;h2 id=&#34;2-why-input-and-output-are-priced-separately&#34;&gt;2. Why input and output are priced separately
&lt;/h2&gt;&lt;p&gt;Most model APIs today split pricing into two parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input token price&lt;/li&gt;
&lt;li&gt;Output token price&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And in many cases, &lt;strong&gt;output tokens cost more than input tokens.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The reason is not hard to understand.&lt;/p&gt;
&lt;p&gt;When a model processes input, it is mainly reading and encoding existing content, and it can handle all of the input tokens together in one pass. But when it generates output, it has to predict the next token, then the next one, then the next one, with each step depending on the tokens before it. This is not just reading. It is a sequential process of inference and sampling, which usually costs more compute per token.&lt;/p&gt;
&lt;p&gt;You can think of it roughly like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Input: handing materials to the model&lt;/li&gt;
&lt;li&gt;Output: asking the model to write the answer on the spot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Writing on the spot usually costs more than reading the materials once, so it is very common for output pricing to be higher.&lt;/p&gt;
&lt;h2 id=&#34;3-why-long-context-makes-costs-easier-to-lose-control-of&#34;&gt;3. Why long context makes costs harder to control
&lt;/h2&gt;&lt;p&gt;Many people think they are only adding a bit more background information, but from the model billing perspective, the impact is often much bigger than expected.&lt;/p&gt;
&lt;p&gt;The reason is that &lt;strong&gt;each model call usually has to process the full context included in that request again.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That means if your request currently contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A system prompt&lt;/li&gt;
&lt;li&gt;Conversation history&lt;/li&gt;
&lt;li&gt;Tool return values&lt;/li&gt;
&lt;li&gt;Long document chunks&lt;/li&gt;
&lt;li&gt;Source code files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;all of that goes into input token billing.&lt;/p&gt;
&lt;p&gt;So what really makes bills grow is often not the final question itself, but the long chain of context attached before it.&lt;br&gt;
As the number of conversation turns increases, tool calls accumulate, and prior messages keep getting fed back in, token cost grows round after round.&lt;/p&gt;
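&lt;p&gt;The round-after-round growth can be sketched in a few lines. This is a simplified model with assumed numbers: it supposes that every request resends the full history and that each round adds a fixed amount of new context. Real conversations grow unevenly.&lt;/p&gt;

```python
def cumulative_input_tokens(base_tokens, added_per_round, rounds):
    """Total input tokens billed when each request resends all prior context."""
    total = 0
    context = base_tokens
    for _ in range(rounds):
        total += context            # the whole context is billed as input again
        context += added_per_round  # new messages and tool results join the history
    return total


# Assumed sizes: a 2k-token starting context that grows by 1k tokens per round.
print(cumulative_input_tokens(2_000, 1_000, 10))  # 65000
# The same 10 rounds if the context never grew.
print(cumulative_input_tokens(2_000, 0, 10))      # 20000
```

&lt;p&gt;In this sketch the final question is no more expensive than the first one on its own. The extra cost comes entirely from the history that is carried along with it.&lt;/p&gt;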
&lt;h2 id=&#34;4-why-tool-calls-are-especially-likely-to-inflate-token-usage&#34;&gt;4. Why tool calls are especially likely to inflate token usage
&lt;/h2&gt;&lt;p&gt;In scenarios like agents, coding assistants, and workflow automation, token usage is often much higher than in ordinary chat.&lt;/p&gt;
&lt;p&gt;The issue is not just that the model wrote a paragraph. It is that the workflow keeps producing content like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading files&lt;/li&gt;
&lt;li&gt;Inspecting logs&lt;/li&gt;
&lt;li&gt;Calling APIs&lt;/li&gt;
&lt;li&gt;Returning JSON&lt;/li&gt;
&lt;li&gt;Feeding tool results back into the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As long as the result of each tool call gets inserted into the next round of context, it becomes a new source of input tokens.&lt;/p&gt;
&lt;p&gt;That is why many developers eventually realize:&lt;br&gt;
&lt;strong&gt;the model&amp;rsquo;s unit price is not always the real problem. The workflow itself may be stacking token cost layer by layer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, imagine a coding agent doing the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the project structure&lt;/li&gt;
&lt;li&gt;Open several source files&lt;/li&gt;
&lt;li&gt;Run a test suite&lt;/li&gt;
&lt;li&gt;Feed the error logs back into the model&lt;/li&gt;
&lt;li&gt;Read more related files&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step can make later requests carry even more context. Even if the unit price does not change, the total bill can rise quickly.&lt;/p&gt;
&lt;h2 id=&#34;5-why-the-same-kind-of-model-can-have-very-different-prices&#34;&gt;5. Why the same kind of model can have very different prices
&lt;/h2&gt;&lt;p&gt;Differences in token pricing between models are not only about vendors wanting to charge more. They are usually tied directly to several factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model size&lt;/li&gt;
&lt;li&gt;Inference efficiency&lt;/li&gt;
&lt;li&gt;Context length&lt;/li&gt;
&lt;li&gt;Deployment cost&lt;/li&gt;
&lt;li&gt;Target market&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The larger the model, the more active parameters it uses, and the more complex its inference path is, the higher the cost of generating one token usually becomes.&lt;br&gt;
If the model also supports ultra-long context, more complex reasoning, or better tool use, the infrastructure pressure increases even more.&lt;/p&gt;
&lt;p&gt;So pricing is really covering several kinds of cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPU or accelerator resources&lt;/li&gt;
&lt;li&gt;VRAM usage&lt;/li&gt;
&lt;li&gt;Inference latency&lt;/li&gt;
&lt;li&gt;Network and service stability&lt;/li&gt;
&lt;li&gt;Peak concurrency capacity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A cheaper model is not necessarily bad, and a more expensive model is not necessarily the right choice for every task. In many cases, the price gap reflects how much infrastructure cost a certain level of capability requires.&lt;/p&gt;
&lt;h2 id=&#34;6-why-cached-input-is-cheaper&#34;&gt;6. Why cached input is cheaper
&lt;/h2&gt;&lt;p&gt;Many model platforms now offer features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cached input&lt;/li&gt;
&lt;li&gt;prompt caching&lt;/li&gt;
&lt;li&gt;prefix caching&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The shared idea behind them is simple: if a large chunk of input has already been processed once, do not keep recomputing it from scratch at full price.&lt;/p&gt;
&lt;p&gt;For example, if you repeatedly send the same system prompt, the same tool instructions, or the same long document prefix, the platform may be able to cache part of that computation. Then even though it is still input token usage, the cached portion can be billed at a lower rate.&lt;/p&gt;
&lt;p&gt;This also explains why many API pricing pages show three or more price tiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Standard input&lt;/li&gt;
&lt;li&gt;Cached input&lt;/li&gt;
&lt;li&gt;Output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The difference is not that the text means different things. It is that the underlying computation may or may not be reusable.&lt;/p&gt;
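&lt;p&gt;A minimal sketch of how the three tiers combine into the cost of one request. The prices below are invented for illustration and do not belong to any real model. Platforms also differ in how they decide which tokens count as cached, so treat &lt;code&gt;cached_tokens&lt;/code&gt; as a number you read from the usage report, not something you compute yourself.&lt;/p&gt;

```python
# Invented prices in dollars per million tokens; check the real pricing page.
PRICES = {"input": 2.0, "cached_input": 0.5, "output": 8.0}


def request_cost(input_tokens, cached_tokens, output_tokens, prices):
    """Dollar cost of one request, billing cached input at its own rate."""
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * prices["input"]
        + cached_tokens * prices["cached_input"]
        + output_tokens * prices["output"]
    )
    return total / 1_000_000


# Same request, first with nothing cached, then with 8k of the 10k input cached.
print(request_cost(10_000, 0, 1_000, PRICES))
print(request_cost(10_000, 8_000, 1_000, PRICES))
```

&lt;p&gt;Under these assumed prices the cached request costs a little over half as much, even though the token counts are identical.&lt;/p&gt;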
&lt;h2 id=&#34;7-why-cheap-tokens-do-not-automatically-mean-lower-total-cost&#34;&gt;7. Why &amp;ldquo;cheap tokens&amp;rdquo; do not automatically mean lower total cost
&lt;/h2&gt;&lt;p&gt;When people see a model advertised as &amp;ldquo;very cheap per million tokens,&amp;rdquo; the first instinct is often that total cost must also be lower. In reality, not always.&lt;/p&gt;
&lt;p&gt;That is because total cost is roughly:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;token unit price × actual token volume&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And actual token volume can be amplified by many things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts that are too long&lt;/li&gt;
&lt;li&gt;Conversation history that is never trimmed&lt;/li&gt;
&lt;li&gt;Too much tool output fed back in&lt;/li&gt;
&lt;li&gt;Overly verbose model output&lt;/li&gt;
&lt;li&gt;Repeated retries for the same task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the real bill is not determined by price alone. It is usually determined by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Model unit price&lt;/li&gt;
&lt;li&gt;Input length per round&lt;/li&gt;
&lt;li&gt;Output length per round&lt;/li&gt;
&lt;li&gt;Number of calls&lt;/li&gt;
&lt;li&gt;Workflow design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is also why a &amp;ldquo;low-cost model&amp;rdquo; can still end up expensive in some agent workflows. It may need more rounds, more supplemental context, and more retry cycles.&lt;/p&gt;
&lt;h2 id=&#34;8-how-developers-should-estimate-token-cost&#34;&gt;8. How developers should estimate token cost
&lt;/h2&gt;&lt;p&gt;If you want better budget control in a real project, a simple way to estimate cost is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure average input tokens per request&lt;/li&gt;
&lt;li&gt;Measure average output tokens per request&lt;/li&gt;
&lt;li&gt;Estimate how many rounds one complete task requires&lt;/li&gt;
&lt;li&gt;Multiply by the model&amp;rsquo;s pricing&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8k tokens&lt;/code&gt; of input per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1k tokens&lt;/code&gt; of output per round&lt;/li&gt;
&lt;li&gt;&lt;code&gt;10&lt;/code&gt; rounds for one task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then what you are really consuming is not &amp;ldquo;one Q&amp;amp;A exchange,&amp;rdquo; but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;About &lt;code&gt;80k tokens&lt;/code&gt; of input&lt;/li&gt;
&lt;li&gt;About &lt;code&gt;10k tokens&lt;/code&gt; of output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And if logs, tool results, and file contents keep being added along the way, the total grows even further.&lt;/p&gt;
&lt;p&gt;That is why budget planning should not only look at a single round. It should look at &lt;strong&gt;how many tokens a full task loop will consume from start to finish.&lt;/strong&gt;&lt;/p&gt;
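&lt;p&gt;The estimate above can be written as a short calculation. The token figures are the ones from the example. The two prices are placeholders, so substitute the rates from your provider before relying on the result.&lt;/p&gt;

```python
INPUT_TOKENS_PER_ROUND = 8_000
OUTPUT_TOKENS_PER_ROUND = 1_000
ROUNDS_PER_TASK = 10

# Placeholder prices in dollars per million tokens (not real rates).
INPUT_PRICE = 2.0
OUTPUT_PRICE = 8.0

total_input = INPUT_TOKENS_PER_ROUND * ROUNDS_PER_TASK    # 80_000
total_output = OUTPUT_TOKENS_PER_ROUND * ROUNDS_PER_TASK  # 10_000

cost = (total_input * INPUT_PRICE + total_output * OUTPUT_PRICE) / 1_000_000

print(total_input, total_output, cost)
```

&lt;p&gt;This assumes a flat 8k of input per round. If context accumulates during the task, measure the later rounds as well, because the average will be higher than the first round suggests.&lt;/p&gt;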
&lt;h2 id=&#34;9-how-to-control-the-bill-in-practice&#34;&gt;9. How to control the bill in practice
&lt;/h2&gt;&lt;p&gt;If you are already using APIs or agents, the following methods are usually the most effective:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Shorten the system prompt and cut repeated wording&lt;/li&gt;
&lt;li&gt;Trim old conversation history regularly&lt;/li&gt;
&lt;li&gt;Keep only necessary fields from tool outputs&lt;/li&gt;
&lt;li&gt;Retrieve first, then send only relevant parts of long documents&lt;/li&gt;
&lt;li&gt;Limit output length and avoid unbounded expansion&lt;/li&gt;
&lt;li&gt;Use expensive models for high-value tasks and cheaper ones for lower-value tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In many cases, the best way to save money is not to switch blindly to a cheaper model. It is to remove unnecessary token consumption from the workflow first.&lt;/p&gt;
&lt;h2 id=&#34;10-how-to-think-about-all-of-this&#34;&gt;10. How to think about all of this
&lt;/h2&gt;&lt;p&gt;At the end of the day, token pricing is a way of charging for how much the model had to read, infer, and write.&lt;/p&gt;
&lt;p&gt;It is not like traditional software pricing, where per-account, per-request, or monthly billing is enough to describe resource use. A model call is a dynamic computation process. The amount of context you send, the tools you invoke, and the output length you request all directly affect cost.&lt;/p&gt;
&lt;p&gt;So the most important thing is not memorizing price tables. It is building the right intuition:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long context increases input cost&lt;/li&gt;
&lt;li&gt;Long output increases generation cost&lt;/li&gt;
&lt;li&gt;Tool chains amplify total token usage&lt;/li&gt;
&lt;li&gt;Caching and workflow design can change the bill significantly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once those points are clear, the pricing structure of most LLM APIs becomes much easier to understand.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>AI Terms Explained: Agent, MCP, RAG, and Token in Plain Language</title>
        <link>https://knightli.com/en/2026/04/23/ai-terms-agent-mcp-rag-token-explained/</link>
        <pubDate>Thu, 23 Apr 2026 13:13:40 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/ai-terms-agent-mcp-rag-token-explained/</guid>
        <description>&lt;p&gt;When people first get into AI, what pushes them away is often not the models themselves, but the long list of terms that keeps showing up in every discussion. &lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;MCP&lt;/code&gt;, &lt;code&gt;RAG&lt;/code&gt;, &lt;code&gt;AIGC&lt;/code&gt;, and &lt;code&gt;Token&lt;/code&gt; all look familiar, but without a simple explanation, many people only recognize the words without really understanding them.&lt;/p&gt;
&lt;p&gt;This article follows a common beginner-friendly line of explanation and condenses 10 high-frequency AI terms into a set of meanings that is easier to remember. The goal is not to sound academic. It is to help you build a basic mental model that lets you follow everyday AI conversations.&lt;/p&gt;
&lt;h2 id=&#34;10-common-ai-terms-and-what-they-mean&#34;&gt;10 common AI terms and what they mean
&lt;/h2&gt;&lt;h3 id=&#34;1-agent-an-ai-that-does-more-than-chat&#34;&gt;1. Agent: an AI that does more than chat
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;Agent&lt;/code&gt; can be understood as an AI assistant that actually gets work done.&lt;/p&gt;
&lt;p&gt;A normal chatbot usually works in a simple question-and-answer pattern. An &lt;code&gt;Agent&lt;/code&gt; goes a step further. It can break a task into steps, arrange a process, call tools, and return a finished result. If you ask it to organize materials, look something up, or generate a document, it may do more than give advice. It may actually chain those actions together and complete them.&lt;/p&gt;
&lt;p&gt;That is why the key point of an &lt;code&gt;Agent&lt;/code&gt; is not whether it can talk, but whether it can act.&lt;/p&gt;
&lt;h3 id=&#34;2-openclaw-an-ai-assistant-that-stays-on-your-computer&#34;&gt;2. OpenClaw: an AI assistant that stays on your computer
&lt;/h3&gt;&lt;p&gt;Here, &lt;code&gt;OpenClaw&lt;/code&gt; is described as a kind of AI assistant that lives on your computer.&lt;/p&gt;
&lt;p&gt;You can think of this type of tool as a more desktop-oriented AI helper. It does not only receive text. It may also observe the interface, call local tools, and execute tasks step by step. Compared with a normal web chat interface, this kind of tool emphasizes operational ability much more.&lt;/p&gt;
&lt;p&gt;If &lt;code&gt;Agent&lt;/code&gt; is the abstract idea of an execution-oriented AI, this kind of desktop assistant is a more concrete personal-computer version of that idea.&lt;/p&gt;
&lt;h3 id=&#34;3-skills-capability-packs-added-to-an-agent&#34;&gt;3. Skills: capability packs added to an Agent
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;Skills&lt;/code&gt; can be understood as functional modules or operating instructions for an &lt;code&gt;Agent&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The same &lt;code&gt;Agent&lt;/code&gt; can behave very differently depending on which &lt;code&gt;Skills&lt;/code&gt; it has. Some may focus on copywriting, some on data organization, and some on code-related work. They are a bit like apps on a phone, and a bit like reusable workflows.&lt;/p&gt;
&lt;p&gt;So in many cases, it is not that the model suddenly became smarter. It is that a clearer set of rules, tools, and steps was added behind it.&lt;/p&gt;
&lt;h3 id=&#34;4-mcp-a-unified-way-for-ai-to-connect-to-tools&#34;&gt;4. MCP: a unified way for AI to connect to tools
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;MCP&lt;/code&gt; stands for &lt;code&gt;Model Context Protocol&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In everyday terms, it is a bit like a &lt;code&gt;USB Type-C&lt;/code&gt; connector for the AI world. In the past, connecting a model to different tools often meant building separate integrations one by one. With a unified protocol, the way those tools connect becomes more standardized and easier to reuse.&lt;/p&gt;
&lt;p&gt;For most users, the most important thing to remember is this: &lt;code&gt;MCP&lt;/code&gt; is not about whether a model can answer a question. It is about how a model can connect to external tools and resources in a safe and stable way.&lt;/p&gt;
&lt;h3 id=&#34;5-gacha-ai-output-is-inherently-random&#34;&gt;5. Gacha: AI output is inherently random
&lt;/h3&gt;&lt;p&gt;The term &amp;ldquo;gacha&amp;rdquo; often appears in &lt;code&gt;AI&lt;/code&gt; image generation, video generation, and creative work.&lt;/p&gt;
&lt;p&gt;The idea is simple. Even with the same prompt and the same general direction, the result can still be different each time. Sometimes the output is great. Sometimes it falls apart. That is why people compare repeated generation attempts to pulling gacha in a game.&lt;/p&gt;
&lt;p&gt;What this really reminds us is that AI generation is not a fixed formula. It is a probabilistic process with variation.&lt;/p&gt;
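&lt;p&gt;The variation can be imitated with a toy sketch. It assumes the &amp;ldquo;model&amp;rdquo; is just a fixed probability table; &lt;code&gt;NEXT_WORD_PROBS&lt;/code&gt;, &lt;code&gt;generate&lt;/code&gt;, and &lt;code&gt;pull_many&lt;/code&gt; are hypothetical names chosen for illustration, and real generators are far more complex.&lt;/p&gt;

```python
import random

# Toy illustration of sampling. The probability table is an assumption made
# for this sketch; it is not how a real model stores what it knows.
NEXT_WORD_PROBS = {"sunny": 0.6, "cloudy": 0.3, "stormy": 0.1}


def generate(rng):
    # Draw one word according to its probability, like one generation attempt.
    choices = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(choices, weights=weights, k=1)[0]


def pull_many(seed=None, n=5):
    # seed=None lets Python pick a fresh seed, so results vary between calls.
    rng = random.Random(seed)
    return [generate(rng) for _ in range(n)]


print(pull_many())         # same "prompt", but the list varies from run to run
print(pull_many(seed=42))  # a fixed seed makes this toy example repeatable
```

&lt;p&gt;In this sketch, every call without a seed is another pull. Fixing the seed only makes the toy repeatable; it does not make any single result better.&lt;/p&gt;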
&lt;h3 id=&#34;6-api-the-connection-between-an-app-and-a-model&#34;&gt;6. API: the connection between an app and a model
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;API&lt;/code&gt; stands for &lt;code&gt;Application Programming Interface&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can think of it as the standard entry point through which programs communicate. When you call a model service from your own app, script, or editor, you are essentially using an &lt;code&gt;API&lt;/code&gt; to send a request and receive a result.&lt;/p&gt;
&lt;p&gt;If you compare a model service to a restaurant, then:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the menu is like the &lt;code&gt;API&lt;/code&gt; documentation&lt;/li&gt;
&lt;li&gt;placing an order is like making an &lt;code&gt;API&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;the kitchen sending out the dish is like the model returning a result&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why many tools may look different on the surface while still calling some form of &lt;code&gt;API&lt;/code&gt; underneath.&lt;/p&gt;
&lt;h3 id=&#34;7-multimodality-ai-handles-more-than-text&#34;&gt;7. Multimodality: AI handles more than text
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;Multimodality&lt;/code&gt; means AI no longer only reads and writes text. It can process multiple kinds of input and output.&lt;/p&gt;
&lt;p&gt;For example, it may be able to read images, understand voice, interpret video, generate pictures, or even support real-time voice and video interaction. Compared with early text-only models, multimodal models are much closer to having the combined abilities to see, hear, speak, and write.&lt;/p&gt;
&lt;p&gt;That is also why many AI products are no longer centered around a single text box.&lt;/p&gt;
&lt;h3 id=&#34;8-rag-retrieve-information-first-then-generate-an-answer&#34;&gt;8. RAG: retrieve information first, then generate an answer
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;RAG&lt;/code&gt; stands for &lt;code&gt;Retrieval-Augmented Generation&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is useful for solving a practical problem: a model&amp;rsquo;s training data has a time boundary, and it does not automatically know your company&amp;rsquo;s newest documents, customer-service records, or business rules. The idea behind &lt;code&gt;RAG&lt;/code&gt; is to retrieve relevant material from specified sources first, and then generate an answer based on that material.&lt;/p&gt;
&lt;p&gt;Its value usually shows up in three ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;answers are more likely to stay close to real source material&lt;/li&gt;
&lt;li&gt;you can trace where the answer came from&lt;/li&gt;
&lt;li&gt;new documents can be added and reflected quickly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why many enterprise knowledge bases, AI customer-service systems, and internal Q&amp;amp;A tools rely on &lt;code&gt;RAG&lt;/code&gt;.&lt;/p&gt;
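&lt;p&gt;The retrieve-then-generate idea can be sketched in a few lines. This is a minimal illustration that assumes a tiny in-memory document set and simple word overlap as the retrieval step; &lt;code&gt;DOCS&lt;/code&gt;, &lt;code&gt;retrieve&lt;/code&gt;, and &lt;code&gt;build_prompt&lt;/code&gt; are hypothetical names, and the final model call is left out.&lt;/p&gt;

```python
# Toy RAG sketch: retrieve first, then build the prompt a model would answer from.
# DOCS, retrieve, and build_prompt are hypothetical names, not a real library API.
DOCS = {
    "refund-policy": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 5 business days.",
    "warranty": "Hardware is covered by a 12 month warranty.",
}


def words(text):
    # Lowercase and strip basic punctuation so matching is forgiving.
    return set(text.lower().replace(".", " ").replace("?", " ").split())


def retrieve(question, top_k=1):
    # Rank documents by how many words they share with the question.
    q = words(question)
    ranked = sorted(
        DOCS.items(),
        key=lambda item: len(q.intersection(words(item[1]))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question):
    # Put the retrieved text, tagged with its source id, ahead of the question.
    sources = retrieve(question)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("How many days do I have for refunds?")
print(prompt)
```

&lt;p&gt;The source id in brackets is what makes an answer traceable, and adding an entry to &lt;code&gt;DOCS&lt;/code&gt; makes it retrievable right away. Real systems typically replace the word-overlap step with a search index or vector similarity.&lt;/p&gt;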
&lt;h3 id=&#34;9-aigc-the-general-term-for-ai-generated-content&#34;&gt;9. AIGC: the general term for AI-generated content
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;AIGC&lt;/code&gt; stands for &lt;code&gt;AI-Generated Content&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is not a single tool. It is a broad label for content produced by AI, including text, images, audio, video, and more. AI writing, AI illustration, AI short-form video generation, and AI voice synthesis all fit under the umbrella of &lt;code&gt;AIGC&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;What matters most about this term is that it describes a way of producing content, not one specific model.&lt;/p&gt;
&lt;h3 id=&#34;10-token-the-unit-used-to-measure-model-processing&#34;&gt;10. Token: the unit used to measure model processing
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;Token&lt;/code&gt; can be understood as the basic unit a model uses to process text.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;Token&lt;/code&gt; is not exactly the same as one character or one word; it is often a fragment of a word. In practice, you can treat it as the common unit for model computation and billing. Your input consumes &lt;code&gt;Token&lt;/code&gt;, the model&amp;rsquo;s output consumes &lt;code&gt;Token&lt;/code&gt;, and earlier conversation carried along as context counts as &lt;code&gt;Token&lt;/code&gt; too.&lt;/p&gt;
&lt;p&gt;That is why model services keep talking about context length, cost control, and prompt compression. At the core, all of those topics are tied to &lt;code&gt;Token&lt;/code&gt;.&lt;/p&gt;
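&lt;p&gt;A back-of-the-envelope estimate shows how these pieces add up. The sketch below assumes roughly four characters per token, a service that resends earlier turns as input, and made-up prices; none of these numbers come from a real provider, and &lt;code&gt;estimate_tokens&lt;/code&gt; and &lt;code&gt;estimate_cost&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```python
# Rough token and cost estimate. The 4-characters-per-token ratio and the
# prices are illustrative assumptions, not any provider's real figures.
CHARS_PER_TOKEN = 4
PRICE_PER_1K_INPUT = 0.50   # hypothetical price per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 1.50  # hypothetical price per 1,000 output tokens


def estimate_tokens(text):
    # Ceiling division: a partial chunk still counts as a whole token.
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_cost(history, new_message, expected_output_tokens):
    # Assumption: earlier turns are resent as context, so they count as input.
    input_tokens = sum(estimate_tokens(t) for t in history) + estimate_tokens(new_message)
    cost = (
        input_tokens * PRICE_PER_1K_INPUT
        + expected_output_tokens * PRICE_PER_1K_OUTPUT
    ) / 1000
    return input_tokens, cost


history = ["a" * 400, "b" * 200]  # two earlier turns, 400 and 200 characters
input_tokens, cost = estimate_cost(history, "c" * 40, expected_output_tokens=100)
print(input_tokens, round(cost, 2))
```

&lt;p&gt;In this example most of the input comes from the earlier turns rather than the new message, which is why trimming or compressing context is a common way to control cost. For real numbers, use the tokenizer and price list of the service you actually call.&lt;/p&gt;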
</description>
        </item>
        
    </channel>
</rss>
