AI Agent on KnightLi Blog

What Is OpenAI Symphony? Codex Orchestration, Issue-Driven Development, and AI Agent Workflows

Mon, 25 May 2026 00:17:32 +0800

OpenAI recently open-sourced an interesting Codex orchestration specification: Symphony.

It is not another chat-based coding assistant, nor is it a complete new IDE. More precisely, Symphony is a way to orchestrate work around Codex: it turns an issue tracker similar to Linear into the control plane for coding agents, so every open task can correspond to a continuously running Agent.

One line from the official article captures its direction well: in the past, engineers had to monitor multiple Codex sessions at once, continually assigning work, reviewing output, correcting course, and restarting sessions. Symphony is designed to address exactly that context-switching bottleneck.

Symphony is not solving code writing, but Agent management

A single Codex session works well for interactive development: you give it a task, it changes the code, you review it, and then you keep asking follow-up questions. But once a team starts using multiple Agents at the same time, the problem shifts from “can the code be written?” to “who is working on what, how far along is it, and who takes over after a failure?”

OpenAI’s approach is to move the center of gravity from “sessions” to “tasks”:

the issue is the real unit of work;
every open issue can map to an independent Agent workspace;
Symphony continuously polls the task board and decides which tasks should be started, retried, stopped, or reclaimed;
Codex performs implementation, testing, commits, PR creation, status updates, and related actions inside the workspace;
humans no longer micromanage every session, but instead review results, adjust goals, and maintain boundaries.

The shift behind this is important: an Agent is no longer just a tool that humans temporarily summon, but a continuously running kind of executor inside the development workflow.

Why an issue tracker?

Because teams already use issue trackers to manage real work.

Requirements, bugs, refactors, migrations, research, priorities, blockers, owners, and milestones are already recorded in Linear, GitHub Issues, or similar systems. Symphony does not reinvent a large console. Instead, it treats these existing systems as the task entry point for Agents.

This has several advantages:

Work does not need to be copied from an issue into a chat window.
Humans can keep creating, splitting, scheduling, and closing tasks in familiar ways.
Agent state changes can be written back to the same work system, making async collaboration easier for the team.
Task dependencies can naturally form a DAG, allowing unblocked tasks to move forward in parallel.

If traditional CI is “automation after a code commit,” Symphony is closer to “automation after an issue is created.”

Its core workflow

A typical Symphony flow can be understood as:

创建 issue
  -> Symphony 轮询到可执行任务
  -> 为该 issue 创建独立 workspace
  -> 启动 Codex agent session
  -> Agent 阅读任务、修改代码、运行测试
  -> 创建或更新 PR
  -> 写回任务状态、评论、证据和交付物
  -> 人类 review、合并或要求修改

The official specification also emphasizes several engineering details:

each issue uses an independent workspace to reduce cross-contamination;
the orchestrator maintains retry, concurrency, and recovery state;
workflow policy lives in the repository’s WORKFLOW.md, so teams can version the rules that describe how Agents should handle tasks;
implementations need to preserve observability, with at least structured logs;
a successful state does not have to be Done; it can also be an intermediate state handed to humans for review.

This shows that Symphony is not simply about “letting AI write code automatically.” It defines a runnable, recoverable, and auditable Agent work system.

Goal-driven, not a rigid state machine

OpenAI mentions an important shift in the article: early on, they tried hard-coding many actions in the outer harness, such as committing code, running tests, and handling GitHub workflows. But as Codex became more capable, that approach started to constrain the Agent.

The later direction was to give the Agent a goal, rather than encoding every step as a fixed state transition.

For example, a task’s goal might be “complete the Vite migration and ensure CI passes.” The Agent can decide for itself whether it needs to change configuration, fix tests, read CI logs, handle review feedback, or even create new follow-up issues. Symphony provides boundaries, context, and the runtime framework instead of prescribing every action for the Agent.

This is also where it differs from traditional automation scripts: scripts are good at repeated, deterministic processes; Symphony is aimed at engineering tasks with uncertainty.

How is this different from normal Codex usage?

A normal Codex session is more like “a human writes code with AI”:

the human opens a session;
the human describes the task;
the human watches the output;
the human corrects course at any time;
after one task ends, the human starts the next session.

Symphony is more like “a team hands a task pool to a group of Agents”:

humans write clear issues;
the system continuously discovers executable tasks;
Agents make progress in isolated environments;
results come back as PRs, comments, test status, videos, or analysis reports;
humans review at key checkpoints.

This is not about replacing engineers. It is about freeing engineers from the burden of simultaneously watching many sessions. OpenAI notes in the official article that some teams saw a significant increase in PRs merged to the main branch. But the more important point is the change in working style: the startup cost of trying an idea, launching a refactor, or validating a hypothesis becomes lower.

Where does it fit?

Symphony is better suited to tasks such as:

routine feature implementation;
small refactors in an existing codebase;
infrastructure migrations;
dependency upgrades;
filling in tests;
CI fixes;
research followed by an implementation plan;
continuing to revise a PR based on review feedback.

It is not necessarily a good fit for highly ambiguous tasks that require strong business judgment or architectural decisions. For those problems, an interactive Codex session is still more natural because humans need to stay involved throughout the process.

Risks and boundaries

Symphony is appealing, but in real adoption, teams cannot look only at the “automation” side.

Several boundaries need to be made clear in advance:

issues must be written clearly, otherwise Agents will amplify vague requirements into incorrect implementations;
Agent permissions should be constrained, especially access to repositories, secrets, production environments, and third-party services;
every workspace should be isolated to avoid contamination between tasks;
CI, tests, lint, and review remain necessary quality gates;
task status, PR links, logs, and failure reasons need to be traceable;
human review cannot be skipped, especially for changes involving security, billing, data migration, and permission logic.

The official repository also positions Symphony as an engineering preview and reference implementation for trusted environments, not a finished platform that can blindly replace a development process.

My understanding of Symphony

The most valuable part of Symphony is not that it uses Linear, nor that the reference implementation chose Elixir. Its value is that it redefines the entry point for programming Agents.

In the past, we were used to starting AI coding from a chat window. That is flexible, but once the scale grows, human attention becomes the bottleneck. Symphony puts the entry point back in the issue tracker and lets Agents work continuously around real tasks. In that sense, AI coding starts moving from a “personal productivity tool” toward “team workflow infrastructure.”

If you are already using Codex, Claude Code, Cursor Agent, or similar tools, the most important thing to notice about Symphony is not any specific implementation, but the pattern behind it:

Do not only manage Agent sessions. Manage the work that needs to be done.

This may become a key dividing line for the next stage of AI coding tools.

References

How browser-harness domain skills keep AI agents from repeating browser automation mistakes

Sun, 24 May 2026 23:43:35 +0800

The most interesting part of browser-use/browser-harness is not only that it lets AI agents control real Chrome. It also turns web-operation experience into reusable domain skills.

That matters because browser automation is rarely difficult only because of clicking buttons. Each website has its own details:

Which pages require login.
Which data can be fetched directly through an API.
Which buttons do not respond to normal DOM clicks.
Which iframes, shadow DOM components, or popups block the flow.
Which selectors are stable and which are temporary classes.
Which actions involve accounts, payments, or irreversible changes and require human confirmation.

If this experience only stays in one task log, the agent will hit the same problems again next time. domain skills are meant to preserve that experience so the agent does not start from zero every time it opens a site.

What domain skills are

You can think of domain skills as site-operation manuals for agents.

They are not ordinary user documentation, and they are not one-off scripts. They are closer to field-tested site knowledge:

Whether the site is suitable for browser automation.
Which API should be used first if an API exists.
Which URL should be used when the browser is necessary.
Which DOM structures, aria-labels, and button behaviors have been verified.
Which common approaches fail.
Which scenarios should stop and ask for human intervention.

This content can be reviewed by humans and read by agents during tasks. It turns on-the-spot exploration into maintainable experience.

A good browser agent should not turn every problem into opening a webpage, looking at screenshots, and clicking buttons.

One important kind of experience in domain skills tells the agent when not to use the browser.

For sites such as ArXiv, paper search, metadata, and abstracts can be fetched directly through the Atom API or HTML meta tags. HTTP requests are usually faster, more stable, and easier to parse than opening a browser.

GitHub follows a similar pattern. Repository, user, and release data should use the REST API first. File contents should use raw.githubusercontent.com first. Only pages such as GitHub Trending, which do not have an equivalent API, need browser interaction.

This shows that browser-harness is not based on “the browser solves everything.” It puts the browser in the right place: when APIs, HTTP, and static pages cannot solve the problem, let the agent operate a real page.

They store site-level knowledge

Traditional automation scripts are usually written around one task, for example:

`1`	`Open page -> enter keyword -> click button -> download file`

That script may complete the task, but the experience is scattered inside code. When the site changes, the script may fail. When the task changes, much of the experience may not be reusable.

domain skills are closer to a site-level knowledge base. They care about:

Which container selector is stable in Amazon search results.
Which GitHub data should go through the REST API.
How LinkedIn invitation buttons differ in aria-label.
Which Shopify Admin pages are embedded apps.
Why Shopify Polaris inputs cannot always be filled with normal JS value assignment.
How Browser Use Cloud browser instances are created, listed, and cleaned up.

These are not steps for one task. They are decision-making knowledge that many future tasks can reuse.

Example: Amazon product search

For Amazon product search, the important part is not only how to search, but which path is more stable.

A more reliable approach is to use a direct search URL instead of opening the homepage and simulating typing every time. Search results can be extracted from a container such as [data-component-type="s-search-result"]. Field extraction also has details: title, price, rating, review count, and sponsored status each have more stable DOM sources.

This kind of experience is valuable for an agent. Without it, the agent may guess buttons from screenshots and repeatedly try selectors. With it, the agent can go directly to a more stable extraction path.

More importantly, a skill can record traps. For example, some selectors that look usable may misread sponsored results or cross-sell areas. You only learn that from field testing.

Example: LinkedIn invitation management

LinkedIn is closer to a real account workflow, and the risk is higher.

On the invitation manager page, the Accept and Ignore buttons use different aria-label formats. You cannot simply derive one from the other. Some invitation cards even render Accept as an <a> element rather than a <button>, and ordinary CDP clicks may not trigger the accept action.

This shows that real web automation does not end when an element is located. Button labels, event binding, soft navigation, and component implementation all affect whether an action really works.

For an agent, this experience also has a safety meaning. Operations involving social accounts, invitations, messages, and posting should not be fully delegated. A skill can record the path and traps, but accepting invitations in bulk, sending content externally, or changing account details should keep human confirmation.

Example: Shopify Admin

Shopify Admin shows another issue: backend systems are often not one page, but a combination of embedded apps and complex components.

Many Shopify apps run inside iframes. Polaris React inputs, Web Components, and embedded apps all behave differently. Some inputs cannot be filled with element.value = ...; they need CDP keystrokes that are closer to real keyboard input.

The value of this kind of skill is that it lets the agent first identify what kind of UI it is looking at, then choose the right operation method.

Shopify experience also emphasizes “do not use the browser if you do not have to”:

For read-only product and inventory data, use the Storefront API first.
If an Admin API token exists, use the Admin API first.
For theme code editing, use Shopify CLI first.
Use the browser only when there is no API, the change is rare, or you are exploring the admin.

That is a mature tool-selection logic for agents.

Example: Browser Use Cloud

domain skills do not only serve webpage clicking. They can also record API experience around browser runtimes.

Browser Use Cloud experience can record how to create cloud browsers through REST APIs, list running browsers, clean up zombie browsers, and obtain liveUrl and cdpUrl.

This means a skill is not limited to “how to click a button.” Any recurring task with a stable method can become a skill:

API call patterns.
Authentication header format.
Request and response structure.
Verified status codes.
Common failure modes.
Resource cleanup and recycling methods.

For agents, all of these are reusable capabilities.

Why this is more reliable than ad-hoc reasoning

Many people expect a large model to understand the webpage by itself every time. In real tasks, relying only on ad-hoc reasoning is unstable.

The reasons are simple:

Web UI changes often.
The same button may have multiple implementations.
Visible does not mean clickable.
Clickable does not mean the action really worked.
Some tasks should use APIs instead of browsers.
Some operations require human confirmation and should not be decided by the model alone.

Writing these experiences into files brings several benefits:

Humans can review them.
Wrong experience can be corrected.
Site knowledge can accumulate over time.
New agents can inherit old experience.
Temporary task discoveries can become long-term knowledge.

This is more stable than putting everything into a prompt or chat context.

How teams can use it

In a team, domain skills can become a lightweight automation knowledge base.

Useful content to record includes:

Post-login paths in internal systems.
Report export flows.
Common popup handling.
Which buttons require human confirmation.
Which pages have API alternatives.
Which selectors were tested and found reliable.
Which tasks agents are not allowed to run automatically.

This knowledge does not need to be complete at the beginning. A practical path is to start with low-risk, frequent, reversible workflows: read-only tasks, downloads, organization, and checks. Once the flow is stable, turn the experience into a skill.

For team managers, skill files also make automation boundaries visible. You can inspect what the agent knows, what it can do, and where it should stop.

Boundaries to keep

domain skills can improve an agent’s success rate, but they should not fully automate high-risk operations.

Several boundaries matter:

Do not record passwords, Cookie, token, customer data, or sensitive internal URLs.
Keep human confirmation for payments, deletion, bulk submission, account changes, and external publishing.
Record verification date and scope.
Allow skills to expire after site changes and require revalidation.
Do not make bypassing risk controls or platform limits a goal.

In other words, domain skills make agents steadier. They do not give agents unlimited permission.

Conclusion

The domain skills mechanism in browser-harness shows one thing: AI browser automation cannot rely only on the model improvising at runtime.

A usable browser agent needs at least three layers:

Low-level control: screenshots, clicks, input, downloads, CDP, HTTP.
Site-level knowledge: API priority, stable selectors, component traps, login boundaries.
Human safety rules: do not give credentials to the model, confirm high-risk actions, and do not write sensitive information into skills.

domain skills fill the second layer. They let an agent enter a web task with verified experience instead of rediscovering everything every time.

References:

browser-harness domain skills: https://github.com/browser-use/browser-harness/tree/main/agent-workspace/domain-skills
Amazon product-search skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/amazon/product-search.md
ArXiv scraping skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/arxiv/scraping.md
GitHub scraping skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/github/scraping.md
LinkedIn invitation-manager skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/linkedin/invitation-manager.md
Shopify admin skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/shopify-admin/README.md
Browser Use Cloud skill: https://github.com/browser-use/browser-harness/blob/main/agent-workspace/domain-skills/browser-use-cloud/cloud.md

browser-harness, Playwright, and Puppeteer: which browser automation tool should you choose?

Sun, 24 May 2026 17:51:28 +0800

In browser automation and automated testing, Playwright and Puppeteer are two of the most commonly compared tools. Both can control browsers, click pages, extract content, generate screenshots or PDFs, and both are closely related to Chrome DevTools Protocol.

Once browser-use/browser-harness is added to the picture, the question is no longer simply “which testing framework is stronger.” It becomes a comparison between two kinds of tools:

Playwright / Puppeteer: tools for engineers to write deterministic scripts.
browser-harness: a tool for AI agents to operate real browsers.

The first group fits testing, scraping, and engineered automation. The second is closer to a browser control layer for agents such as Claude Code, Codex CLI, and Gemini.

The relationship between Playwright and Puppeteer

Puppeteer was originally launched by the Google Chrome team and naturally focuses on Chromium and Chrome automation. Its API is concise, the ecosystem is mature, and it is especially convenient for screenshots, PDF generation, page scraping, and lightweight automation around Chrome.

Playwright is maintained by Microsoft, and its team has deep historical links to early Puppeteer work. It absorbed many lessons from Puppeteer and added stronger cross-browser support, auto-waiting, context isolation, test reports, and debugging tools.

In short:

If you only need lightweight Chrome-based tasks, Puppeteer is still very pleasant to use.
If you are doing cross-browser E2E tests, complex SPA automation, or team-level test engineering, Playwright is usually the better fit.

Core differences

Dimension	Puppeteer	Playwright
Maintainer	Google	Microsoft
Browser support	Mainly Chrome / Chromium	Chromium, Firefox, WebKit
Language support	Mainly JavaScript / TypeScript	JavaScript / TypeScript, Python, Java, .NET
Auto-waiting	More explicit waiting	Strong Locator and auto-waiting
Context isolation	Supported, but less central	Excellent BrowserContext workflow
Tooling	Simple, mature, foundational	Codegen, Trace Viewer, reports
Typical use	Chrome automation, screenshots, PDF, lightweight scraping	Cross-browser E2E tests, complex frontend automation

Browser support

Puppeteer is strongest with Chrome. It integrates tightly with Chromium. If your goal is to control Chrome, generate PDFs, take screenshots, or run simple scraping tasks, Puppeteer has a low mental overhead.

Playwright is stronger for cross-browser work. It natively supports Chromium, Firefox, and WebKit. WebKit is especially important because many Safari-related issues cannot be detected through Chrome alone. For applications that need coverage across desktop, mobile, and multiple browser engines, Playwright is the better main tool.

This is the first decision boundary: if you only care about Chrome, Puppeteer is fine. If you are serious about cross-browser testing, choose Playwright first.

Auto-waiting and stability

The most annoying part of browser automation is often not “how to click,” but whether the page is ready. An element may not be attached to the DOM, may be covered, may still be animating, or may still be disabled.

In Puppeteer, you often write:

1
2

await page.waitForSelector('#submit-btn');
await page.click('#submit-btn');

This works, but engineers must think through the waiting logic themselves. The more complex the page, the more likely the script will accumulate waitForSelector, waitForTimeout, and retry logic.

Playwright’s Locator and auto-waiting mechanism is more complete:

`1`	`await page.locator('#submit-btn').click();`

Before clicking, Playwright checks whether the element is visible, actionable, stable, and not covered, then retries within a reasonable time. This matters a lot for modern React, Vue, and Next.js applications with heavy asynchronous rendering, and it reduces flaky tests.

Multi-account workflows and context isolation

If you need to simulate multiple users, or let many tasks share one browser process while isolating Cookie, LocalStorage, and Session, BrowserContext matters.

Puppeteer also supports context isolation, but Playwright makes it a core capability. You can quickly create multiple independent contexts inside one browser instance. Each context behaves like a clean browser environment without repeatedly starting full browser processes.

This is useful for:

Multi-account concurrent tests.
Multi-role workflow tests.
Ecommerce, messaging, and collaborative document scenarios.
Scraping tasks that need isolated Cookie and login state.

Tooling differences

Playwright is the more engineering-oriented option. It includes many tools used in test development:

codegen: operate on a webpage and generate scripts automatically.
Trace Viewer: replay screenshots, DOM snapshots, network requests, and console logs after failures.
Test Runner: assertions, parallelism, retries, reports, and project matrices.
Locator: element selection by text, role, label, test id, and CSS.

Puppeteer is more like a lightweight browser control library. It is not bloated, its API is direct, and it is easy to embed in scripts, server-side jobs, and custom automation flows.

If you are building an enterprise-grade test system, Playwright’s tooling saves a lot of work. If you only need a Node.js script to convert webpages to PDFs or take scheduled screenshots, Puppeteer may be cleaner.

Where browser-harness fits

browser-harness is not the same kind of tool as Playwright or Puppeteer.

Playwright and Puppeteer mostly assume that humans write scripts. Engineers choose selectors, waiting conditions, assertions, and exception handling. They pursue determinism: the same script should produce the same result under the same page state.

browser-harness mostly assumes that an AI agent operates the browser. Its goal is not to provide a huge high-level API, but to connect to real Chrome through CDP and expose screenshots, coordinate clicks, DOM, network requests, and helpers to the agent. The agent can observe the page, decide the next step, add helpers when capabilities are missing, and turn site experience into skills.

That makes it better for open-ended tasks:

Log in to a backend and download invoices.
Fill a group of forms in an internal system.
Handle OA or SaaS pages that change often.
Explore a page according to a user goal instead of running a fixed script.
Give tools such as Claude Code and Codex CLI browser operation capability.

Three-way comparison

Dimension	Puppeteer	Playwright	browser-harness
Target user	Engineers	Engineers and test teams	AI Agent
Main goal	Control Chrome	Stable cross-browser automation	Let agents operate real browsers
Script style	Hand-written JS/TS automation	Scripts plus test framework	User gives a goal, agent executes steps
Element targeting	CSS, XPath, DOM API	Locator, text, role, CSS	Screenshots, coordinates, DOM, CDP
Waiting	More manual control	Strong auto-waiting	Agent observes and adjusts
Browser environment	Usually automated browser	Usually test browser	Often real Chrome
Best fit	Chrome scripts, screenshots, PDF, lightweight scraping	E2E tests, cross-browser validation, complex SPA	AI assistants, open web tasks, real-account workflows

Code feel

Puppeteer feels closer to directly controlling Chrome:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  await page.waitForSelector('#submit-btn');
  await page.click('#submit-btn');

  await browser.close();
})();

Playwright emphasizes Locator and auto-waiting:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  await page.locator('#submit-btn').click();

  await browser.close();
})();

browser-harness feels completely different. You usually do not write a full script. You give a goal inside an agent environment:

`1`	`Open the admin panel, download last month’s invoice, and organize it for reimbursement.`

The agent then repeatedly uses browser-harness to:

Take screenshots and understand the current page.
Click a coordinate or locate an element.
Enter text, upload files, and download files.
Decide how to close popups.
Add helper code when something is missing.
Turn reusable flows into domain skills.

This is not the style of traditional test scripts. It is the workflow of a browser agent.

How to choose

Choose Puppeteer when:

The project mainly runs in Node.js.
You only need Chrome or Chromium.
The task is screenshot, PDF generation, simple page scraping, or lightweight automation.
You want a simple API, fewer dependencies, and more manual control.
You rely deeply on Chrome DevTools Protocol.

Choose Playwright when:

You are building standard UI automation or E2E tests.
You need Chromium, Firefox, and WebKit coverage.
Your team’s main language may be Python, Java, or C#.
The page is a complex SPA with many asynchronous states and potential flaky tests.
You need codegen, Trace Viewer, test reports, and parallel testing.

Choose browser-harness when:

You are building or using AI agents.
You want the model to operate a real browser like a human.
The task steps are not fixed and require page-by-page judgment.
The target site changes often, or has many popups, iframes, and shadow DOM.
You want real web workflows handled by Claude Code, Codex CLI, or similar tools.

Conclusion

Playwright and Puppeteer are browser automation tools whose core goal is to let humans write reliable scripts. Puppeteer is lighter and closer to Chrome. Playwright is more complete and better for cross-browser testing and complex frontend applications.

browser-harness is a different direction. It is not designed to replace Playwright or Puppeteer for tests. It is designed to let AI agents control real browsers. It gives up some traditional script determinism in exchange for stronger adaptability in open-ended tasks.

So the answer is not to pick only one. Choose by task layer:

Test engineering: prefer Playwright.
Lightweight Chrome scripts: Puppeteer fits well.
AI agents doing work on the web: look at browser-harness.

References:

browser-use/browser-harness: https://github.com/browser-use/browser-harness
Playwright documentation: https://playwright.dev/
Puppeteer documentation: https://pptr.dev/
Chrome DevTools Protocol: https://chromedevtools.github.io/devtools-protocol/

What is browser-harness? A browser automation tool that lets AI agents control real Chrome

Sun, 24 May 2026 17:19:54 +0800

browser-use/browser-harness is a browser control tool for AI agents. Its goal is not to build another heavy automation framework, but to connect large language models directly to real Chrome through CDP, so they can browse pages, click, take screenshots, download files, upload files, and fill forms.

The README describes the project as a thin, editable CDP harness for letting LLMs connect to a real browser. When a task lacks a helper, the agent can add code during execution and turn reusable experience into domain skills.

This is worth watching because the browser is still the entry point for many real workflows: admin panels, SaaS dashboards, ecommerce sites, recruiting platforms, CRMs, reimbursement systems, cloud consoles, and document platforms. Many of them do not expose stable APIs, or their API permissions are harder to obtain than webpage access. Giving an agent reliable browser control is a way to fill that last mile of automation.

What browser-harness is

Structurally, browser-harness is closer to a browser runtime for agents than a browser extension for manual users.

Its core ideas are:

Connect directly to Chrome or Chromium.
Control pages through a CDP WebSocket.
Let agents combine screenshots, coordinate clicks, DOM inspection, network requests, and raw CDP.
Put task-specific helpers in agent-workspace/agent_helpers.py.
Store site-specific experience in agent-workspace/domain-skills/.
Keep the core thin instead of turning it into a large automation platform.

The README says the core architecture is roughly four core files and about 1,000 lines of code, covering install.md, SKILL.md, src/browser_harness/, agent-workspace/agent_helpers.py, and agent-workspace/domain-skills/.

The point is not to ship built-in support for every website. The point is to give the agent an operation layer close enough to a real browser, so it can fill in missing capabilities for the task at hand.

How it differs from traditional browser automation

Traditional browser automation usually revolves around testing frameworks such as Playwright, Selenium, or Puppeteer. They are good for deterministic scripts: open a page, locate an element, click it, and assert the result.

browser-harness targets a different kind of work. A user gives a goal, and the agent explores the page, judges the state, handles popups, adds helpers, and reuses site knowledge. It emphasizes adaptation during interaction.

The difference can be summarized like this:

Playwright is better when humans write scripts and agents run them.
browser-harness is better when agents look at the page and act step by step.
Traditional automation favors fixed flows.
browser-harness favors open-ended tasks.
Traditional scripts often depend on selectors.
browser-harness encourages screenshots first, visible UI actions next, and DOM or CDP when needed.

This does not mean it replaces Playwright. For stable tests, Playwright is still more mature. browser-harness is valuable because it turns real webpages into an environment an agent can operate, especially when page structure is complex, steps are not fixed, and situational judgment matters.

Why real Chrome matters

Many browser-agent tools use isolated headless browsers. That is simple to deploy and good for batch jobs, but it does not always reuse the user’s real working environment: login state, extensions, history, bookmarks, and daily browser setup.

browser-harness supports local Chrome and the Browser Use cloud browser. For local browsers, it offers two approaches:

Use chrome://inspect/#remote-debugging to allow the current Chrome instance to be connected.
Start an isolated profile with --remote-debugging-port=9222 --user-data-dir=....

If you want an agent to help with tasks inside real accounts, the docs lean toward the first approach because it reuses everyday Chrome login state, extensions, and bookmarks. For unattended automation, or when you do not want popups to interrupt work, an isolated profile or cloud browser is usually safer.

The trade-off is clear: real Chrome is closer to the user’s workflow, but the security boundary is more sensitive. An isolated browser is easier to control, but login and environment setup must be handled again.

Editable helpers and domain skills

The most interesting part of browser-harness is that it designs “what the agent learns” into the project structure.

agent-workspace/agent_helpers.py stores helpers that are created during tasks. For example, if an agent needs to upload a file and the existing tools are not enough, it can add a stable upload helper. The next time it sees a similar page, it does not have to start from scratch.

agent-workspace/domain-skills/ stores site-level experience. The README mentions areas such as LinkedIn outreach, Amazon ordering, and reimbursement systems. The project recommends letting agents generate these skills from real tasks instead of hand-writing them, because they should reflect actual page behavior.

This fits browser automation well. The hard part is often not “how to click a button,” but:

How a website redirects after login.
Which popups block the main flow.
Which selectors are stable and which are temporary class names.
How uploads, downloads, iframes, shadow DOM, and cross-origin components behave.
What hidden waits and asynchronous states exist in a specific backend.

If this knowledge only stays in one run log, it is quickly lost. Turning it into domain skills gives the agent a chance to improve over time.

Suitable scenarios

browser-harness is better suited for:

Operating real web admin panels for users.
Completing repeated flows in systems without APIs.
Personal or enterprise web tasks that depend heavily on login state.
Complex interactions where screenshots are needed to judge page state.
Agents that need to add tools and site knowledge while running.
Multiple sub-agents each using an isolated browser.
Researching browser-agent runtime design.

Concrete examples include organizing web tables, submitting internal forms, downloading invoices, uploading files, handling reimbursement workflows, checking order status, configuring SaaS dashboards, and extracting information from logged-in pages.

If the task is only to fetch static pages, a browser may not be needed. The project’s own SKILL.md also notes that static pages can often be fetched through HTTP in bulk. Browsers should be reserved for tasks that truly need page state, login state, and interaction.

Risks to watch

Letting an AI agent control real Chrome is powerful, but risky.

First, the permission boundary must be clear. Real Chrome may contain email, payment dashboards, cloud consoles, company systems, and personal accounts. Once an agent can operate the browser, it effectively has access to part of those webpage permissions.

Second, do not hand credentials to the model. For login pages, payment verification, and second confirmations, the user should handle the sensitive step. The agent can wait for login to finish, but it should not read or enter passwords, verification codes, or payment details from screenshots.

Third, automation is not the same as delegation. Many web tasks look simple but may involve risk controls, mistaken clicks, data deletion, bulk submissions, or irreversible operations. Start with read-only, low-risk, reversible workflows.

Fourth, domain skills should not leak private data. Site knowledge can be shared, but account names, internal URLs, customer data, coordinate logs, and one-off task details should not be written into skills.

Fifth, choose the browser connection mode carefully. Reusing daily Chrome is convenient when login state matters. For long-running automation, an isolated profile or cloud browser is more controllable.

Why it matters for AI agent tools

browser-harness represents a pragmatic direction for agent tooling: build less platform, and give the model a direct interface to the real environment.

Many agents fail at two ends. On one end, the model can reason but cannot touch the real page. On the other, automation frameworks are powerful but require humans to hard-code the flow. browser-harness tries to connect the two: the browser holds real-world state, while the agent observes, decides, and adds tools.

That is also the meaning of a self-improving harness. It does not mean the agent magically becomes smarter. It means reusable operation experience is placed into the project structure, so the next task can avoid some of the same detours.

For developers, its value is mainly in three areas:

A browser control layer for personal agents.
A reference for studying browser automation and agent workflows.
An experimental framework for turning web workflows into reusable skills.

It is not the answer to every browser automation problem, but it points in a clear direction: when agents truly help people do work, the tool layer should not only call APIs. It should also understand and operate the web interfaces people use every day.

Conclusion

browser-use/browser-harness is interesting not because it wraps many advanced features, but because it brings several key browser-agent questions into focus: real Chrome, CDP, screenshot-driven control, editable helpers, site skill accumulation, and user permission boundaries.

If you are writing stable end-to-end tests, Playwright or Selenium is still a better fit. If you want agents such as Codex or Claude Code to handle real webpage tasks, browser-harness offers an entry point that matches how agents work.

In practice, start with low-risk tasks: let it read pages, take screenshots, and extract information first. Then gradually try clicking and submitting. Once it can reliably understand page state, you can consider giving it longer workflows.

References:

GitHub project: https://github.com/browser-use/browser-harness
README: https://github.com/browser-use/browser-harness/blob/main/README.md
Installation guide: https://github.com/browser-use/browser-harness/blob/main/install.md
Usage guide: https://github.com/browser-use/browser-harness/blob/main/SKILL.md

GitHub AI Open Source Project Categories: From Coding Agent to RAG Knowledge Bases

Thu, 21 May 2026 08:53:13 +0800

This page groups GitHub AI projects by application direction, covering AI coding and Coding Agents, agent skills and workflows, RAG and knowledge bases, multimodal creation, local models and inference, vertical applications and automation, and AI application development infrastructure. New projects can be added later using the same structure.

Category Summary

Category	Projects	Who Should Start Here
AI Coding and Coding Agents	22	Users who often work with Claude Code, Codex, Cursor, terminal agents, or repository automation
Agent Skills and Workflows	7	Users who want to standardize AI coding, research, or content workflows
RAG, Knowledge Bases, and Memory	7	Users who need document retrieval, knowledge bases, long-term memory, web crawling, or structured extraction
Vertical Applications and Automation	7	Users looking at finance, trading, Xianyu monitoring, desktop control, browser automation, and other applied scenarios
Multimodal and Content Creation	5	Users working on images, video, transcription, prompt libraries, and content distribution
AI Application Development Infrastructure	5	Developers building AI apps, browser automation, or Prompt/MCP toolchains
Local Models and Inference	1	Users interested in local DeepSeek, inference engines, and hardware adaptation

The distribution shows several high-frequency directions in current AI open source projects: AI coding tools dominate, followed by agent workflows, RAG knowledge bases, and concrete application scenarios. Pure model inference projects are fewer here because much local deployment content is organized around models, GPUs, or deployment plans rather than a single GitHub project.

AI Coding and Coding Agents

This group focuses on code understanding, code modification, engineering workflows, and terminal agents. It is the largest group, with 22 projects.

Project	Article	GitHub	Core Use	Best For
Ralph	Ralph: turning Claude Code and Amp into an autonomous development loop	snarktank/ralph	Drive Claude Code / Amp through PRD, planning, execution, and review loops	Users who want a straighter agent coding process
Claude-Mem	Claude-Mem: long-term cross-session memory for Claude Code	thedotmack/claude-mem	Add cross-session memory to Claude Code	Heavy Claude Code users
Claude Code Hooks Mastery	Claude Code Hooks Mastery: getting started with 13 hooks lifecycle stages	disler/claude-code-hooks-mastery	Learn Claude Code hooks lifecycle and automation control	Users who want to customize Claude Code workflows
Compound Engineering Plugin	Compound Engineering Plugin: turning AI coding into planning, execution, and review loops	EveryInc/compound-engineering-plugin	Split AI coding into planning, execution, and review cycles	Users who care about engineering discipline in AI coding
free-claude-code	free-claude-code: connecting Claude Code to OpenRouter, DeepSeek, and local models	Alishahryar1/free-claude-code	Use a proxy to connect Claude Code to different model backends	Users who want to reduce Claude Code cost
Hermes Agent	What is Hermes Agent: overview, strengths, quick start, and OpenClaw comparison	NousResearch/hermes-agent	Local agent framework with tool calling and task execution	Users who want to run local agents
OpenHarness	What OpenHarness can do as an open source agent harness	HKUDS/OpenHarness	Agent harness and multi-agent execution framework	Users researching agent orchestration
CodexBridge	Using Codex with domestic LLMs: OpenAI-compatible APIs and CodexBridge	begonia599/CodexBridge	Connect Codex to OpenAI-compatible model APIs	Users who want Codex with domestic models
ccx	Using CCX to manage OpenAI-compatible APIs for Codex and domestic models	BenedictKing/ccx	Manage API proxies for Claude, Codex, Gemini, and more	Multi-model switching users
cc-haha	cc-haha: a desktop workspace for Claude Code	NanmiCoder/cc-haha	Desktop workspace and Computer Use entry for Claude Code	Claude Code users who prefer a GUI
DeepSeek-TUI	DeepSeek-TUI: turning DeepSeek V4 into a terminal coding agent	Hmbown/DeepSeek-TUI	Run a DeepSeek coding agent in the terminal	DeepSeek and command-line users
Open Design	Open Design: turning Claude Code and Codex into AI design tools	nexu-io/open-design	Bring Claude Code / Codex into design generation	Users who want agents for design prototypes
agentmemory	agentmemory: persistent memory for Claude Code, Codex, and Cursor	rohitg00/agentmemory	Add persistent memory to coding agents	Developers maintaining long-running projects
Graphify	Graphify: turning a codebase into an AI-queryable knowledge graph	safishamsi/graphify	Convert a codebase into a knowledge graph to reduce repeated file reads	Large-codebase users
oh-my-pi	What is oh-my-pi? An AI coding assistant that connects terminal, IDE, and debugger	can1357/oh-my-pi	Connect terminal, IDE, LSP, and debugger as a local AI coding console	Developers who want to unify CLI and IDE workflows
Claude Plugins Official	Claude Code now has a plugin directory: what to install, how to install it, and what to watch	anthropics/claude-plugins-official	Official Claude Code plugin directory and installation entry point	Users who want to extend Claude Code
CodeGraph	What is CodeGraph? A local code map for Claude Code, Codex, and Cursor	colbymchenry/codegraph	Generate local indexes and relationship graphs to help Coding Agents understand projects	Developers maintaining medium-to-large codebases
CC Switch	CC Switch: managing Claude Code, Codex, Gemini CLI, and OpenClaw in one desktop tool	farion1231/cc-switch	Manage multiple AI CLI tools and account/config switching	Users of multiple CLI tools
Warp	Warp open source: from terminal to Agentic Development Environment	warpdotdev/warp	Agentic terminal and development environment	Heavy terminal users
opencode	opencode vs Claude Code vs Codex: open source AI coding tools guide	anomalyco/opencode	Open source AI coding agent	Users looking for Claude Code / Codex alternatives
9Router	9Router: connecting Claude Code, Codex, and Cursor to one AI router	decolua/9router	AI coding model routing and token cost control	Multi-tool, multi-model users
goose	goose: an open source AI Agent across desktop, CLI, and API	aaif-goose/goose	Open source agent across desktop, CLI, and API	Users who want a general agent workspace

Agent Skills and Workflows

This group focuses on turning AI capabilities into repeatable skills, processes, and specifications. It includes 7 projects.

Project	Article	GitHub	Core Use	Best For
mattpocock/skills	Rejecting Vibe Coding: Matt Pocock’s skills repo adds engineering constraints to AI coding	mattpocock/skills	Use skills to constrain AI coding workflows	Users who want engineering discipline for agents
Superpowers	Superpowers: bringing coding agents back into engineering workflows	obra/superpowers	Agentic skills framework and software development methodology	Users who want systematic coding-agent workflows
Prompt-Vault	Prompt-Vault: a prompt specification library for testing AI coding ability	w512/Prompt-Vault	Collect prompt specs for testing AI coding ability	Model and tool evaluators
web-video-presentation	web-video-presentation: an agent skill for turning articles into recordable web videos	ConardLi/garden-skills	Turn articles into recordable web videos	Content creators and automation users
nuwa-skill	nuwa-skill: making “distilling a person” into an executable workflow	alchaincyf/nuwa-skill	Recreate a person’s expression and thinking flow with a skill	Users building style-based agents
Scientific Agent Skills	Scientific Agent Skills: giving research workflows to AI agents	K-Dense-AI/scientific-agent-skills	Skill collection for scientific workflows	Researchers, data analysts, and technical writers
easy-vibe	easy-vibe: a learning map for Vibe Coding beginners	datawhalechina/easy-vibe	Learning map for Vibe Coding	AI coding beginners

RAG, Knowledge Bases, and Memory

This group addresses document retrieval, knowledge base construction, long-term memory, and structured extraction. It includes 7 projects.

Project	Article	GitHub	Core Use	Best For
LangExtract	Google LangExtract: extracting structured data from long text with LLMs	google/langextract	Extract structured information from long text	Information extraction and data processing users
qmd	qmd: local Markdown document search for AI agents	tobi/qmd	Local Markdown document search	Users managing knowledge in Markdown
Firecrawl	Firecrawl: web search, crawling, and interaction API for AI agents	firecrawl/firecrawl	Web crawling, search, and structured data entry point	RAG and agent data-ingestion users
RAGFlow	RAGFlow: features and usage of an open source RAG engine	infiniflow/ragflow	Open source RAG engine	Enterprise knowledge base and document Q&A users
OpenHuman	OpenHuman: the desktop route for open source personal AI agents	tinyhumansai/openhuman	Local-first personal AI agent and memory layer	Users who want to integrate personal data
OpenKB	OpenKB: compiling documents into continuously updated LLM knowledge bases	VectifyAI/OpenKB	Compile documents into updatable knowledge bases	Documentation knowledge-base maintainers
PageIndex	PageIndex: reasoning-style RAG document indexing without vector databases	VectifyAI/PageIndex	Reasoning-style document indexing without vector databases	Users watching new RAG approaches

Multimodal and Content Creation

This group covers image, video, transcription, and content distribution scenarios. It includes 5 projects.

Project	Article	GitHub	Core Use	Best For
rembg	rembg: local image background removal tool	danielgatis/rembg	Local image background removal	E-commerce, design, and image-processing users
awesome-gpt-image-2-prompts	GPT-Image 2 prompt library: e-commerce, posters, portraits, and UI	EvoLinkAI/awesome-gpt-image-2-prompts	GPT-Image 2 prompts and case library	AI art and prompt users
faster-whisper	faster-whisper: a faster Whisper transcription engine	SYSTRAN/faster-whisper	High-performance speech-to-text	Subtitle, transcription, and speech-processing users
Pixelle-Video	Pixelle-Video: an open source AI engine for generating short videos from one topic	AIDC-AI/Pixelle-Video	One-topic short-video generation workflow	Short-video and AIGC creators
AiToEarn	Too many content platforms? AiToEarn uses AI agents to help creators save effort	yikart/AiToEarn	Multi-platform content distribution and creator automation	Content operators and creators

Local Models and Inference

This group focuses on local model runtime and inference experiments. It currently has fewer projects, with 1 project.

Project	Article	GitHub	Core Use	Best For
ds4	Running DeepSeek 4 locally: Antirez ds4 on Apple Silicon Mac	antirez/ds4	Experiment with running DeepSeek 4 on Apple Silicon	Local model and inference experiment users

Vertical Applications and Automation

This group applies agents or AI capabilities to finance, trading, browsers, desktops, e-commerce monitoring, and other concrete scenarios. It includes 7 projects.

Project	Article	GitHub	Core Use	Best For
TradingAgents-CN	TradingAgents-CN: a multi-agent financial trading research framework for Chinese users	hsliuping/TradingAgents-CN	Multi-agent financial trading research framework	Quant, finance, and agent researchers
FinceptTerminal	FinceptTerminal: open source financial terminal, quant research, and AI Agent workspace	Fincept-Corporation/FinceptTerminal	Financial terminal, quant research, and AI agent workspace	Financial analysis and quant users
Anthropic financial-services	Anthropic financial-services: reusable templates for financial agent scenarios	anthropics/financial-services	Financial services agent templates	Users building financial AI solutions
ai-goofish-monitor	ai-goofish-monitor: open source AI monitoring system for Xianyu products	Usagi-org/ai-goofish-monitor	AI product monitoring and Xianyu automation	Second-hand marketplace monitoring users
CloakBrowser	CloakBrowser: a more human-like browser for Playwright and Puppeteer	CloakHQ/CloakBrowser	More human-like browser automation environment	Browser automation and agent operation scenarios
UI-TARS-desktop	Let AI operate the computer? UI-TARS-desktop connects desktop, browser, and tools	bytedance/UI-TARS-desktop	Desktop, browser, and tool operation agent	Users who want AI to operate computers
AI-Trader	What is AI-Trader: a platform for AI agents to publish trading signals and run simulations	HKUDS/AI-Trader	AI agent trading signals and simulated trading platform	Financial agent and trading researchers

AI Application Development Infrastructure

This group provides foundational components for building AI applications and agent toolchains. It includes 5 projects.

Project	Article	GitHub	Core Use	Best For
Prompt Optimizer	Prompt Optimizer: open source prompt optimization, testing, and MCP tools	linshenkx/prompt-optimizer	Prompt optimization, testing, and MCP tools	Prompt engineering and app-tuning users
Playwright CLI	Playwright CLI basics: installation, skills, sessions, and common commands	microsoft/playwright-cli	Browser automation CLI for coding agents	Agent users who need browser operation
Vercel AI SDK	What is Vercel AI SDK? A unified toolkit for TypeScript AI apps	vercel/ai	TypeScript AI application SDK	Front-end and full-stack developers
CLIProxyAPI	CLIProxyAPI: wrapping Codex, Claude Code, and Gemini CLI into unified APIs	router-for-me/CLIProxyAPI	Wrap multiple AI CLIs and OAuth login states as compatible APIs	Users who want unified access to Codex, Claude Code, and Gemini CLI
CLIProxyAPI Management Center	CLIProxyAPI Management Center: a visual admin console for CLIProxyAPI	router-for-me/Cli-Proxy-API-Management-Center	Web admin UI for CLIProxyAPI configuration, accounts, logs, and OAuth	Users running CLIProxyAPI as a team gateway or account pool

Google I/O 2026 Summary: Gemini 3.5, Omni, Antigravity, and System-Level Agents

Thu, 21 May 2026 00:07:06 +0800

The main line of Google I/O 2026 is clear: Google is moving Gemini from “model” and “chat assistant” into a fuller Agent ecosystem. It is not only answering questions. It is entering Search, Android, developer tools, video creation, shopping, Workspace, hardware, and enterprise platforms to help users complete longer task chains.

This article summarizes the main Google I/O 2026 announcements from official releases and a developer perspective. For real development, always follow the official Google, Android Developers, and Gemini API documentation.

One-Sentence Summary

The keyword for Google I/O 2026 is agentic Gemini era.

Google announced or strengthened several lines:

Gemini 3.5 Flash: speed, action capability, and Agent workflows.
Gemini Omni: creating content from any input, starting with video creation and editing.
Gemini app: moving from chat assistant to proactive, always-on, task-capable personal Agent.
Google Antigravity 2.0: evolving from an AI coding tool into an Agent-first development platform.
Gemini API Managed Agents: creating hosted Agents through APIs that can reason, use tools, and execute code.
Google AI Studio: expanding to mobile, native Android support, and project export to Antigravity.
Search, Shopping, YouTube, Workspace, and Android: all gaining stronger Gemini and Agent capabilities.

In other words, Google is no longer only showing “how smart the model is.” It is showing how models enter products, tools, and systems to actually execute tasks for users.

Gemini 3.5 Flash: From Prompt to Action

Gemini 3.5 is Google’s new model family at I/O 2026, with Gemini 3.5 Flash as the first public focus.

Google does not position it as simply a “faster chat model,” but as a high-speed engine for real Agent workflows. Google’s developer article describes 3.5 Flash as combining frontier intelligence and high speed to support the shift from prompt to action.

Its main significance:

Optimized for Agent and coding scenarios.
Supports longer task chains and tool use.
Available through Antigravity, Gemini API, Google AI Studio, Android Studio, Gemini Enterprise, and other entry points.
Better suited for applications that need fast responses, multi-turn execution, and frequent tool calls.

For developers, Gemini 3.5 Flash is not just another model option. It is one of the default engines for Google’s new Agent toolchain.

Gemini Omni: Video and World-Model Capabilities

Gemini Omni is another core I/O 2026 announcement. Google describes it as creating content from any input, with the current focus starting from video.

Its highlights fall into three areas:

Multimodal input: text, images, video, audio, and more can be used as references.
Video editing: users can modify video over multiple turns with natural language instead of stopping after one generation.
World understanding: it emphasizes consistency in physics, scenes, actions, narrative, and audiovisual output.

This means AI video tools are moving from “enter one prompt to generate a clip” toward “revise step by step as if talking to an editor.” For creators, the real value is not one-shot generation, but a controllable, traceable, and iterative editing process.

Gemini App: From Chat Assistant to Always-On Personal Agent

Google is also pushing Gemini app in a more Agent-like direction. Official posts describe Gemini app as becoming more proactive, offering daily briefs and always-on assistance.

Key points include:

Gemini 3.5 Flash entering Gemini app.
A new UI and more dynamic interaction.
Personal AI Agent concepts such as Gemini Spark.
Proactive daily briefs that organize what users need to know each day.
More emphasis on 24/7 background assistance instead of waiting for the user to start every chat.

This is the part that affects ordinary users most. Gemini used to feel more like a “you ask, I answer” assistant. After I/O 2026, Google wants it to feel more like a personal Agent that follows up on tasks, proactively reminds users, and works across products.

Antigravity 2.0: Developer Tools Become Agent-First

One of the most important developer-side announcements is Google Antigravity 2.0.

Google positions Antigravity as an agent-first development platform. After I/O 2026, it is not only helping developers write code. It is meant to help developers move from ideas and prototypes to Agent orchestration and production delivery.

Core changes listed by Google include:

Antigravity 2.0 standalone desktop app.
Multi-Agent parallel orchestration.
Dynamic subagents.
Background scheduled tasks.
Integration with Google AI Studio, Android, Firebase, and related ecosystems.
Antigravity CLI for terminal users.
Antigravity SDK for custom Agent behavior and deployment.

This shows that AI coding tools are entering the next stage after “code completion / conversational generation”: developers will manage multiple executable Agents, not just one chat window.

Gemini API Managed Agents: Hosting Agents as API Capabilities

Google also introduced Managed Agents in the Gemini API.

According to the official description, these Agents can be created with a single API call. They can reason, use tools, and execute code in an isolated Linux environment, supported by the Antigravity agent harness.

This matters to developers:

You do not need to build the full Agent runtime yourself.
You can get a persistent, isolated execution environment.
Multi-turn interactions can preserve files and state.
Agents can be extended with markdown skills, custom instructions, and templates.
They are available through Interactions API and Google AI Studio.

If this line matures, Agent platforms will increasingly look like cloud services: developers will not only call models, but call Agents with state, tools, execution environments, and security boundaries.

Google AI Studio: From Prompt Playground to App Generation Entry Point

At I/O 2026, Google AI Studio also moves further.

Key changes include:

Google AI Studio mobile app for capturing ideas and generating prototypes on mobile.
Workspace API integration, making it easier for Agents to access Google Workspace.
Project export to Antigravity, carrying context into local development and production work.
Native Android support, allowing users to build Android apps from prompts.
Google Play Console integration to publish apps to test tracks.

This turns AI Studio from “a place to tune prompts and test models” into an entry point from idea to app. Its relationship with Antigravity is clearer too: AI Studio is good for fast ideation and generation, while Antigravity is better for continued development, orchestration, debugging, and delivery.

Android and AppFunctions: Key Interfaces for Mobile Agents

Android system-level Agents are worth watching on their own, but they need to be understood through accurate interfaces and product boundaries.

The most important current piece is Android’s official AppFunctions. The official documentation describes AppFunctions as an Android platform API with Jetpack libraries that lets apps expose their capabilities to agents, assistants, and other authorized callers. It also simplifies Android MCP integration.

Its significance is that mobile automation no longer has to rely only on screenshots, OCR, simulated taps, and UI control positioning.

Traditional mobile automation looks like:

Recognize the screen.
Find the button.
Simulate a tap.
Wait for the page to change.
Retry after errors.

The AppFunctions direction is:

Apps declare what they can do.
Agents call those capabilities with authorization.
The system handles permissions, call boundaries, and security constraints.

This will affect Android app design. Future apps will not only need human-facing UIs, but also core capabilities designed as Agent-callable interfaces.

Search, Shopping, and Content Products Are Becoming Agentic Too

Google I/O 2026 changes are not limited to models and developer tools. Search and consumer products are changing at the same time.

Official I/O summaries mention:

Search entering a new AI Search stage.
Information agents appearing in Search.
Gemini Spark and Daily Brief entering Gemini app.
Universal Cart making shopping carts smarter.
Ask YouTube enabling conversational queries and navigation over video content.
Gemini capabilities expanding to more products and form factors.

These announcements show that Google’s Agent direction is not a single product. It is spreading horizontally across search, video, shopping, productivity, mobile, and hardware scenarios.

Practical Impact for Developers

The biggest impact of Google I/O 2026 for developers is not “another model.” It is that the development target is changing.

Developers used to mainly build:

Apps.
Websites.
APIs.
Plugins.
Automation scripts.

Next, they will also build:

App capabilities callable by Agents.
Multi-Agent workflows.
Stateful tool execution environments.
Auditable automation flows.
Human-in-the-loop confirmation mechanisms.
Integrations with MCP, AppFunctions, Workspace API, Playwright, Firebase, and other tools.

Software will increasingly look like a set of capabilities, not only a set of interfaces. Products that expose their capabilities clearly, reliably, and safely to Agents will be more likely to enter users’ automation task chains.

Impact on Mobile Automation

Mobile automation will gradually move from “GUI first” to “API first, GUI as fallback.”

In the short term, screenshot recognition, OCR, simulated taps, and browser automation still matter because many older apps have no standard interface.

In the long term, if Android AppFunctions, MCP, and system-level permission models mature, stable task execution will lean toward:

First calling capabilities declared by apps.
Then calling system interfaces when needed.
Then using GUI automation as a fallback.

This will change RPA, mobile Agents, testing tools, and app ecosystems. Apps that expose capabilities are easier for system-level Agents to call. Apps that do not may still only be operated by the old “look at screen, tap screen” approach.

Security, Permissions, and Auditing Become Hard Requirements

The stronger Agents become, the higher the risk.

If an Agent can execute tasks across apps, make payments, change settings, access files, and read context, it needs clear security boundaries:

Permission levels.
Explicit user authorization.
Secondary confirmation for sensitive actions.
Sandbox isolation.
Operation logs.
Reversibility and rollback.
Enterprise auditing and compliance.

This is why Google emphasizes isolated environments for hosted Agents, permission requirements for AppFunctions, enterprise platforms, and controlled deployment. The future of Agents is not “do anything without limits,” but executable, traceable, and governable behavior inside security boundaries.

Summary

The main content of Google I/O 2026 can be summarized in one sentence: Google is turning Gemini into an Agent platform spanning models, apps, systems, developer tools, and hardware.

Gemini 3.5 Flash provides speed and action capability. Gemini Omni pushes multimodal creation toward video and world understanding. Gemini app becomes a proactive personal assistant. Antigravity 2.0 and Managed Agents push developer tools toward Agent-native development. AppFunctions lets Android apps begin exposing capabilities to intelligent agents.

For developers, the next thing to watch is not only model parameters, but how to structure application capabilities, connect to Agent toolchains, design permissions and auditing, and make products safely and reliably callable in a system-level Agent ecosystem.

References:

What Is PageIndex? A Reasoning-Based RAG Document Index Without Vector Databases

Wed, 20 May 2026 23:51:37 +0800

VectifyAI/PageIndex is an interesting RAG project. Instead of starting with “build another vector database,” it first organizes long documents into a tree structure similar to a table of contents, then lets an LLM perform reasoning-based retrieval along that tree.

Project link: VectifyAI/PageIndex

At the time of writing, the GitHub page shows about 31.8k stars and 2.7k forks, with an MIT license. The README positions it as Vectorless, Reasoning-based RAG: RAG without a vector database, based on reasoning.

What Problem It Tries to Solve

The common path for traditional RAG is: chunk the document, vectorize the chunks, store them in a vector database, then retrieve passages by similarity search. This approach is simple, general, and mature, but it often runs into several problems with long professional documents:

Similarity is not the same as true relevance.
Document structure is broken apart by chunking, and section relationships are lost.
Retrieval results are hard to explain, making it difficult to say why a passage was selected.
For financial reports, regulatory filings, legal documents, and technical manuals, questions often require reasoning across sections.

PageIndex takes the opposite route: first organize the document into a semantic tree, then let the model search it like a human reading a table of contents, jumping into sections, and narrowing down to details.

The Basic PageIndex Workflow

The README describes PageIndex retrieval in two steps:

Generate a Table-of-Contents-like tree index for the document.
Perform reasoning-based retrieval through tree search.

This tree is not just a file directory. It is a document structure designed for LLM use. Nodes can contain titles, page ranges, summaries, child nodes, and other metadata. When answering a question, the model does not need to face a pile of fragmented chunks immediately. It can first decide which section to enter, then continue searching downward.

This method is better suited to documents that are well structured but very long, such as:

Financial reports and SEC filings.
Regulatory and compliance documents.
Academic textbooks and papers.
Legal documents.
Technical manuals and product documentation.
Large PDFs that exceed the model context window.

How It Differs From Traditional Vector RAG

PageIndex’s main selling points can be summarized in five areas.

First, it does not require a Vector DB. It relies on document structure and LLM reasoning to locate content, rather than only using vector similarity search.

Second, it does not use traditional chunking. Documents are organized by natural sections instead of fixed-length text fragments.

Third, explainability is stronger. The retrieval path can map back to pages, sections, and tree nodes, making it easier to trace than “this text was hit by vector similarity.”

Fourth, retrieval is context-aware. The question, conversation history, and domain background can all affect the tree search path.

Fifth, it is closer to how human experts read documents. People usually do not cut an entire document into small chunks and calculate similarity; they first inspect the table of contents, locate sections, and then read details.

This does not mean vector databases have no value. A more accurate view is that PageIndex fits scenarios where “semantic similarity is not enough, and structure plus reasoning need to participate” in long-document retrieval.

How to Run It Locally

The README provides a local self-hosting path. First install dependencies:

`1`	`pip3 install --upgrade -r requirements.txt`

Then create a .env file in the project root and write your LLM API key. The project supports multiple models through LiteLLM:

`1`	`OPENAI_API_KEY=your_openai_key_here`

Generate a PageIndex structure for a PDF:

`1`	`python3 run_pageindex.py --pdf_path /path/to/your/document.pdf`

Markdown is also supported:

`1`	`python3 run_pageindex.py --md_path /path/to/your/document.md`

Common optional parameters include:

--model
--toc-check-pages
--max-pages-per-node
--max-tokens-per-node
--if-add-node-id
--if-add-node-summary
--if-add-doc-description

The README also notes that the local open-source version uses standard PDF parsing. For complex PDFs, the project’s cloud service provides enhanced OCR, tree building, and retrieval pipelines.

Agentic Vectorless RAG Example

The project also provides an agentic vectorless RAG example using self-hosted PageIndex and OpenAI Agents SDK. Install the optional dependency and run it:

1
2

pip3 install openai-agents
python3 examples/agentic_vectorless_rag_demo.py

The value of this example is that it pushes PageIndex from “generate a document tree” to “let an Agent use the document tree for retrieval.” If you are building an enterprise knowledge base, financial report Q&A, regulatory Q&A, or technical documentation Agent, this example is more worth running than only reading the README.

Cloud Service, MCP, and API

PageIndex is not just a GitHub repo. The project page also lists several entry points:

Self-hosting: run the open-source code locally, suitable for experiments and controlled deployments.
Chat Platform: a ChatGPT-style document analysis platform.
MCP / API: useful for integrating with existing Agents or automation workflows.
Enterprise: for private or on-premises deployment.

This shows that its positioning is not a simple demo. It aims to turn “reasoning-based document retrieval” into an integrable document intelligence infrastructure.

Suitable Scenarios

PageIndex is suitable for tasks such as:

Long PDF Q&A.
Financial reports, annual reports, prospectuses, and regulatory filing analysis.
Legal and compliance document retrieval.
Technical manual Q&A.
Multi-section textbook or paper retrieval.
Enterprise knowledge bases that need explainable retrieval paths.
Providing structured document context to Agents.

If your material is short, has little structure, or is just a normal FAQ, traditional embedding + vector DB may already be enough. PageIndex’s advantages are more likely to appear in long documents, strong structure, professional domains, and questions that require reasoning.

Things to Watch

First, PageIndex still depends on LLMs. Tree building, summaries, and retrieval quality are affected by model capability, prompts, and document parsing quality.

Second, the local version uses standard PDF parsing. Complex scanned documents, chart-heavy PDFs, or messy layouts may require OCR and stronger preprocessing.

Third, vectorless does not mean zero cost. Tree building itself also consumes model calls and time, especially for large-scale document collections.

Fourth, PageIndex is more like a document structure indexing and reasoning retrieval framework. It does not directly replace every RAG stack. In production, it may also be combined with vector retrieval, keyword retrieval, permission control, caching, and audit systems.

Summary

What makes PageIndex interesting is that it shifts RAG from “text similarity retrieval” toward “document structure + LLM reasoning.” For long and professional documents, this direction is worth watching.

If you are building enterprise document Q&A, financial report analysis, regulatory retrieval, or technical manual Agents, PageIndex is a new RAG architecture reference: give documents structure first, then let the model reason along that structure, instead of breaking everything into chunks and putting it all into a vector database from the beginning.

References:

GitHub: VectifyAI/PageIndex

Gemini 3.5 Is Here: Flash Leads as Google Focuses on Agents and Long-Running Tasks

Wed, 20 May 2026 22:51:31 +0800

Google officially released the Gemini 3.5 series on May 20, 2026. The first model available is Gemini 3.5 Flash. Its positioning is not just chat, but agents, code generation, and long-running complex task execution.

The message is clear: Google wants Gemini 3.5 to answer questions, but also to plan, execute, check results, and keep work moving across multi-step workflows.

Gemini 3.5 Flash Comes First

Gemini 3.5 Flash is already available to several groups:

General users can try it in the Gemini app and AI Mode in Google Search.
Developers can use it through Google Antigravity, Google AI Studio, and the Gemini API in Android Studio.
Enterprise users can access it through Gemini Enterprise Agent Platform and Gemini Enterprise.

Google also said Gemini 3.5 Pro is still in development, already being used internally at Google, and expected to launch next month.

This means the 3.5 series will continue the Flash and Pro split: Flash emphasizes speed, cost, and scalable execution, while Pro will likely target more complex and higher-capability use cases.

The Focus Is Agents and Coding

Google describes Gemini 3.5 Flash as one of its strongest models for agents and coding. The announcement says it beats some Gemini 3.1 Pro results on coding and agent benchmarks such as Terminal-Bench 2.1, GDPval-AA, MCP Atlas, and CharXiv Reasoning.

Most users do not need to care about every benchmark number. The more important point is that Google is pushing model capability toward executable workflows: not only writing code, but also migrating old projects, developing complex apps, organizing financial reports, analyzing data, and running repeated tests.

In the Antigravity development framework, Gemini 3.5 Flash can use multiple collaborating subagents to handle large tasks. Google showed examples such as reading the AlphaZero paper and building a playable game, converting legacy code to Next.js, and generating cityscapes and UI options in parallel.

The direction is clear: AI coding tools are moving from “generate a piece of code” toward “coordinate multiple agents to complete a project.”

Stronger Multimodal UI and Graphics

Gemini 3.5 Flash builds on Gemini 3’s multimodal foundation. Google says it can generate richer web UIs, interactive animations, and visual content.

The announcement includes examples such as:

Creating interactive animations for research papers.
Turning text descriptions into interactive hardware models.
Generating a complete brand concept for a school fundraiser.
Producing multiple UX options for a checkout flow in a short time.

This matters for developers and product teams. The model is no longer only writing explanations. It can participate in frontend prototypes, interaction design, and visualization work.

Enterprise Use: Automating Time-Consuming Workflows

Google listed several partner examples. Shopify uses subagents to analyze complex data and forecast merchant growth. Macquarie Bank is testing 3.5 Flash on documents over 100 pages to accelerate account-opening workflows. Salesforce is integrating it into Agentforce. Ramp uses it to improve OCR for complex invoices. Xero uses AI agents for administrative workflows. Databricks uses automated workflows to monitor data anomalies and suggest fixes.

These examples point to the same trend: enterprise adoption of large models is moving from one-off Q&A to workflow automation. Whether a model is inexpensive, fast, and stable over long tasks can matter more than whether one answer looks impressive.

Gemini Spark: A Personal AI Agent

Google also announced Gemini Spark, a personal AI agent powered by Gemini 3.5 Flash. Its goal is to run over long periods and proactively perform tasks under user guidance.

Gemini Spark has started rolling out to trusted testers. Google plans to open a beta next week to Google AI Ultra subscribers in the United States.

This is worth watching. Google Search, the Gemini app, Android, Workspace, and browser-related ecosystems already touch many parts of personal digital life. If a personal agent can connect with these entry points, its impact may be larger than a standalone chatbot.

Safety Moves Further Upstream

Google says Gemini 3.5 was developed under its Frontier Safety Framework, with strengthened protections for information security and CBRN-related risks. The announcement also mentions interpretability tools that help examine and understand model reasoning before responses are delivered.

This shows that frontier model releases are no longer only a capability race. The more a model emphasizes agents, autonomous execution, and long-running tasks, the more important safety controls, false refusal rates, harmful-output prevention, and interpretability become.

How to View Gemini 3.5

Gemini 3.5 Flash is not just another model launch. It looks more like Google’s bet on the next shape of AI products: models that can call tools, split tasks, coordinate execution, generate UIs, and enter personal and enterprise workflows.

For developers, the important things to watch are the real experience in Google Antigravity, AI Studio, the Gemini API, and Android Studio. For enterprises, the question is whether it can reliably reduce manual work in real workflows, not just score well on benchmarks.

Gemini 3.5 Pro is not publicly available yet. Once Pro ships, the differences between Flash and Pro in capability, price, speed, and context handling will decide which production scenarios each model fits best.

References:

Google Blog: Gemini 3.5

agentmemory: Persistent Memory for Claude Code, Codex, Cursor, and Other Coding Agents

Tue, 19 May 2026 10:56:50 +0800

rohitg00/agentmemory is a persistent memory system for AI coding agents. Its goal is straightforward: Claude Code, Codex CLI, Cursor, Gemini CLI, OpenCode, and similar tools should not have to relearn the project background, architecture decisions, and historical problems every time a new session starts.

Project URL: https://github.com/rohitg00/agentmemory

At the time of writing, the GitHub API showed about 13k stars, TypeScript as the main language, and an Apache-2.0 license. The README describes it as “Persistent memory for AI coding agents.”

What Problem Does It Solve

A common pain point for coding agents is memory fragmentation. You may ask an agent to fix an authentication issue today, then open a new conversation tomorrow, and it no longer knows:

Why a certain architecture decision was made.
Which files are sensitive and should be changed carefully.
What bugs were fixed before.
What commands, tools, or local services the project uses.
Which conventions the team follows.

Static notes help, but they are often forgotten or not connected to the active workflow. agentmemory tries to provide a shared memory layer that can be used across different AI coding tools.

Supported Agents

The README lists support for Claude Code, Codex CLI, Cursor, Gemini CLI, OpenCode, and other MCP-compatible tools. The core idea is to expose memory through a local service, MCP, hooks, and integrations, so multiple assistants can share the same project context.

This is especially useful for teams that switch between tools. One developer may use Cursor, another may use Claude Code, while automation runs through Codex CLI. A shared memory layer reduces repeated explanation.

Quick Start

Install globally:

npm install -g @agentmemory/agentmemory
agentmemory
agentmemory demo
agentmemory connect claude-code

Or run with npx:

`1`	`npx @agentmemory/agentmemory`

The local service is available at:

`1`	`http://localhost:3113`

In practice, the first step is usually to start the memory service, connect the coding assistant, and then let the agent read or write project memories during development.

How It Differs From Static Memory Files

Many teams already maintain AGENTS.md, CLAUDE.md, README notes, or local documentation. These files are useful, but they are static. They do not automatically capture session history, task outcomes, or recurring decisions.

agentmemory is closer to a persistent context service. It can store and surface memories that are relevant to the current project or task. The goal is not to replace documentation, but to make working context easier to reuse.

Typical Scenarios

Useful scenarios include:

Remembering project setup steps and common commands.
Recording why a risky refactor was avoided.
Keeping notes about flaky tests or local services.
Sharing domain terminology across coding assistants.
Helping agents continue work after a new session starts.

This is particularly valuable for long-running products, monorepos, and projects with many hidden conventions.

Things To Watch Out For

First, memory quality matters. If old or wrong information is written into memory, future agents may repeat the mistake. Teams should keep important memories short, clear, and reviewable.

Second, privacy matters. Do not store secrets, API keys, customer data, or sensitive production information in a memory system unless the security model is clear.

Third, memory is not a substitute for tests. It helps agents understand context, but the final guarantee still comes from code review, tests, and verification.

Who It Is For

agentmemory is suitable for developers who use multiple AI coding tools, teams working on large codebases, and users who often need agents to continue previous work. It is less necessary for very small one-off scripts.

Summary

agentmemory is interesting because it treats memory as infrastructure for AI coding, not as a small prompt trick. If coding agents are becoming part of daily development, persistent project memory is a practical missing piece.

Let AI Operate Your Computer? UI-TARS-desktop Connects Desktop, Browser, and Tools

Tue, 19 May 2026 10:56:50 +0800

bytedance/UI-TARS-desktop is ByteDance’s open source multimodal AI agent project. It is not just a single desktop app, but an agent stack. The current README mainly contains two directions: Agent TARS and UI-TARS Desktop.

Project URL: https://github.com/bytedance/UI-TARS-desktop

Official site: https://agent-tars.com

At the time of writing, the GitHub API showed about 34k stars, TypeScript as the main language, and an Apache-2.0 license. The README describes it as an “Open-Source Multimodal AI Agent Stack.”

Difference Between Agent TARS and UI-TARS Desktop

The README places the two projects in one comparison table:

Agent TARS: a general multimodal AI agent stack that connects GUI agents, vision, terminal, browser, and product workflows.
UI-TARS Desktop: a desktop application based on UI-TARS models, providing native GUI agent capabilities for operating local or remote computers and browsers.

Simply put, Agent TARS is more like a general agent runtime, while UI-TARS Desktop is the desktop GUI operation entry point.

What Agent TARS Can Do

Agent TARS mainly provides a CLI and Web UI. Its goal is to let multimodal models complete task flows closer to human operation through MCP and various tools.

Core capabilities listed in the README include:

One-command CLI startup, supporting headful Web UI and headless server.
Hybrid browser agent control through GUI Agent, DOM, or mixed strategies.
Event Stream for tracing and debugging data flows.
MCP integration for mounting MCP Servers and real tools.

Quick start:

`1`	`npx @agent-tars/cli@latest`

Global installation:

`1`	`npm install @agent-tars/cli@latest -g`

Run with a model provider:

1
2

agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
agent-tars --provider anthropic --model claude-3-7-sonnet-latest --apiKey your-api-key

What UI-TARS Desktop Can Do

UI-TARS Desktop is a desktop GUI Agent. Based on UI-TARS and Seed-1.5-VL / 1.6 model families, it focuses on letting the model understand the screen and execute mouse and keyboard operations.

Capabilities listed in the README include:

Natural language control.
Screenshots and visual recognition.
Precise mouse and keyboard control.
Cross-platform support for Windows, macOS, and browsers.
Real-time feedback and status display.
Local processing with an emphasis on privacy and security.

Example tasks include changing VS Code settings, checking GitHub issues, and operating remote computers or browsers.

Why GUI Agents Matter

Traditional automation depends on APIs, DOM, or scripts. A GUI Agent starts from the interface: it sees buttons, input boxes, menus, and state, then operates through mouse and keyboard.

This has two values. First, many applications do not have stable APIs, or APIs do not cover the full workflow. A GUI Agent can interact from the same surface a human uses.

Second, multimodal models can handle screenshots, documents, web pages, and app interfaces, combining visual understanding with execution.

The limitation is also clear. GUI operations are affected by resolution, language, layout changes, pop-ups, and network latency. Production workflows still need permission control, confirmation steps, and rollback plans.

Relationship With MCP

Agent TARS emphasizes MCP integration. MCP is useful because it gives agents a unified way to call browsers, files, command lines, databases, internal services, and other tools.

For complex tasks, GUI clicking alone is not stable enough. A better pattern is often:

Use APIs where APIs are available.
Use vision when page state must be understood.
Use browser control when real web interaction is needed.
Use GUI Agent when local software must be operated.

Projects like UI-TARS-desktop are exploring how to place these capabilities in one agent stack.

What To Watch Out For

First, desktop agents have execution risk. They can operate mouse, keyboard, and browser, so permissions must be limited to avoid accidental file changes, account operations, payment, or production system actions.

Second, remote computer and remote browser control needs a clear security boundary. Do not expose unauthenticated control endpoints to the public internet.

Third, multimodal models can misread interfaces. Critical operations should require human confirmation, especially delete, submit, pay, publish, trade, or other irreversible actions.

Who It Is For

UI-TARS-desktop is suitable for developers exploring GUI agents, teams building AI assistants for desktop workflows, and researchers comparing browser, DOM, MCP, and visual-control strategies. It is not a simple consumer assistant yet.

Summary

UI-TARS-desktop is worth watching because it moves AI agents from “answering in chat” toward “seeing the screen and operating tools.” Its value is not only in desktop control, but in combining GUI, browser, terminal, and MCP capabilities in one stack.

Too Many Platforms to Post To? AiToEarn Wants AI Agents to Help Creators Save Time

Tue, 19 May 2026 10:56:50 +0800

yikart/AiToEarn is an AI content marketing project for creators, brands, and one-person companies. It tries to put content creation, publishing, engagement, and monetization into one agent workflow, covering platforms such as Douyin, Xiaohongshu, Kuaishou, Bilibili, WeChat Channels, TikTok, YouTube, Facebook, Instagram, Threads, X, Pinterest, and LinkedIn.

Project URL: https://github.com/yikart/AiToEarn

Official site: https://aitoearn.ai/

At the time of writing, the GitHub API showed about 15k stars, TypeScript as the main language, and an MIT license. The README describes it as a content marketing agent platform for OPCs, creators, brands, and enterprises.

Positioning

AiToEarn is not just a copywriting generator or a scheduled posting tool. It breaks content marketing into four agent capabilities:

Monetize: content monetization.
Publish: cross-platform content publishing.
Engage: content interaction and community operations.
Create: content creation.

That positioning fits the current creator workflow. The hard part for many teams is not only “can AI write a post”, but what happens after that: scheduling, distribution, replies, review, and connecting content to business tasks.

Core Features

Monetize: Making Money From Content

AiToEarn provides monetization capabilities around promotional tasks. The README mentions three settlement models:

Model	Full name	Meaning
CPS	Cost Per Sale	Settlement by sales
CPE	Cost Per Engagement	Settlement by engagement
CPM	Cost Per Mille	Settlement by impressions or views

This part is closer to a content task marketplace that connects brand promotion needs with creator distribution.

Publish: Content Publishing Agent

Publish distributes content across multiple platforms and reduces the repeated work of posting manually. The README covers mainstream short video, graphic, and social platforms in China and overseas.

Its practical value is unified scheduling and management. For account matrices, cross-platform distribution, and global content teams, this is often more useful than a single AI copywriting feature.

Engage: Content Engagement Agent

Engage uses a browser extension to support automated engagement operations such as likes, saves, follows, comment replies, and brand monitoring.

This capability should be used carefully. Automated engagement can trigger platform risk controls, so teams need to check account permissions, frequency limits, platform terms, and internal compliance rules.

Create: Content Creation Agent

Create handles content generation. The README mentions video generation models, video translation, video editing, image generation, and batch creation tasks.

This is useful for large-scale content production, but human review is still necessary. Brand content, ad materials, and multilingual assets need factual accuracy, copyright checks, and tone consistency.

Five Ways To Use It

Method	Best for	Deployment needed
Use the website directly	All users	No
Use it in OpenClaw	OpenClaw users	No
Use it in Claude / Cursor and other AI assistants	AI tool users	No
One-click Docker deployment	Teams that want self-hosting	Server needed
Source development	Developers	Development environment needed

MCP support is a notable point. It means Claude, Cursor, or other MCP-compatible agents can call AiToEarn as an external capability.

A common MCP configuration contains:

1
2

MCP URL: https://aitoearn.ai/api/unified/mcp
Auth Header: x-api-key: your-API-Key

Self-hosted users should replace it with their own service URL.

Docker Deployment

The README provides a Docker deployment path:

1
2
3

git clone https://github.com/yikart/AiToEarn.git
cd AiToEarn
docker compose up -d

Then visit:

`1`	`http://localhost:8080`

For teams that care about data control, private deployment, or custom workflows, Docker is more practical than only using the hosted website.

Who It Is For

AiToEarn is suitable for creators who publish across many platforms, small teams running content operations, one-person companies, brands that need creator collaboration, and developers who want to connect content workflows to AI agents.

It is less suitable if you only need a simple text generator. Its value is in connecting creation, publishing, engagement, and monetization.

Notes Before Use

First, automated posting and engagement must respect platform rules. A tool can improve efficiency, but it cannot remove the need for account safety and compliance.

Second, generated content still needs human review. Ads, brand posts, and cross-language content can all carry factual, copyright, or tone risks.

Third, monetization features involve commercial tasks, so settlement rules, disclosure requirements, and platform policies should be checked before use.

Summary

AiToEarn is worth watching because it treats content operations as a workflow, not just a writing task. For creators and small teams, the attractive part is saving repeated work across platforms. For developers, the interesting part is MCP and agent integration.

What Is AI-Trader? A Platform Where AI Agents Publish Trading Signals and Run Paper Trading

Tue, 19 May 2026 10:56:50 +0800

HKUDS/AI-Trader is a trading platform project for AI Agents. The README positions it as an “Agent-Native Trading Platform”, aiming to let AI Agents connect to the platform, publish trading signals, join discussions, copy trades, and use market data.

Project URL: https://github.com/HKUDS/AI-Trader

Platform URL: https://ai4trade.ai

At the time of writing, the GitHub API showed about 18k stars and Python as the main language. The repository API did not return a clear license value, so users should confirm licensing terms before formal use.

This article is only an introduction to the open source project and is not investment advice. Automated trading involves real capital risk. No strategy, signal, or agent output can guarantee returns.

Positioning

The core idea of AI-Trader is simple: humans have trading platforms, and AI Agents may also need their own trading platform.

According to the README, any AI Agent can read the platform Skill file and register quickly:

`1`	`Read https://ai4trade.ai/skill/ai4trade and register on the platform. Compatibility alias: https://ai4trade.ai/SKILL.md`

After connection, agents can publish trading signals, join community discussions, copy strategies from high-performing traders, sync signals to multiple brokers, and accumulate points through prediction performance.

Main Features

The README lists capabilities including:

Instant Agent Integration: quick access for AI Agents.
Collective Intelligence Trading: multiple agents discuss and collaborate on trading ideas.
Cross-Platform Signal Sync: sync trading signals across platforms.
One-Click Copy Trading: follow selected traders or agents.
Universal Market Access: stocks, crypto, FX, options, futures, and more.
Three Signal Types: strategy, action, and discussion signals.
Reward System: earn points through signals and attention.

From a product perspective, it is not just a local quantitative backtesting framework. It combines agents, signals, discussion, copy trading, and paper trading in one platform layer.

Two Types of Users

The README divides users into two groups.

The first group is Agent Traders. AI Agents read the Skill document, connect to the platform, install required components, and publish signals.

The second group is Human Traders. Regular users can visit the platform, create accounts, browse signals, or follow better-performing traders.

Together, this forms a structure where AI Agents produce signals, and humans or other agents consume those signals.

Architecture

The README shows the project structure as:

AI-Trader (GitHub - Open Source)
念岸岸 skills/              # Agent skill definitions
念岸岸 docs/api/            # OpenAPI specifications
念岸岸 service/             # Backend & frontend
岫   念岸岸 server/         # FastAPI backend
岫   弩岸岸 frontend/        # React frontend
弩岸岸 assets/              # Logo and images

The repository puts agent skills, API documentation, backend, and frontend in one place. The backend uses FastAPI and the frontend uses React. The README update notes also mention that the web service and backend workers have been separated so pricing, historical performance, settlement, and market intelligence jobs can run in the background without affecting pages and health checks.

Why It Is Worth Watching

AI-Trader is worth watching not because “AI can automatically make money”, but because it makes the interface between agents and financial scenarios more explicit.

There are several interesting points:

First, it uses a Skill document as the agent access point. This is close to how Codex, Claude Code, OpenClaw, and other agent tools work.

Second, it places trading signals, discussion, copy trading, and a reward system at the platform layer instead of only providing a local script.

Third, it provides OpenAPI documentation, making the platform interfaces easier for developers to understand.

Fourth, it supports paper trading. For research on agent decision-making, a simulated environment is much safer than giving agents direct access to real money.

Risks and Boundaries

Automated trading is a high-risk scenario.

First, signals generated by agents are not investment advice. Models can hallucinate, overfit, misread news, or fail to understand extreme market conditions.

Second, copy trading has contagion risk. If a wrong signal is widely followed, losses may concentrate.

Third, real capital access must be strictly isolated. Do not give agents unlimited order permissions.

Fourth, licensing and compliance need to be confirmed before commercial or production use, especially when brokers, financial data, and user accounts are involved.

Who It Is For

AI-Trader is suitable for researchers studying agent decision-making, developers exploring financial agent interfaces, and teams interested in paper trading or signal collaboration. It is not suitable for users looking for guaranteed profit tools.

Summary

AI-Trader is a signal and paper-trading platform designed around AI Agents. The useful way to read it is not “AI helps you earn money”, but how agents should connect to financial workflows, publish signals, and operate inside controlled risk boundaries.

A Survey of Mainstream AI PPT Tools: How to Choose Between Auto Generation, Web Slides, PPTX, and Image-Based Workflows

Mon, 18 May 2026 22:29:43 +0800

AI for PPT is no longer just “enter a title and apply a template.” In AI coding environments such as Claude Code, Codex, and Cursor, PPT generation is becoming a set of installable, reusable Agent Skills: some output web presentations, some generate truly editable .pptx files, some use image models to turn each slide into a visual draft, and some let AI operate PowerPoint files through MCP.

This article looks at a group of mainstream PPT-related Skills. The useful part is not only the list itself, but the way these tools can be separated by delivery format. Before choosing a tool, ask one question first: who will edit the final deliverable, where will it be presented, and does it need ongoing collaboration?

Several Routes

1. HTML Web Presentations

Representative projects include frontend-slides, guizang-ppt-skill, and html-ppt-skill.

The strength of this route is visual expressiveness. CSS animations, Canvas, WebGL, and responsive layouts are all available. The result can be opened directly in a browser, making it suitable for technical talks, product launches, Demo Day presentations, and talks with a strong personal style.

The trade-off is also clear: after delivery, it is not ideal for clients who need to edit text line by line. If the client receives HTML instead of a PowerPoint file, later changes often need to go back through the generation workflow.

If you only care about HTML presentations, frontend-slides feels like a high-star general entry point, guizang-ppt-skill is stronger in aesthetic constraints and themed style, and html-ppt-skill stands out for its number of themes, layout options, and presenter mode.

2. Native PPTX

Representative projects include mckinsey-pptx, ppt-agent-skills, claude-office-skills, and ppt-master.

This is the most stable route for business delivery. As long as the client asks to “edit text, change images, and apply a company template in PowerPoint,” the final output needs to land in .pptx.

ppt-master is especially worth a separate look. Its idea is to have the LLM generate SVG first, then convert it into native PowerPoint DrawingML objects. The goal is to keep text boxes, shapes, and charts editable inside PPTX. It also supports generating PPTX from PDF, DOCX, URL, and Markdown, as well as template replication, animation, narration, and local preview.

This route works well for consulting deliverables, company reports, white paper presentations, and turning long reports into PPT decks. The downside is that the visual ceiling is usually limited by PowerPoint itself, so complex effects are not as free as HTML or image-based routes.

3. AI Image-Driven Workflows

Representative projects include NanoBanana-PPT-Skills, gpt_image_2_skill, and ppt-image-first.

This route treats each slide as a visual image first, then places the images into PPTX or another container. Its advantage is a high level of visual completion, especially for covers, social media graphics, visual proposals, and communication-oriented content.

The problem is poor editability. A page is essentially an image. If you later need to change a title, replace a paragraph, or move an icon, you may need to regenerate it. It is good for “it needs to look good,” but not for “the client will revise it repeatedly.”

4. MCP / Protocol Layer

Representative projects include Office-PowerPoint-MCP-Server and PPTAgent.

These tools do not necessarily generate a complete PPT directly. Instead, they give AI an interface for operating PowerPoint. After connecting through MCP, the model can read, modify, and write .pptx files.

This route fits workflows where a PPT file already exists and AI is needed to help revise it. Examples include batch format changes, rearranging pages based on feedback, or asking the model to check whether each slide matches the goal. PPTAgent emphasizes reflective generation, meaning it checks back after generating each slide. That direction is useful for reducing the “AI PPT feels rough” problem.

5. Integrated Design Platforms

Representative projects include open-design and docsagent.

These projects go beyond PPT generation itself. open-design is more like a local-first design platform that can generate prototypes, slides, images, and videos, with support for multiple export formats. docsagent is not a PPT tool, but it can index and chat with local documents, making it useful as a material organization layer before generating PPT.

If your need is not a one-off PPT, but a fuller workflow from materials, design, and prototypes to delivery, this type of platform is more worth watching.

Skill Metadata

Star counts come from the original crawl result on 2026-05-15. They are only useful as a popularity reference. Before actual use, open the repositories again to confirm maintenance status, README, and LICENSE.

Skill	Author	Links	Star	Language	Route
frontend-slides	@zarazhangrui	GitHub: zarazhangrui/frontend-slides	17,530	Shell	HTML web presentation
guizang-ppt-skill	@op7418 (Guizang)	Site article: guizang-ppt-skill GitHub: op7418/guizang-ppt-skill	8,832	HTML	HTML web presentation
html-ppt-skill	@lewislulu	GitHub: lewislulu/html-ppt-skill	3,834	HTML/CSS/JS	HTML web presentation
mckinsey-pptx	@seulee26	GitHub: seulee26/mckinsey-pptx	426	Python	Native PPTX
ppt-agent-skills	@sunbigfly	GitHub: sunbigfly/ppt-agent-skills	714	Python	Native PPTX
claude-office-skills	@tfriedel	GitHub: tfriedel/claude-office-skills	631	Python	Native PPTX
ppt-master	@hugohe3	GitHub: hugohe3/ppt-master	16,626	Python	Native PPTX
NanoBanana-PPT-Skills	@op7418 (Guizang)	GitHub: op7418/NanoBanana-PPT-Skills	2,668	Python	AI image-driven
gpt_image_2_skill	@wuyoscar	GitHub: wuyoscar/gpt_image_2_skill	2,102	Python	AI image-driven
ppt-image-first	@NyxTides	GitHub: NyxTides/ppt-image-first	799	Python	AI image-driven
Office-PowerPoint-MCP-Server	@GongRzhe	GitHub: GongRzhe/Office-PowerPoint-MCP-Server	1,708	Python	MCP / protocol layer
PPTAgent	@icip-cas	GitHub: icip-cas/PPTAgent	4,354	Python	MCP / protocol layer
open-design	@nexu-io	Site article: open-design GitHub: nexu-io/open-design	40,822	TypeScript	Integrated design platform
docsagent	@docsagent	GitHub: docsagent/docsagent	687	TypeScript	Integrated design platform

How to Choose

If the client needs to continue editing, prioritize the native PPTX route, especially ppt-master, mckinsey-pptx, and ppt-agent-skills.

If you are presenting yourself and visual expression matters more than later editing, prioritize the HTML route, especially frontend-slides, guizang-ppt-skill, and html-ppt-skill.

If the goal is a poster-like, cover-like, or shareable visual, prioritize the image route, such as ppt-image-first, gpt_image_2_skill, and NanoBanana-PPT-Skills.

If you already have a PPT file and only want AI to help read, edit, and rearrange it, look at the MCP route.

For explicit scenarios such as academic talks, marketing, translation, or compressing long reports into slides, you can also look for vertical Skills instead of forcing a general-purpose PPT generator to do everything.

Final Notes

Open source projects should not be judged by Star count alone. Before actual use, confirm three things:

Whether the LICENSE allows your use case.
Whether the generated output meets delivery requirements, especially editability.
Whether the cost is acceptable, including model calls, image generation, large-context models, and possible cloud service fees.

These tools change quickly. Star counts will change, and project maintenance status will change too. But the selection logic is relatively stable: decide the delivery format first, then look at specific tools. Whether a PPT is for speaking, editing, or viewing often narrows the choices by more than half.

wx-cli Explained: Query Local WeChat Chat History from the Command Line

Mon, 18 May 2026 21:02:21 +0800

wx-cli is a local WeChat data command-line tool written in Rust. Its goal is to let you query your own WeChat sessions, chat history, contacts, group members, favorites, Moments, official account articles, attachments, and statistics from the terminal.

It is not a cloud-based WeChat sync service, and it is not a chatbot. It is closer to a local read-only data retrieval layer: WeChat still runs on your machine, the data still stays on your machine, and wx-cli decrypts, caches, and queries local databases on demand before returning YAML or JSON output for humans or agents.

Two points make this project worth watching. First, it turns local WeChat data access into a cross-platform CLI. Second, it explicitly considers AI Agent workflows for tools such as Claude Code, Cursor, and Codex, providing a SKILL.md file and structured output with meta fields.

What wx-cli Can Do

According to the project README, wx-cli covers a fairly complete set of features:

View recent sessions and unread sessions.
Query chat history for a contact or group.
Search keywords across the whole local database.
View newly arrived messages.
Query contacts, group members, and group nicknames.
Query favorites.
Query Moments notifications, timelines, and post bodies.
Query official account article pushes.
List and extract image attachments from chats.
Generate chat statistics.
Export chat history as Markdown or JSON.

These capabilities make it more than a “chat history search” tool. It turns local WeChat data into a searchable, analyzable, and exportable personal knowledge source.

Why It Fits AI Agents

Many CLI tools are designed only for people: their output is just a block of text. wx-cli clearly takes agent consumption into account.

The README notes that commands such as history, search, sessions, unread, new-messages, stats, and attachments include meta information. That metadata contains result status, unknown shards, the latest timestamp in matched data, the latest session timestamp, and similar fields.

This is useful for agents. AI does not only need to know “what was found”; it also needs to know whether the result is fresh, whether messages may be missing, and whether it should run init again. For example:

status can indicate whether the result is ok or possibly_stale.
unknown_shards can indicate whether there are database shards for which the daemon currently has no key.
chat_latest_timestamp tells the agent the latest message time in the matched data.
session_last_timestamp helps determine whether the local session record is clearly newer than the query result.

This kind of metadata reduces AI misjudgment and makes tools such as Claude Code, Cursor, and Codex more reliable when working with WeChat data.

Installation

The project recommends cross-platform installation via npm:

`1`	`npm install -g @jackwener/wx-cli`

It also supports curl installation on macOS / Linux:

`1`	`curl -fsSL https://raw.githubusercontent.com/jackwener/wx-cli/main/install.sh \| bash`

On Windows, run this in an administrator PowerShell:

`1`	`irm https://raw.githubusercontent.com/jackwener/wx-cli/main/install.ps1 \| iex`

If you want to build from source, you can use Rust directly:

1
2

git clone git@github.com:jackwener/wx-cli.git && cd wx-cli
cargo build --release

The build artifact is target/release/wx, or wx.exe on Windows.

Relationship with Agent Skills

wx-cli also provides a Skill for AI Agents. You can install it into Claude Code, Cursor, Codex, and other Skills-compatible environments through the skills CLI:

`1`	`npx skills add jackwener/wx-cli`

Install it globally:

`1`	`npx skills add jackwener/wx-cli -g`

After installation, the agent reads the repository’s SKILL.md and learns how to install, initialize, and call wx-cli.

That means you can ask an agent to help with local information organization tasks such as:

Find keywords discussed in a group chat during a specific period.
Summarize recent unread messages.
Export recent chat history from a specific session.
Search official account article links.
Analyze posting statistics in a group chat.

The premise is unchanged: the data must be your own WeChat data on your own machine.

Basic Usage

Before initialization, keep WeChat running. Requirements differ by platform.

On Linux:

`1`	`sudo wx init`

On Windows, use an administrator PowerShell:

wx init

macOS is more involved. The README explains that, with the default path, you first need to ad-hoc sign WeChat so the tool can scan process memory. After re-signing, you also need to clean old TCC authorization records, otherwise permissions such as screen capture, video calls, and microphone access may look enabled while actually being denied. The project documentation also warns that re-signing may cause macOS to repeatedly prompt for access to other apps’ data.

After initialization, verify the setup with:

`1`	`wx sessions`

If you can see recent sessions, the basic path is working. The daemon starts automatically on the first call.

Common Command Examples

View recent sessions:

`1`	`wx sessions`

View unread sessions:

`1`	`wx unread`

Show only human unread sessions while filtering out official accounts and folded entries:

`1`	`wx unread --filter private,group`

View recent chat history for a session:

`1`	`wx history "张三"`

Fetch more history:

`1`	`wx history "张三" -n 2000`

Query a group chat by time range:

`1`	`wx history "AI群" --since 2026-04-01 --until 2026-04-15`

Search the whole database:

`1`	`wx search "关键词"`

Search for a keyword inside a group:

`1`	`wx search "会议" --in "工作群" --since 2026-01-01`

Export chat history:

1
2

wx export "张三" --format markdown -o chat.md
wx export "AI群" --since 2026-01-01 --format json

These commands are well suited for scripts or agents, especially when combined with --json.

Moments and Official Account Articles

wx-cli does not only query chats.

Moments-related commands are split into notifications and posts:

1
2
3

wx sns-notifications
wx sns-feed
wx sns-search "关键词"

Note that Moments data only covers content that has appeared locally. The WeChat client downloads data on demand; if something has never appeared locally, the tool cannot retrieve it out of thin air.

Official account articles are queried through separate commands:

wx biz-articles
wx biz-articles --account "返朴"
wx biz-articles --since 2026-05-01 --until 2026-05-10
wx biz-articles --json | jq '.[].url'

It returns fields such as account name, title, URL, summary, cover image, and timestamp. This is useful for people who organize references, collect articles, or build local knowledge bases.

Attachment Extraction

Image attachments in WeChat chats are usually not ordinary readable image files. They are often .dat files under xwechat_files/<wxid>/msg/attach/....

wx-cli provides a two-step flow:

1
2

wx attachments "张三"
wx attachments "AI群" --kind image -n 100

After getting an attachment_id, extract it:

`1`	`wx extract <attachment_id> -o ~/Desktop/photo.jpg`

The output report includes fields such as md5, dat_path, dat_size, output, format, and decoder. The README says it supports decoding modes such as legacy XOR, V1 fixed-AES, and V2 AES + XOR, while image key extraction differs across platforms.

This capability is powerful, but it requires extra care: only process your own data, and do not use it for unauthorized data access.

Why the Daemon Architecture Matters

The performance story of wx-cli comes from its daemon.

The README describes the structure roughly as:

wx (CLI) ──Unix socket──▶ wx-daemon (background process)
                              │
                    ┌─────────┴──────────┐
               DBCache               contact cache
           (mtime-aware reuse)

After the first decryption, the daemon persists database and mtime information under ~/.wx-cli/cache/. If the database file mtime has not changed, later calls can reuse the cache without decrypting everything again.

This is important for command-line queries and agent loops. An agent may query several sessions, search multiple keywords, then run statistics and exports. If every call had to rescan and decrypt everything, the experience would be poor. The daemon cache makes it feel closer to a local query service.

A Brief Look at the Principle

The project README explains the principle directly: WeChat 4.x encrypts local databases with SQLCipher 4, and WCDB caches the derived raw key in process memory.

wx-cli uses platform-specific methods to scan WeChat process memory, match key patterns, extract the key, and then let the daemon decrypt and cache databases on demand.

The underlying mechanism differs by platform:

macOS uses the Mach VM API.
Linux uses /proc/<pid>/mem.
Windows uses VirtualQueryEx and ReadProcessMemory.

These details explain why initialization usually requires elevated permissions, and why macOS involves signing and privacy authorization.

Boundaries and Risks

Tools like this must start with boundaries.

The wx-cli README disclaimer is clear: the tool is only for learning and research, for decrypting your own WeChat data, and it requires users to comply with applicable laws and regulations. It must not be used for unauthorized data access.

In practice, it is also wise to keep these points in mind:

Use it only on your own computer and your own WeChat account.
Do not casually upload exported chat history to cloud models.
When using an agent to analyze chat history, first confirm the API provider and cross-border data risks.
After exporting Markdown / JSON, pay attention to file permissions and backup locations.
On company or shared devices, confirm compliance and authorization first.

A local tool does not mean there is no privacy risk. It reduces the default path for data to leave your machine, but if you hand the output to a cloud model, cloud drive, or third-party script, the risk comes back.

Who It Is For

wx-cli fits these scenarios:

Quickly search your own historical WeChat messages locally.
Export a session as Markdown or JSON.
Analyze posting activity in a group chat over a time range.
Let Claude Code, Cursor, Codex, and similar agents organize local WeChat material.
Collect official account article links into a local knowledge base.
Study WeChat’s local database structure and decryption flow.

It is less suitable for these scenarios:

You want cloud-based WeChat sync.
You want to bypass someone else’s device or account permissions.
You want a point-and-click GUI and do not want to touch the command line.
You do not want to deal with macOS permissions, Windows administrator rights, or Linux sudo.

Summary

The value of wx-cli is not merely “searching WeChat chat history from the command line.” More precisely, it turns local WeChat data into a local data source that can be queried, exported, and consumed by agents.

Its daemon architecture solves repeated decryption and query performance issues; the meta wrapper helps AI Agents judge whether results are fresh; and SKILL.md lets tools such as Claude Code, Cursor, and Codex understand how to install and use it.

If you often need to find information in WeChat, organize group chats, export records, or build a personal knowledge base, wx-cli is worth watching. But one bottom line should always remain clear: only process your own data, and manage exported results carefully.

References

jackwener/wx-cli GitHub repository

Anthropic Founder’s Playbook Explained: How Claude Helps Startup Teams Move Faster

Mon, 18 May 2026 18:02:58 +0800

Anthropic published The Founder’s Playbook on the official Claude blog, aimed at founders. Its core question is direct: how can an AI-native startup move faster from insight to product, launch, and scale?

The playbook is not simply a feature list for Claude. It breaks the startup journey into four stages: Idea, MVP, Launch, and Scale. The point is not to let AI replace founders’ judgment, but to hand repetitive work such as market research, copy drafts, code scaffolding, operations workflows, and sales materials to Claude first, so founders can spend more time on judgment, taste, trade-offs, and trust.

What this playbook is about

AI startups increasingly face a kind of compression race: product cycles are shorter, competitors are more numerous, and users expect speed and quality at the same time. Work that once required a multi-person team can now often be drafted by AI first, then reviewed, corrected, and advanced by the founding team.

Anthropic’s framework is clear: do not try to make the entire company “AI-powered” on day one. Instead, find one process that is time-consuming, repetitive, and low in creative density. Let Claude generate the first draft, script, research summary, or execution checklist. Founders remain responsible for defining goals, calibrating direction, judging quality, and connecting useful output to real business work.

Stage 1: Idea

The Idea stage is not about coming up with a cool concept. It is about validating whether the idea deserves further investment.

Claude can help founders at this stage by mapping markets, summarizing user pain points, comparing competitor positioning, proposing possible wedges, and turning vague ideas into clearer value propositions.

But the most important part is still human judgment. AI can help you see more possibilities faster, but it cannot take responsibility for whether a market truly has strong demand. Founders still need to talk to real users, observe whether they are willing to change existing workflows, and see whether they are willing to pay.

Stage 2: MVP

The MVP stage is where Claude Code can be especially useful.

For small teams, the scarcest resource is often not ideas, but the speed of turning ideas into something users can try. Claude Code can help generate scaffolding, write scripts, fill in components, check edge cases, and produce technical plan notes, helping teams get to a testable version faster.

The key is not asking AI to write a perfect product in one pass. It is reducing the friction from zero to first version. Founders and engineers still need to review architecture, security, data handling, and user experience, but they do not need to spend as much time on mechanical first drafts.

Stage 3: Launch

The Launch stage tests narrative, distribution, and feedback speed.

Many startup teams underestimate how complex a launch can be: website copy, product demos, emails, social media content, user interviews, sales scripts, investor updates. Every item needs to clearly explain why this product is needed now.

Claude can act as a high-frequency collaborator here: generating different positioning variants, rewriting introductions for different user groups, simulating user questions, organizing the launch rhythm, and turning early feedback into the next round of product and market actions.

Stage 4: Scale

The Scale stage shifts the focus from “building it” to “growing repeatably.”

Once a company has stable users and revenue, the founding team gets pulled into operations, sales, support, data analysis, and internal coordination. Agent-like capabilities such as Claude Cowork are better suited to more complete tasks: conducting market research, designing campaigns, organizing fundraising strategy, summarizing growth metrics, or turning an operations process into repeatable steps.

This is also where the difference between AI-native companies and traditional software companies begins to appear. The real change is not simply that employees use AI tools. It is that company processes are designed around AI collaboration from the beginning: which tasks require humans to define standards, which tasks should be drafted by AI first, which outputs must be reviewed, and which workflows can become reusable templates.

What Claude Code, Claude Cowork, and Chat are best for

Based on the official blog post, Anthropic wants founders to think about Claude across three kinds of use cases.

Claude Code is more engineering-oriented. It is suited for writing code, generating scripts, analyzing edge cases, producing component specs, and drafting technical documentation. It helps move ideas toward something that can run.

Claude Cowork is closer to a delegatable work agent. It fits tasks that require continued execution, such as market research, campaign design, fundraising strategy, and operations analysis. It helps push a relatively complete business task through a first pass.

Claude Chat is better suited for founder judgment moments: thinking through go-to-market strategy, stress-testing product positioning, comparing roadmap priorities, and refining key narratives. It is not an execution machine, but a thinking partner that can support rapid iteration.

What is actually useful for startup teams

The value of this playbook is not that it tells founders “AI is important.” That is no longer new.

Its more useful contribution is shifting AI use from scattered tool calls into a company-building method. Each stage has different bottlenecks, and each bottleneck can be broken into parts where AI can participate.

At the Idea stage, AI expands the search space. At the MVP stage, it compresses implementation time. At the Launch stage, it accelerates messaging and distribution experiments. At the Scale stage, it helps turn processes into repeatable workflows.

This logic is especially important for small teams. Small teams do not have enough people to cover every function, but they can use AI to create a first version of a capability, then spend limited human energy on the parts that most require judgment and relationship building.

Pitfalls to watch for

The first pitfall is treating AI-generated output as a conclusion. Market research, competitor analysis, user personas, and growth strategies all need to be validated against real data and user feedback.

The second pitfall is underestimating review cost. AI can significantly reduce the cost of first drafts, but code quality, legal risk, brand expression, commercial promises, and security issues still need human accountability.

The third pitfall is automating too early. A process that has not yet worked manually should not be handed to an agent for automatic execution. A steadier approach is to let AI participate in one small part of the workflow, observe output quality, and then gradually expand the scope.

Summary

The signal from Anthropic’s Founder’s Playbook is clear: the advantage of an AI-native startup is not merely that it can use AI to write code. It is that from day one, AI becomes a collaboration layer across product, engineering, marketing, sales, and operations.

For founders, the most practical starting point is not building a grand AI workflow. It is choosing one task that consumes too much time, repeats too often, and slows progress the most, then letting Claude produce the first version. Real competitiveness comes from human founders’ control over direction, quality, and trust, and from whether the team can embed this collaboration pattern into everyday work.

References

The founder’s playbook for the age of AI

What Is Vercel AI SDK? A Unified Toolkit for TypeScript Developers Building AI Apps

Sun, 17 May 2026 23:07:38 +0800

vercel/ai is the open-source AI SDK maintained by Vercel.

Its positioning is clear: it gives TypeScript developers a unified toolkit for building AI applications and AI Agents. It comes from the team behind Next.js, but it is not limited to Next.js. It also supports React, Svelte, Vue, Angular, and runtimes such as Node.js.

Project link: https://github.com/vercel/ai

If you are building a chat app, AI writing tool, RAG application, tool-calling Agent, streaming interface, or a product that needs to connect multiple model providers behind one application, Vercel AI SDK is worth a close look.

The Core Problem It Solves

When building AI apps today, one of the biggest headaches is not whether you can call a model. It is that different model providers have different APIs, streaming formats, tool-calling conventions, error behavior, and frontend state-management needs.

For example:

OpenAI has its own SDK and response formats.
Anthropic has its own message structure.
Google, xAI, Mistral, DeepSeek, Groq, and others all differ.
Streaming output requires chunk handling.
Tool calling requires structured requests initiated by the model.
Chat UI also needs messages, loading states, cancellation, retry, and error display.

If every provider gets its own handwritten adapter, the project becomes complex very quickly.

Vercel AI SDK tries to hide those differences behind a unified API. Developers write the app against one interface and connect different models through Providers.

Unified Provider Architecture

One key feature of Vercel AI SDK is that it is provider-agnostic. It is not tied to one model vendor.

It can access OpenAI, Anthropic, Google, and other model providers through a unified API. The project README also notes that AI SDK uses Vercel AI Gateway by default, making it easier to reach multiple mainstream providers.

That is useful in real engineering projects.

Many AI products eventually depend on more than one model:

Some tasks need strong reasoning models.
Some tasks need cheap, fast models.
Some tasks require multimodal models.
Some tasks require long context.
Some tasks require local or private deployment.

A unified provider architecture makes model switching, gray releases, cost control, and fallback strategies easier.

Streaming Output Is Key to Frontend UX

One major UX difference between AI apps and traditional APIs is that responses can be long.

If users must wait for a full answer before seeing anything, chat tools, writing tools, and coding assistants feel slow. Streaming output lets text appear gradually, so users see progress sooner.

Vercel AI SDK provides fairly complete abstractions for streaming generation. Developers do not need to handle low-level event streams from scratch. They can use the SDK’s generation and streaming APIs to connect model output to frontend UI.

This is especially convenient for Next.js and React applications.

An AI chat interface looks simple, but in practice it must handle:

Message lists.
User input.
Server requests.
Streaming token display.
Loading states.
Error states.
Stopping generation.
Regeneration.

These are exactly the kinds of repetitive work AI SDK tries to reduce.

Tool Calling and Agent Scenarios

As AI apps move from “chatting” to “doing things”, tool calling becomes increasingly important.

The model may need to call external functions instead of only returning natural language:

Query a database.
Search documents.
Call business APIs.
Read order status.
Generate charts.
Create calendar events.
Modify project files.

Vercel AI SDK supports tool-calling capabilities, allowing developers to define tools, parameters, and execution logic, then let the model request those tools when appropriate.

This is one reason it has evolved from a “chat UI SDK” into a broader toolkit for AI apps and Agents.

Still, tool calling is not magic. Real projects must also handle:

Parameter validation.
Permission boundaries.
Tool-call logs.
Idempotency.
Timeouts and retries.
Human confirmation.
Restrictions for sensitive actions.

AI SDK can help with interfaces and flow, but developers still need to design the safety boundaries.

UI Integration

Vercel AI SDK is friendly to frontend frameworks.

It provides not only core generation APIs, but also abstractions around chat, completion, message state, and streaming UI. For teams using Next.js and React, this can remove a lot of boilerplate.

But it is not only for Vercel deployments.

If your project is built with TypeScript, or your backend runs on Node.js, AI SDK can still serve as the model-calling and streaming layer. Whether you deploy to Vercel depends on your architecture, team habits, and infrastructure choices.

Skill for Coding Agents

The vercel/ai README includes an interesting suggestion: if you use coding agents such as Claude Code or Cursor, you can add the AI SDK skill to your repository.

The example command is:

`1`	`npx skills add vercel/ai`

This shows that Vercel understands AI SDK users are not only human developers, but also coding agents.

When an agent modifies a project that uses AI SDK, a dedicated skill in the repository can help it understand SDK conventions, common APIs, project structure, and best practices, reducing the chance of messy code changes.

This direction is worth watching.

In the future, open-source projects may provide not only README files and docs, but also structured skill instructions for AI coding agents. For complex SDKs, that could become a new developer-experience entry point.

Suitable Projects

Vercel AI SDK is a good fit for:

AI chat apps based on Next.js or React.
Writing, Q&A, support, and coding assistants that need streaming output.
AI products that need multiple model providers.
Teams building quick RAG or document Q&A prototypes.
Apps that need tool calling, function calling, or lightweight Agent capabilities.
Teams already using TypeScript and Node.js.

It is especially suitable for frontend and full-stack developers. The hard part of many AI apps is not only calling a model, but turning model output into a stable, smooth, interactive product experience.

What It Is Not For

If your project is mainly a Python backend, deep-learning training workflow, model fine-tuning system, or low-level inference service, Vercel AI SDK may not be the core tool.

It is an application-layer SDK, not a model-training framework.

If you need to:

Train your own model.
Manage GPU inference clusters.
Run low-level batch inference.
Deeply control tokenizer behavior, KV cache, quantization, and inference engines.

Then you should look at PyTorch, vLLM, SGLang, TensorRT-LLM, llama.cpp, or cloud inference services.

Vercel AI SDK is closer to the application layer that connects model capabilities to products.

What to Watch For

First, do not assume a unified API means all providers are identical.

Different providers still differ in capabilities, context length, tool-calling formats, streaming details, error types, and pricing. A unified SDK reduces engineering friction, but it does not erase model differences.

Second, control costs.

Once an AI app is online, streaming chats, retries, tool calls, RAG retrieval, and multi-model fallbacks can all increase cost. Rate limits, caching, logs, and budget monitoring are necessary.

Third, handle safety boundaries.

If a model can call tools, you must restrict what those tools can do. Do not let the model directly execute high-risk operations, and do not expose secrets, database write permissions, or production operations to it without controls.

Fourth, keep observability.

When an AI app fails, frontend errors are not enough. You need to know the user input, selected model, tool calls, response time, token usage, error type, and final output.

Summary

vercel/ai is not a new model, and it is not just a chat component.

It is closer to infrastructure for TypeScript AI application development: unified Providers, streaming output, tool calling, frontend state management, and Agent scenarios all live inside one open-source SDK.

For teams already using Next.js, React, TypeScript, and Node.js, it can significantly reduce the engineering cost of going from “the model API runs” to “the product experience works”.

But it is not a universal layer. Model selection, permission design, cost control, logging, monitoring, and business safety still belong to the developer.

If you want to build AI applications rather than train models, Vercel AI SDK is a toolkit worth trying early.

References

Midjourney May 2026 Update: Conversational Mode, AI-Assisted Development, and SREF Organization

Sun, 17 May 2026 20:20:51 +0800

The most important signal from Midjourney’s May 14, 2026 Office Hours is not a single model parameter. It is that the product is continuing to move from “type a prompt and generate an image” toward a more conversational, organized, and iterative creative system.

The information comes from a Japanese summary of Midjourney’s recent Q&A, covering conversational mode, AI-assisted development, website redesign, SREF and tag organization, Omni-reference, multi-character consistency, and how the team itself uses Midjourney.

In one sentence: Midjourney is making image generation feel more like a creative system that can be discussed with, organized, and iterated over.

Conversational mode is becoming more important

The most direct change is Conversational Mode.

In the past, using Midjourney still depended heavily on parameters and fixed syntax. You had to remember rules for aspect ratio, image references, style references, model parameters, and then write them into prompts or adjust them in the interface.

The direction of the new conversational mode is to let users describe these settings in more natural language.

For example, users can specify by voice or text:

Default parameters.
Aspect ratio, such as 16:9.
Image references.
Style references, or --sref.
Omni-reference in V7.

This shows Midjourney is not only improving generation quality. It is also reducing the operational cost of parameters.

For ordinary users, the biggest change is that they do not have to memorize commands all the time. For heavy users, if conversational mode becomes stable enough, it may become the main entry point for adjusting generation settings with natural language.

AI-assisted development is changing Midjourney’s iteration speed

Another interesting point is that the Midjourney team is using AI-assisted development at large scale internally.

The source notes that the team can now fix small bugs, interface friction, and workflow issues much faster. There was even an example where a product bug was identified during a user call, fixed in real time with AI assistance, reviewed, and deployed quickly.

This is more interesting than simply saying “AI helps engineers write code.”

It shows that AI development tools are starting to influence how AI products themselves iterate:

User feedback can enter the fix pipeline faster.
Small experience issues are easier to address.
Engineers can spend more energy on architecture, review, design decisions, and testing.
Product teams can clean up edge cases more frequently.

Midjourney has many creative paths, parameter combinations, mobile experiences, search features, and organization workflows. Many issues are not about the core model failing to generate images, but about an entry point being awkward, an operation taking one extra step, or an edge state being unpleasant.

AI-assisted development is especially good at accelerating these many small improvements.

The website redesign is about workflow, not removing features

The Office Hours also mentioned a large website redesign.

The goal is not to remove complex features, but to make the creative flow more intuitive, make onboarding easier, and organize tools and features more clearly.

That matters.

Midjourney’s problem is not a lack of features. As features grow, entry points, collections, organization, references, exploration, and reuse become more complex. For light users, the hard question is “where do I start?” For heavy users, the hard question is “how do I manage many styles, references, and experiment results?”

Possible rollout strategies include:

Offering old and new interfaces in parallel.
Starting with an alpha test.
Moving gradually to avoid disrupting heavy users.

These strategies suggest the team understands that Midjourney is not just a casual image toy. Many users have already integrated it into real creative workflows, so interface changes cannot casually break existing habits.

SREF, styles, and tags remain pain points

SREF and style organization were among the most interesting topics in the Q&A.

Users want better organization systems, especially for:

Random SREF.
Style references.
Saved aesthetics.
Tags and colored tags.
Stronger filtering, grouping, and reuse.

But the team also raised a question: if the current folder system already lets one image belong to multiple folders, supports unlimited folders, and offers filtering and sorting, what exactly do tags provide that folders cannot?

That question is practical.

Many products add tags because users say they want tags. But a poorly designed tag system becomes another messy classification layer. If folders, tags, favorites, search, filters, projects, and style libraries have unclear boundaries, the system becomes harder to manage.

So the Midjourney team wants concrete workflow examples: in which scenario do users need tags? Why are folders not enough? Is it for combining styles quickly, reusing across projects, filtering by theme, color tone, photography style, or character relationship?

For Midjourney, the organization system may become as important as the generation model. Once users create long-term projects, the hard part is not generating one image, but managing thousands of images, hundreds of style directions, and repeated experiments.

Omni-reference points toward more complex character control

The source also mentioned that future Omni-reference / subject reference systems may support multiple character references at once and better separation of different subjects.

This maps directly to a long-running pain point in AI image generation: character consistency and multi-character relationships.

Keeping one character consistent is already difficult. Multiple characters are harder. Common problems include:

Character A’s traits leaking onto character B.
Identity confusion between multiple people.
Clothing, hair, and facial features changing across images.
Reference images influencing the whole style too strongly instead of controlling only the subject.

If Omni-reference can handle subject separation better, Midjourney becomes more useful for comics, storyboards, advertising visuals, character design, game concept art, and continuous narratives.

This is one of the areas worth watching after V7.

Midjourney is rethinking prompts

The summary includes a useful idea: language is an imperfect compression layer for imagination.

That sentence explains Midjourney’s product direction well.

Many users assume AI image generation is mainly about writing longer and more precise prompts. But in real creative work, image references, style references, moodboards, SREF, variations, regeneration, and post-processing are often more useful than a very long text prompt.

Team member Duncan’s workflow reflects this. He reportedly treats Midjourney as a sketchbook, combining moodboards, SREF, short prompts, high --r regeneration, strong and subtle variations, Photoshop retouching, and external upscaling workflows.

This shows mature Midjourney users do not work only through “magic prompts.”

A more realistic process is:

Use a small amount of language to set direction.
Use image references to provide visual context.
Use SREF to narrow the style.
Use many variations to explore the space.
Use human taste to select results.
Use external tools for post-processing.

Prompts still matter, but they are not everything.

What this means for users

If you only generate images occasionally, the most direct impact is that conversational mode should become easier to use. In the future, you may be able to describe desired aspect ratio, references, style, and parameters more naturally instead of memorizing commands.

If you are a heavy user, three areas deserve attention.

First, organization.

How SREF, styles, folders, favorites, and tags evolve will directly affect long-term creative efficiency.

Second, the website redesign.

If the new interface can connect exploration, organization, reuse, and export, Midjourney will feel more like a professional creative tool instead of a single generator.

Third, character and subject reference.

If Omni-reference can reliably handle multiple characters and subject separation, Midjourney becomes better suited for continuous projects rather than only single images.

Summary

The key point from Midjourney’s May 2026 Office Hours is not one flashy parameter. It is that the product is continuing to evolve toward a creative system.

Conversational mode lowers the input barrier. AI-assisted development increases iteration speed. The website redesign aims to reorganize workflows. SREF and tag discussions point to long-term asset management. Omni-reference relates to character consistency and complex subject control.

For AI image generation tools, model capability is obviously important. But once generation quality reaches a certain level, what determines whether users stay long term is often workflow, organization, controllability, and iteration speed.

Midjourney is filling in those pieces.

References

Midjourney 最新ニュース（2026年5月14 日）｜アキスケ

How OpenClaw Creator Peter Steinberger Sees AI Software Development: From OpenClaw to Closed-Loop Coding

Sun, 17 May 2026 20:02:26 +0800

Peter Steinberger’s career is a useful lens for understanding what is changing in AI software development.

He is not a newcomer who suddenly became visible because of AI. Before OpenClaw, he was already the founder of PSPDFKit, a company focused on PDF rendering, document processing, and developer tools. Products like that are hard to win with concept packaging alone. They have to deal with performance, compatibility, API design, enterprise customers, and long-term maintenance.

So when Steinberger later built OpenClaw with AI tools and shared views around AI agents, personal automation, and AI coding, the point was not simply that “one person wrote a lot of code.” The more interesting part is how he combined years of software engineering experience with a new generation of AI coding agents and rethought the development process.

AI coding is not a magic button

Discussions about AI coding often fall into two extremes.

One side says AI can already write code, so programmers are almost obsolete.

The other side says AI-generated code is unreliable, so real engineering still has to be hand-written by people.

Steinberger’s experience points to a third view: AI changes the unit of operation in software development, but it does not remove engineering judgment.

In the past, developers mainly worked around editing code. Requirements breakdown, architecture decisions, implementation, testing, and bug fixing all revolved around manual code changes.

Once AI coding agents enter the workflow, developers increasingly manage an execution system:

Explain the goal.
Provide context.
Set boundaries.
Let the agent modify code.
Run tests and checks.
Iterate based on results.

This is not simply handing the keyboard to a model. It is moving humans from “typing every line” toward “defining direction, designing feedback, and judging results.”

Why he is skeptical of calling it vibe coding

One phrase that often appears around Steinberger is vibe coding.

The term originally described a new style of development: developers describe ideas in natural language, let AI generate large amounts of code, then keep adjusting based on runtime results and feedback.

But Steinberger is not entirely sold on the phrase. Public coverage has noted that he sees vibe coding as potentially dismissive, implying that AI-assisted development is just “generating by feel” while ignoring the skill, judgment, and experience behind it.

That criticism makes sense.

Effective AI coding is not about typing a casual sentence and trusting the model’s output. It requires:

Breaking vague requirements into executable tasks.
Detecting when the model misunderstands the goal.
Designing tests and acceptance criteria.
Judging whether the code structure will remain maintainable.
Knowing when to stop generating and switch to human review.

In other words, AI reduces the friction of writing code, but it does not reduce the responsibility of understanding the system.

The loop is the key

One idea often associated with Steinberger’s interviews and writing is the importance of the loop.

Letting AI generate code is open-loop.

Letting AI generate code, run it, read errors, fix problems, and run tests again is closer to closed-loop development.

That difference matters.

Open-loop generation easily creates software that looks usable on the surface. The page opens, features appear to exist, and there is plenty of code. But once it enters a real environment, problems with state management, permissions, exception handling, edge cases, and deployment quickly appear.

Closed-loop development means output must be constrained by feedback. The simplest loop is:

Write down the goal clearly.
Let AI modify the code.
Automatically run tests, type checks, lint, or a build.
Feed errors back to AI.
Repeat until it passes.
Let a human review the critical path.

This is where AI software development can truly improve efficiency. Not because the model gets everything right the first time, but because it can participate quickly in a cycle of generation, validation, and repair.

More experience makes AI more useful

One of the easiest misconceptions about AI coding is that experience no longer matters.

Steinberger’s case suggests the opposite: experience becomes more important, but its role changes.

An experienced engineer is better at deciding:

Which tasks are suitable for an agent.
Which modules need tests first.
Which changes are too risky for broad AI refactoring.
Which generated code merely looks plausible.
Which problems should be solved through architecture rather than more patches.

AI can generate many candidate solutions. The more candidates you have, the more judgment you need. An inexperienced person may be impressed by “it runs.” An experienced engineer asks: can it be maintained? Can it scale? Does it break a security boundary? Can we debug it when something goes wrong?

That is why AI coding agents do not turn software engineering into pure chat. They outsource part of the execution work while amplifying planning, review, validation, and trade-off decisions.

OpenClaw matters beyond the project itself

OpenClaw drew attention not only because it is an open-source AI agent, and not only because it grew quickly.

It is also a signal: developers increasingly want AI to do more than answer questions. They want it to connect to real tools and perform real actions.

Traditional chatbots stay inside the chat box. They can explain code, write drafts, and give advice, but people still need to copy, paste, open software, and run commands.

The agent direction connects models to tools:

File systems.
Browsers.
Terminals.
Email.
Calendars.
Third-party services.
Project repositories.

Once models can use those tools, the boundaries of software development shift. AI is no longer just code completion. It can participate in project reading, task decomposition, file editing, test execution, PR preparation, and workflow automation.

That is also why Steinberger’s move to OpenAI drew attention. He represents not just a single developer story, but a product direction: personal agents moving from demos into everyday work.

What this means for ordinary developers

For ordinary developers, Steinberger’s experience is not something everyone can copy directly.

Not everyone can manage multiple agents at once. Not every project is suited to heavy AI generation. Not every team accepts a workflow of “generate first, iterate quickly.”

But several lessons are useful.

First, write tasks clearly.

AI is sensitive to vague goals. If you say “optimize this,” it may change style, structure, features, and logic. If you say “change the login failure message from English to Chinese without altering the authentication flow,” the result is usually more controllable.

Second, standardize validation commands.

If a project has no tests, no build command, and no lint, AI has trouble forming a loop. Even basic commands like npm test, go test ./..., pytest, or hugo are better than relying only on visual inspection.

Third, control the scope of changes.

Having AI handle one module, one bug, or one page at a time is usually more reliable than asking it to “refactor the whole project.”

Fourth, keep human review.

For authentication, payments, permissions, data deletion, deployment scripts, database migrations, and security configuration, do not lower the review bar just because the code was generated by AI.

Fifth, review prompts and failure patterns.

If AI often misunderstands a certain type of task, write those constraints into project rules, agent instructions, or skill files. AI coding capability comes not only from the model, but also from the work environment you build around it.

Where AI software development is going

Steinberger’s story suggests that AI software development is moving from “helping write code” toward “organizing software production workflows.”

Early AI coding tools were mainly useful for function completion, error explanation, and template generation. The shift now is that agents can work across files, call tools, run checks, and continue fixing based on feedback.

This points to several trends.

First, the productivity ceiling for individual developers will rise.

One person can push more prototypes, scripts, internal tools, and small products. But higher output does not automatically mean higher quality. The faster code is generated, the more validation matters.

Second, project structure becomes more important.

The clearer the code, tests, and documentation, the easier it is for AI to make correct changes. Messy projects are hard for humans and hard for AI.

Third, software engineers will look more like workflow designers.

In the future, what matters will not only be whether someone knows a programming language, but whether they can organize requirements, context, tools, tests, deployment, and permissions into a controlled loop.

Fourth, security boundaries become more sensitive.

If an agent can do things, it can also do the wrong things. If it can read files, run commands, and access services, then permissions, audit, and rollback become infrastructure for AI development environments.

Summary

The most valuable part of Peter Steinberger’s view of AI software development is not how much code AI generated. It is the development posture he demonstrates.

Humans are no longer only typing line by line inside an editor. They are designing goals, managing agents, building feedback loops, reviewing results, and adjusting the system. Code remains important, but it is no longer the only center of labor.

If traditional software development emphasized “writing the code correctly,” AI software development increasingly emphasizes “making the system continuously produce verifiably correct results.”

This is not just about lowering the engineering barrier. It changes the shape of engineering ability: from manual implementation toward task decomposition, context management, tool orchestration, automated validation, and final judgment.

References

Google Gemini Spark Leak: A 24/7 Gemini Agent May Be Coming

Sun, 17 May 2026 11:58:08 +0800

Google has not officially released Gemini Spark.

Current information about it mainly comes from internal Gemini Web test screens, community screenshots, TestingCatalog reporting, and 36Kr / Xinzhiyuan’s summary of related leaks. The consistent picture is that Gemini Spark BETA may be an always-on AI Agent that Google is preparing. Its positioning is no longer just a chat assistant, but an “everyday AI agent” that can handle email, online tasks, and multi-step workflows in the background.

So the boundary should be clear first: this is a leak analysis, not an official Google announcement. All features, naming, and launch timing still need to be confirmed by Google.

Bottom line

Based on currently exposed information, Gemini Spark has three key points:

It may be a 24-hour online Agent inside the Gemini system, not a normal chat model.
It may use broader personal context, including Google apps, chat history, tasks, logged-in websites, and location.
Its risks are as large as its appeal, because it may involve information sharing, remote browser data, purchases, and third-party service calls.

If Google really launches Spark, Gemini’s role will change from “AI that answers questions” to “AI that continuously handles tasks for you.”

What Gemini Spark is

TestingCatalog reported on May 14, 2026 that Google is testing Gemini Spark BETA inside Gemini Web. The exposed welcome text describes it as an everyday AI agent that can help users 24/7 with inbox, online tasks, and more multi-step work.

The 36Kr / Xinzhiyuan article also says that after Spark was uncovered, the outside world saw a “full-time Agent” direction: it can stay on standby all day, process inboxes, execute online tasks, and may even involve purchases and information sharing.

This means Spark is not simply a new model name. It looks more like a Gemini product-layer upgrade: bringing Gemini out of the conversation window and into users’ email, web, calendar, tasks, and cross-app workflows.

How it may work

According to the hidden onboarding text disclosed by TestingCatalog, Gemini Spark may gather context from multiple sources, including:

Connected Apps.
skills.
chats.
tasks.
Websites the user has logged into.
Personal intelligence.
location.

This information would help Spark understand what the user wants to complete and call the necessary context while executing tasks. The text also says that, to complete some actions, Gemini may share necessary information with third parties, such as names, contact details, files, preferences, and information the user may consider sensitive.

If these descriptions prove accurate, Spark will work more like a context-aware agent system than a one-shot Q&A assistant. It will not only look at the current prompt, but may combine long-term preferences, connected apps, browser state, and task history.

Why it matters

The key to Gemini Spark is not one more chat entry point. It is that Google has a natural ecosystem entry point.

OpenAI and Anthropic can build powerful Agents, but they do not naturally own the full chain of Gmail, Calendar, Drive, Chrome, Android, and Workspace. If Google connects Spark into these products, users will not need to build many extra workflows before letting an Agent enter daily work.

This may bring three changes.

First, Gemini may move from passive Q&A to active execution. Users may no longer only ask “summarize this email”; they may ask it to continuously organize the inbox, track tasks, and take follow-up actions.

Second, Agents will rely more on personal context. The more it understands your email, calendar, files, browser state, and preferences, the more useful the result may be.

Third, permission boundaries will become more sensitive. Doing more also means users need to know more clearly when it can act, how far it can go, and whether confirmation is required.

Where the risks are

Several details in the onboarding text disclosed by TestingCatalog are worth watching.

First, Spark is experimental. Even if it launches, it should not be treated as a fully mature system that needs no supervision.

Second, although the system is designed to ask for permission before sensitive operations, the text also warns that it may share information or complete purchases without asking.

Third, to maintain session continuity, Gemini will save remote browser data, such as login details and remote code execution data. Users can clear these data in Settings and can also disable Connected Apps and Personal intelligence.

Taken together, these points show that Spark’s product direction is aggressive: it wants to be an Agent that can truly execute tasks, not only generate suggestions. But the closer it gets to real execution, the more it needs strict permissioning, auditing, confirmation, and rollback mechanisms.

Relationship with Remy and AI Ultra

TestingCatalog says Spark may be a renamed version of the agentic Gemini upgrade previously codenamed Remy, and may also relate to the Gemini Agent direction for Google AI Ultra subscribers.

If this clue is correct, Spark may not be a brand-new project from nowhere. It may be Google repackaging previously higher-end or more closed Agent capabilities and preparing to bring them to a wider audience.

36Kr / Xinzhiyuan also describes it as an upgrade from “Remy” to “Spark”: Gemini Agent is no longer just a feature, but is moving toward a 24/7 digital life manager.

But this is still a judgment based on leaked information. Whether Google will use Spark as the official name, whether it will be limited to AI Ultra, and whether a lighter subscription tier will appear all need official confirmation.

MCP, skills, and the tool ecosystem

The same batch of community screenshots also showed model selector entries such as MCP Tool Testing. The 36Kr article suggests this may hint that the new Gemini will natively support MCP third-party tool integration, with Thinking mode also being rebuilt.

This clue becomes more interesting when viewed together with Spark.

If Spark were only a chat assistant, skills and MCP would matter less. But if Spark is a long-running Agent, it needs to reliably call tools, access web pages, execute tasks, read and write context, and deliver results to users.

In other words, Spark may not be a single feature. It may be part of Google’s Agent tool ecosystem: the model handles understanding and planning, while skills / MCP / connected apps handle execution and expansion.

What it means for ordinary users

If Gemini Spark really launches, ordinary users may see these direct changes:

Email is not only summarized, but can be categorized, followed up, and turned into tasks.
Web tasks are not only suggested, but may be continuously executed in a remote browser.
Calendar, location, preferences, and previous chats become long-term Agent context.
Purchases, bookings, form filling, and similar actions may enter the AI execution range.

This sounds convenient, but users will need new habits: not only checking what the AI says, but also what it is preparing to do, what it has already done, whether it can be undone, and whether there is a record.

Future AI Agent experience may depend not only on model intelligence, but also on clear permission prompts, inspectable task logs, and recovery from mistakes.

What it means for developers and teams

For developers, Spark matters because Google may be moving Agents from “demo products” toward real workflow platforms.

If Spark can reliably connect Google apps, third-party tools, and browser state, developers will care about:

Whether APIs or extension mechanisms are open.
Whether MCP or skills can be connected by third parties.
Whether enterprise admins can control permissions, data retention, and audits.
Whether Agent execution failures have traceable logs.
Whether sandboxing, approval flows, and sensitive-operation confirmation are supported.

For teams, Spark may first enter high-frequency scenarios such as Gmail, Calendar, Docs, Drive, and Chrome. It may not be suitable for fully automated high-risk work at the beginning, but it is a strong fit for inbox triage, meeting follow-up, document organization, market research, and lightweight operations tasks.

How to read it now

This story is best understood as “high-confidence direction, low-certainty details.”

The high-confidence direction is that Google is pushing Gemini Agents to be more proactive, longer-running, and more deeply integrated with its ecosystem. The Gemini Web test text reported by TestingCatalog, community screenshots, and 36Kr’s summary of multiple leaks all point in the same direction.

The low-certainty details are the official name, launch timing, permission rules, subscription tiers, supported regions, API availability, and whether it will really be called Gemini Spark.

The safest view for now:

Do not treat Spark as an already released official product.
Treat it as a strong signal for Google’s next AI Agent direction.
Wait for official explanations around permissions, privacy, third-party data sharing, and remote browser data storage.

Summary

If Gemini Spark eventually launches, it may be a key step in Gemini’s move from chat assistant to always-on Agent. It is not just a model swap; it places Gemini into Google’s ecosystem of email, web, tasks, location, personal intelligence, and third-party services.

Its potential is large: more proactive, closer to real workflows, and easier to distribute to many users through Google’s ecosystem. Its risks are just as large: once AI can share information, save browser state, make purchases, and call third-party services, permission boundaries must be extremely clear.

So the most important question about Gemini Spark is not “how smart is it”, but how Google plans to make a 24-hour online AI Agent controllable, auditable, and trustworthy.

References:

Gemini 3.5 Pro Leak: Codenamed Cappuccino, Google Tries to Regain Momentum in Coding and Agents

Sun, 17 May 2026 11:47:27 +0800

Google has not officially released Gemini 3.5 Pro.

What we can see so far mainly comes from developer community screenshots, anonymous benchmarks, leakers, and media reports. On May 15, 2026, 36Kr / Xinzhiyuan reported that a next-generation Gemini checkpoint may be internally codenamed Cappuccino, and that related models have already surfaced in communities and benchmark platforms.

This information should not be treated as an official launch, but it points in a clear direction: Google is trying to address two gaps at once, coding and reasoning on one side, and always-on AI agents on the other.

Bottom line

This leak can be read in three layers:

Gemini 3.5 Pro has not been officially released, and Cappuccino looks more like an internal checkpoint or candidate build.
The leaked information suggests the new Gemini is improving in code generation, SVG / interactive web generation, and multimodal output.
Google’s parallel test of Gemini Spark may matter more than the model itself, because it points to a 24-hour personal AI agent.

In other words, this is not just a “model benchmark” story. It looks more like a product roadmap signal ahead of Google I/O: the model needs to catch up with GPT-5.5, while the agent layer needs to capture user workflows.

What Cappuccino is

The 36Kr article says a post from Lentils indicates that the Gemini 3.5 Pro checkpoint codenamed Cappuccino has started to appear. The community had been discussing Gemini 3.2 only hours earlier, but the latest leak jumped directly to 3.5.

If that naming is ultimately accurate, Google may want to frame the next Gemini as a larger version jump rather than a routine point release.

For now, Cappuccino should still be treated as a leaked internal codename. It does not mean Google has publicly launched the final model, and it does not guarantee that the final release name will be Gemini 3.5 Pro.

Why coding is the focus

The most discussed part of the leak is the new Gemini’s coding ability.

According to community screenshots and benchmark claims cited by 36Kr, the new model appears stronger at:

Generating SVG and visual components.
Generating interactive web apps.
Handling animation, 3D, adjustable control panels, and other complex frontend outputs.
Improving logical reasoning and code generation.

The article also cites Abacus.AI CEO Bindu Reddy as saying that 3.2 Flash is close to GPT-5.5 in coding and reasoning while being much cheaper. Other media sources reportedly believe the new Gemini roughly reaches the GPT-5.5 tier overall, but may not represent a qualitative leap.

That is why the phrase “matches GPT-5.5” needs caution. It is more of a relative judgment from different leaks and anonymous tests than an official Google benchmark result.

Why Google needs to catch up in coding

AI coding has moved from developer tooling into the center of foundation model competition.

OpenAI has Codex, and Anthropic has Claude Code. They serve engineers, but they also bring product managers, designers, and operators into workflows where natural language can produce runnable products.

By comparison, Google has Gemini and Antigravity, but it has not formed the same default entry point in developer mindshare. The 36Kr article also notes that Antigravity has not truly broken through externally, and that pricing, quota reminders, and experience stability have drawn community discussion.

So if the new Gemini needs to prove itself, coding is the most direct battlefield. The question is not only whether it can write code, but whether it can reliably produce complete interfaces, understand complex requirements, call tools, fix errors, and fit into real development workflows.

Spark may matter more than 3.5 Pro

In the same wave of leaks, Gemini Spark BETA also surfaced.

According to TestingCatalog and other sources, Spark is positioned like an always-on AI agent: it can process inboxes, execute online tasks, manage multi-step workflows, and connect context from Google apps, skill modules, chat history, scheduled tasks, logged-in websites, and location data.

That means Spark is not a normal chat entry point. It may be a system that stays online, continuously reads context, and performs tasks for users.

Its appeal is obvious: if Google can connect Gmail, Calendar, Chrome, Android, Workspace, and Gemini, Spark will have a distribution advantage that OpenAI and Anthropic cannot easily copy.

The risk is just as obvious. The 36Kr article mentions wording around Spark saying it may share information or complete purchases without asking. Even if the system is designed to request permission before sensitive operations, this kind of agent still raises privacy, authorization-boundary, and accidental-action risks.

What this means for ordinary users

If you are a regular Gemini user, the most important part of this leak is not the model name. It is three shifts.

First, Google may continue to strengthen the ability to produce complete results. Users have often complained that Gemini can be lazy with visual generation, SVG, and frontend pages. If the new model can produce several complete options in one pass, the experience will improve noticeably.

Second, coding ability may continue to move into lighter models. The leak repeatedly mentions Flash improvements in coding, reasoning, and interactive generation, which means complex tasks may not always require Pro models in the future.

Third, agents will become more proactive. If Spark launches, Gemini may no longer just answer questions. It may start taking over email, web tasks, purchases, calendars, and cross-app workflows over longer periods.

That is good for efficiency, but it creates a new challenge for permission management.

What this means for developers

Developers should watch two issues more closely.

The first is tooling. The 36Kr article says community screenshots showed an unreleased entry called MCP Tool Testing in the model selector. If Gemini natively supports MCP or third-party tool testing, it will be easier to connect it to developers’ own toolchains.

The second is cost and stability. Even if the new Gemini matches GPT-5.5 on some benchmarks, developers will ultimately judge three things: actual code quality, context stability, and whether pricing and quotas are predictable.

The past year of AI coding tool competition has shown that model capability is only the ticket in. What keeps developers is whether the tool can reliably edit code, run tests, read context, and handle edge cases in daily projects.

How to read this news now

This story is best understood as “strong signal, weak confirmation.”

The strong signal is that multiple community clues point to Google preparing a stronger new Gemini and a more proactive Gemini Spark Agent.

The weak confirmation is that Gemini 3.5 Pro has not been officially released, Cappuccino remains a leaked codename, and claims that it “matches GPT-5.5” still need validation through official Google benchmarks, third-party tests, and real user experience.

The safest view for now:

Do not treat it as a released product.
Treat it as an early preview of Google’s next Gemini direction.
Watch whether I/O or later official events confirm the model name, API availability, pricing, context window, tool calling, and agent permission boundaries.

Summary

The exposure of Gemini 3.5 Pro / Cappuccino suggests Google may be preparing a stronger next-generation Gemini push. It is not trying to fix one isolated capability, but a whole AI workflow: the model needs to write code better, generate interfaces, and handle complex reasoning, while Spark pushes Gemini toward an always-on agent.

But before an official release, all benchmarks and screenshots remain clues. What will decide whether Gemini 3.5 Pro can regain momentum is not whether the codename sounds good, but whether it can reliably win in real development, real office work, and real multi-step tasks.

References:

easy-vibe: A Learning Map for Vibe Coding Beginners

Sat, 16 May 2026 22:44:43 +0800

easy-vibe is an open source Vibe Coding learning project from Datawhale. It is not aimed at developers who are already fluent with AI coding tools. It is aimed at students, product managers, designers, operators, indie developers, and technical hobbyists who are just starting with Vibe Coding.

The value of this project is not that it lists another batch of AI tools. It turns “how to start building projects with AI” into a learning path that is easier to understand. For many beginners, the hard part is not knowing that Claude Code, Cursor, MCP, or Agents exist. The hard part is knowing what to learn first, how to practice, and when to move into more advanced tools.

Beginners Need a Path Most

Vibe Coding has become popular in recent years, but it is not very friendly to beginners.

On the surface, as long as you can describe a requirement, you can ask AI to write code. In reality, as soon as the task becomes slightly more complex, problems appear: the requirement is unclear, the model edits the wrong file, the project structure is confusing, errors are hard to handle, dependencies fail to install, prompts become messier, and the workflow falls back to “copy code into a chat box”.

So getting started with Vibe Coding cannot only mean learning “how to write prompts”. It needs to solve several things:

How to split an idea into executable tasks;
How to let AI understand a project structure;
How to read code generated by the model;
How to handle errors and iterate;
How to use the terminal and local development environment;
How to move from web chat to real AI coding tools.

This is where easy-vibe matters: it tries to organize these topics into a learning route, instead of leaving beginners lost among tools, tutorials, and terminology.

It Is a Roadmap, Not a Single Tutorial

According to the project description, easy-vibe covers basic tutorials, interactive exercises, visual content, RAG, terminal tools, AI coding tools, and more advanced topics such as Claude Code, MCP, Skills, and Agent Teams.

This structure is suitable for beginners because AI coding is not a single skill. It is a combination of abilities:

Describing requirements;
Splitting tasks;
Reading projects;
Asking the model to edit code;
Running and verifying results;
Iterating based on errors;
Turning repeated workflows into tools or skills.

If you only learn one tool, it is easy to be constrained by that tool’s interface. Switch models, editors, or CLIs, and the workflow becomes unclear again. A roadmap helps build the working method first, then places tools where they belong.

Especially Useful for Non-Programmers

The biggest appeal of Vibe Coding is that it lets non-professional programmers build prototypes.

Product managers can turn product ideas into interactive demos. Designers can validate interaction logic. Operators can write internal tools. Students can quickly build course projects. Founders can validate demand early. These people do not necessarily need to become full-time engineers in the traditional sense, but they do need a method for “letting AI help me turn ideas into working things”.

This is also why easy-vibe fits the Chinese community. Many Chinese users already know AI can write code, but they still lack systematic beginner materials. Development environment, prompts, project structure, debugging methods, and Agent tools are easier to learn when explained clearly in Chinese and paired with exercises.

For these users, the most important thing is not to learn a complex framework immediately. It is to complete a full loop first: propose a requirement, generate a project, run it, find problems, keep modifying, and finally get a usable version.

The Advanced Part Moves Toward Real AI Development Workflows

The Claude Code, MCP, Skills, and Agent Teams mentioned in easy-vibe are no longer just beginner concepts.

Claude Code represents terminal coding Agents: the model can enter a local project, read files, edit code, and run commands. MCP solves tool and data source integration, so the model is not trapped in a chat box. Skills preserve reusable workflows, such as fixed project generation, document organization, test checks, or content production processes. Agent Teams further split tasks across multiple agents.

These topics may feel distant for beginners, but they are worth understanding early. The direction of Vibe Coding is already clear: from “let AI write a piece of code” to “let AI participate in a complete project workflow”.

If a learning route stops at prompts, it will quickly fall behind tool evolution. On the other hand, if every advanced concept is thrown at beginners immediately, they will not know where to start. The useful part of easy-vibe is that it places these topics on a gradual upgrade path.

Two Mistakes to Avoid

The first mistake is thinking that Vibe Coding means you can ignore code entirely.

AI can generate a lot, but the user still needs to judge whether the result is correct. At minimum, you need to understand the project structure, know how to run it, and roughly know where an error is happening. Even if you do not write complex code, you still need basic engineering common sense.

The second mistake is thinking that more advanced tools are always better.

Beginners do not necessarily need Claude Code, MCP, or multiple Agents at the start. A better order is to first build a feedback loop with simple projects, then gradually introduce the terminal, version control, testing, tool calling, and automated workflows. Tools should match task complexity; otherwise they look powerful but have no clear use.

How to Use It

If you are just starting with Vibe Coding, you can use easy-vibe as a learning checklist.

Start with basic concepts and simple exercises. Do not rush to chase every tool. Build a small project, such as a personal homepage, data dashboard, form tool, automation script, or knowledge base demo. During the process, observe where AI helps and where you still need to confirm things yourself.

Once you can complete small projects consistently, move into more complex topics:

Use terminal tools to work with local projects;
Use Git to manage each change;
Use RAG to connect your own materials;
Use MCP to connect external tools;
Use Skills to solidify repeated workflows;
Use Agent Teams to split complex tasks.

Learning Vibe Coding this way is not just learning to ask AI. It is learning to put AI into your own workflow.

Conclusion

easy-vibe is best seen as a Chinese learning map for Vibe Coding. It organizes scattered AI coding concepts, tools, and exercises into a route that helps beginners move from “I heard AI can write code” to “I can build a project with AI”.

The real value of Vibe Coding is not that it lets people skip all learning. It lowers the threshold from idea to prototype. You still need to understand requirements, organize tasks, verify results, and control risks. But many repetitive, tedious, and blocking steps can be handled with AI assistance.

If you want a systematic entry point into AI coding, without getting trapped immediately in tool names and complex engineering setup, easy-vibe is a good place to start.

Anthropic financial-services: Reusable Templates for Financial Agents

Sat, 16 May 2026 22:43:08 +0800

anthropics/financial-services is a reference project from Anthropic for the financial services industry. It is not a single application, but a set of examples that can be studied and reused separately: Agents, Plugins, Skills, MCP connectors, and prompts and integration patterns designed around financial workflows.

This project is worth watching not because it provides a “universal financial assistant”, but because it breaks common AI implementation problems in finance into more concrete components: what kind of Agent each role needs, which data sources need to be connected, which tasks can be automated, and which steps still require human judgment.

It Is More Like a Showroom for Financial Agents

When companies talk about AI Agents, the discussion can easily stay abstract: reading files, querying data, writing reports, and calling tools. Once the scenario enters finance, the questions become much more specific.

Investment banking analysts need to organize company materials, generate transaction briefs, and compare comparable companies. Equity research needs to read filings, follow news, perform valuation, and analyze risks. Private equity and asset management teams need to screen deals, write memos, and track portfolio companies. Wealth management needs to place client profiles, market information, and investment advice within a compliance framework.

These scenarios cannot be handled by a generic chat box alone. They require roles, processes, data sources, output formats, and permission boundaries. The value of this Anthropic repository is that it turns multiple typical financial services roles and tasks into Agent templates that can be used as references.

Why Provide Agents, Plugins, Skills, and MCP Together

Judging from the project structure, Anthropic did not only provide a set of prompts. It provides several kinds of components at the same time. This maps to several layers of enterprise Agent implementation.

Agents are more like work units for roles or tasks. They define what the agent should do, how it should do it, when to call tools, and how to produce output.

Plugins are more like external capability extensions. Financial work rarely happens only inside the model. It often needs to connect databases, document systems, market data, CRM, research libraries, and internal workflow systems.

Skills are reusable professional capability packages. Fixed analysis frameworks, report structures, checklists, and data processing methods can be turned into skills instead of being rewritten as prompts every time.

MCP connectors solve tool integration and context standardization. For enterprises, the more tools there are, the more they need a relatively unified way to connect them. Otherwise every system needs separate adaptation, and maintenance cost rises quickly.

Only when these pieces are combined does the result begin to resemble a real enterprise AI workflow.

Why Finance Is a Good Industry for Agent Examples

Financial services is a good industry for showing Agents because it has three traits at the same time.

First, information density is high. Financial work relies heavily on filings, announcements, meeting notes, research reports, trading data, client records, and regulatory documents. If a model only relies on general knowledge, it quickly becomes ineffective. It must connect to real data sources.

Second, output formats are stable. Investment memos, company profiles, KYC documents, research summaries, client briefings, and fund operation reports all have relatively fixed structures. This makes it easier for Agents to form verifiable workflows.

Third, risk boundaries are clear. Finance has strict requirements for compliance, auditability, permissions, and traceability. AI cannot casually provide investment advice or bypass approval processes. This forces Agent design to become more engineering-driven: keep references, separate facts from inferences, record tool calls, and limit executable actions.

That means this project is not only for financial companies. Any team building enterprise Agents can use it to observe how Anthropic decomposes industry scenarios.

What Typical Workflows It Covers

According to the project description, the repository covers several financial services areas, including:

Investment banking;
Equity research;
Private equity;
Wealth management;
Fund operations;
KYC and compliance-related workflows.

These workflows have one thing in common: they all require a lot of reading, organizing, comparison, and structured document generation. The best role for AI here is not to make decisions directly, but to reduce the time spent on information processing and document production.

For example, in investment banking, an Agent can help organize target company information, extract key financial metrics, and generate a first draft of a transaction summary. In research, it can read filings and news first, then list key changes and open questions. In KYC, it can help check whether materials are complete and whether there are unusual signals.

The final judgment should still belong to professionals. The Agent’s role is closer to assistant, analyst, and workflow accelerator.

What It Suggests for Enterprise Adoption

The most useful part of this repository is that it turns “model capability” into “business components”.

Internal AI projects often run into the same problem: model demos look impressive, but once they are connected to real business, they are hard to reuse. One team writes one set of prompts, another team writes another. One system connects a database, another builds its own interface. Security and audit requirements are scattered everywhere.

A steadier approach is to split capabilities into several types of assets:

Role-oriented Agents;
Process-oriented Skills;
MCP connectors for system integration;
Execution rules for permissions and audit;
Templates and checklists for business output.

The benefit is that the enterprise does not restart from “building a chatbot” every time. It gradually accumulates maintainable AI workflow assets.

Compliance and Responsibility Boundaries Cannot Be Ignored

The easiest misunderstanding around financial Agents is treating “can generate analysis” as “can replace decisions”.

In financial services, AI output should usually be treated as supporting material. It can organize facts, draft documents, highlight risks, and complete files, but it cannot bypass investment research, risk control, legal, compliance, and suitability requirements. Especially when investment advice, trading decisions, asset allocation, or identity checks are involved, human approval and responsibility chains must remain.

That is why enterprise Agents cannot be evaluated only by answer quality. They must also be evaluated by:

Whether data sources are reliable;
Whether references and evidence are traceable;
Whether tool calls are recorded;
Whether sensitive data is restricted;
Whether output has human confirmation;
Whether wrong results can be discovered and rolled back.

If these questions are not solved, the more automated the Agent becomes, the larger the risk radius becomes.

Conclusion

anthropics/financial-services is more like a financial Agent reference implementation than an out-of-the-box financial product. It shows one way Anthropic thinks about enterprise AI adoption: do not build only generic chat assistants; organize Agents around specific roles, specific workflows, specific data sources, and specific permission boundaries.

For financial institutions, it can serve as a reference for designing internal AI workflows. For developers, it is a sample for observing enterprise Agent architecture: Agents handle roles and tasks, Skills preserve professional processes, Plugins and MCP connect external systems, and the model eventually enters real business workflows.

If early AI tools solved “how to make models answer questions”, projects like this care more about “how to let models participate in work within controlled boundaries”. That is where enterprise Agents become truly difficult.

DeepSeek-TUI: Turning DeepSeek V4 into a Terminal Coding Agent

Sat, 16 May 2026 22:41:41 +0800

DeepSeek-TUI is an open source project that brings DeepSeek V4 into terminal-based development workflows. It is not just a chat wrapper. It is closer to a “command-line coding agent” like Claude Code or Codex CLI: it can read files, edit code, run commands, call tools, and keep working through tasks in a TUI.

If you already switch between an editor and a terminal, the value of this kind of tool is straightforward: you do not need to copy code back and forth into a web chat window, and you do not need to manually describe the whole project structure. You give it a task, and it can read context from the current workspace, plan steps, make changes, then return the result for your review.

It Solves the Entry Point Problem for DeepSeek

DeepSeek models already provide strong reasoning and coding capabilities, but model capability needs an engineering layer before it can land in real development workflows.

Web chat is suitable for asking questions, but not for long-running project edits. APIs are suitable for system integration, but individual developers still need to build tool calling, context management, file operations, and permission control themselves. DeepSeek-TUI tries to fill this layer: it wraps DeepSeek V4 into an Agent that can work inside the terminal.

According to the project description, its main capabilities include:

A terminal TUI;
Conversation and task execution for DeepSeek V4;
Tool calling and file operations;
1M context support;
Auto mode;
Sub-agents;
Sandboxed execution;
A persistent task queue.

Together, these features are not aimed at making the model sound more human. They are aimed at making the model easier to bring into the development environment.

A TUI Fits Long Tasks Better Than Plain CLI Text

Many AI CLI tools start with plain text interaction: enter a prompt, wait for output, then copy commands or add more context. This is simple, but longer tasks quickly become messy.

The advantage of a TUI is that it can place conversations, files, execution results, and task status in a more stable interface. For a coding Agent, that matters. A code task is rarely a single question and answer. It often includes:

Understanding the project structure;
Finding relevant files;
Editing code;
Running tests or commands;
Fixing issues based on errors;
Summarizing changes.

If the interface is only a stream of logs, it is hard for the user to see where the Agent is in the process. A TUI at least provides a better place to observe and take over.

Auto Mode Is Best for Tasks with Clear Boundaries

The Auto mode mentioned by DeepSeek-TUI is best for tasks with clear boundaries. For example: fixing a small bug, adding a script, changing a configuration, organizing a set of documents, or implementing a local feature.

These tasks have something in common: the goal is clear, the verification method is clear, and the impact scope is controllable. The Agent can inspect files, edit files, run commands, and then hand the result back to the user for confirmation.

But Auto mode should not mean unlimited permission. In real projects, file deletion, large-scale refactors, database migrations, and deployment commands should all require explicit confirmation. The efficiency of coding Agents comes from automation, but so does the risk. The more a tool can execute commands, the more it needs sandboxing, permission boundaries, and human review.

Sub-Agents Matter Because They Split Tasks

Sub-agents are not a new concept, but they are useful in coding scenarios.

A moderately complex task usually requires several kinds of work at the same time: someone reads the code, someone changes the implementation, someone checks tests, and someone organizes documentation. Traditional multi-agent systems often feel ornamental because they have no real tools or real workspace; they only discuss inside a conversation.

If sub-agents can work with the file system, command execution, and task queues, they become more like a task decomposition mechanism. For example, one sub-agent can analyze dependencies, another can modify a specific module, and the main agent can integrate the result. This can reduce the problem of putting too much unrelated information into one context.

Of course, sub-agents also add cost: more tokens, more complex state, and responsibility boundaries that are harder to track. They are better suited to medium-complexity tasks and above, not necessarily every small edit.

1M Context Is Not Magic, but It Helps with Projects

1M context sounds exaggerated, but in coding scenarios it is not just a marketing number.

The context of a real codebase is fragmented: README files, configuration files, type definitions, tests, call chains, historical conventions, and error logs can all affect one change. Longer context can reduce the problem of editing after seeing only a local fragment, and it can help the model retain more project constraints.

Still, longer context does not automatically mean better judgment. Code tasks still need retrieval, filtering, and verification. Putting an entire project into context is not necessarily better than reading the relevant files precisely. A good coding Agent should treat long context as a buffer, not as a shortcut that replaces engineering judgment.

Who It Is Best For

DeepSeek-TUI is better suited to several groups:

Developers who want to use DeepSeek for coding tasks in the terminal;
People who do not want to build tool calling and file operation frameworks themselves;
Users familiar with Claude Code or Codex CLI who want to try a DeepSeek-based entry point;
People who need local project context instead of only asking about code snippets in a web page;
Developers who want to put AI coding workflows into a command-line environment.

If you only occasionally ask how to write a function, web chat is enough. If you want the model to participate directly in project edits, a terminal Agent becomes more meaningful.

Risks to Watch

There are three things to watch most closely with this kind of tool.

The first is permissions. As long as a tool can read and write files or execute commands, you need to know what it can access by default, whether it can delete files, whether it can access the network, and whether dangerous commands require confirmation.

The second is rollback. Before using it, it is best to keep the Git working tree clean, so every Agent change can be clearly seen through git diff. Do not let an Agent automatically edit a project while many unrelated changes are already uncommitted.

The third is verification. Code written by an Agent does not mean the task is complete. Tests, builds, linting, and human review still need to remain. AI coding tools can speed up progress, but they cannot replace final engineering confirmation.

Conclusion

The significance of DeepSeek-TUI is not that it adds another chat client. It puts DeepSeek V4 into a terminal environment that is closer to real development work.

For developers, model capability is only the first step. The real experience depends on whether it can read a project, safely edit files, run verification commands, maintain state in long tasks, and let the user take over at any time.

If you want to use DeepSeek for daily code changes, project reading, and automated development tasks, DeepSeek-TUI is worth watching. The direction is also clear: AI coding tools are moving from “answering code questions” to “participating in project execution.”

How Did AI Agents Evolve? A Complete 2022-2026 Five-Generation Timeline

Sat, 16 May 2026 19:19:52 +0800

AI Agents did not appear overnight.

At the end of 2022, ChatGPT was still mainly a chat window. By 2026, agents had begun to gain tool calling, file operations, computer control, long-term memory, remote collaboration, and persistent execution. In four years, they moved from “models that answer questions” toward “digital workers that can move tasks forward.”

If we look at the timeline, AI Agents have roughly gone through five generations. Each generation solved the previous one’s core limitation, while creating new bubbles and new safety problems.

Overview: five generations of Agents

Stage	Time	Keyword	Capability shift	Core problem
Generation 0	Late 2022 - early 2023	Chat box	Generates text, but cannot act	Model and real world are disconnected
Generation 1	Mid-2023 - late 2023	Tool calling	Outputs structured calls, connects APIs and RAG	Open-loop execution and task drift
Generation 2	Late 2023 - 2024	Engineered workflows	Planning, state, reflection, and multi-agent collaboration	Workflows are easy to copy; low-code bubble
Generation 3	2024 - 2025	Computer Use	Sees screens, clicks, and operates GUIs	Permission, safety, and misoperation risks
Generation 4	2025 - 2026	MCP / Skills / persistence	Tool networks, long-term context, and professional skills	Persistent execution expands the risk radius
Generation 5 preview	After 2026	Loops and world models	Stronger memory, validation, and physical action	Governance becomes harder

Late 2022: Generation 0, the ChatGPT chat-box era

Generation 0 begins with the release of ChatGPT on November 30, 2022.

This generation was not yet a real Agent. It had strong language generation ability, but it was mostly trapped in a chat box. It could write Python code, but not run it on your computer. It could plan a trip, but not book tickets. It could tell you how to edit a file, but not enter the file system and make the change.

Its capability boundary was clear:

understand natural language;
generate articles, answers, code, and plans;
no active access to fresh data;
no stable access to internal company knowledge;
no external action;
no long-term task state.

The core issue was the break between model capability and the real world. It could think and speak, but not act.

This stage also produced the first bubble: prompt engineers, prompt template markets, prompt courses, and prompt certifications. Early models were indeed sensitive to prompts, but the market mistook a temporary patch for a long-term moat.

As GPT-4-level models, system prompts, function calling, and better product defaults matured, many prompt templates lost scarcity. This pattern would repeat: a new capability creates a middle layer; the next generation internalizes it; the middle layer evaporates.

Mid-2023: Generation 1, tool calling wakes up

The keyword for Generation 1 is tool calling.

In June 2023, OpenAI released function calling. Developers could describe function names, purposes, parameter types, and JSON Schema. After understanding a user request, the model could output a structured JSON call instead of ordinary natural language, and an external system would execute it.

The architectural significance was large: the model started moving from a brain that only talks to a brain that can drive external tools.

Key capabilities included:

choosing tools based on user intent;
outputting structured arguments;
calling external APIs;
feeding API results back into the model;
using RAG to access external knowledge;
forming early personas through plugins and knowledge bases.

At the same time, RAG and vector databases became popular. They addressed the model’s lack of fresh information, private enterprise materials, and internal knowledge. The system retrieved relevant document chunks, injected them into context, and let the model answer from those materials.

The basic Agent structure became:

who you are: system prompt and persona;
what you know: knowledge base, RAG, private documents;
what you can do: function calling, plugins, external APIs.

The most dramatic bubble of this generation was AutoGPT. It showed an attractive idea: the user gives a broad goal, and AI breaks it down, searches, writes files, evaluates, loops, and stops when it believes the work is done.

But AutoGPT quickly exposed the problem. It lacked state constraints, stopping conditions, and reliable feedback. Tasks drifted, APIs were called with bad arguments again and again, and bills could be burned by huge numbers of model calls. The lesson was simple: tools plus an infinite loop do not make a production-grade Agent.

Late 2023 to 2024: Generation 2, engineered workflows

AutoGPT’s failure taught the industry that models cannot simply be left to improvise. Complex tasks need structure.

Generation 2 is about engineered workflows. An Agent became not just one model call, but a software system with state, control flow, and evaluation.

Key capabilities included:

task planning: breaking large goals into steps;
state management: tracking where work stands;
reflection and revision: generating, reviewing, and improving;
tool orchestration: switching between tools;
human-in-the-loop: asking for confirmation at key points;
multi-agent collaboration: dividing roles.

A typical pattern is ReAct, or Reasoning + Acting. The model reasons, calls a tool, observes the result, and then reasons again. The Agent no longer acts blindly; each step has auditable logic and feedback.

Common agentic workflow patterns emerged:

reflection: generate, review, revise;
tool use: choose search, databases, code execution, and enterprise APIs;
planning: decompose goals and track state;
multi-agent collaboration: product, developer, tester, reviewer roles.

The value of Generation 2 was putting model capability inside a controllable process. A well-designed workflow can sometimes make a smaller model produce more stable results than a single large-model call.

This generation also produced the low-code Agent platform bubble. Many tools used drag-and-drop interfaces to combine prompts, RAG, plugins, and flows. They lowered the building barrier, but if a workflow can be copied cheaply, the platform itself has a weak moat.

Low-code tools can capture early demand, but a demand window is not a defensible wall.

2024 to 2025: Generation 3, Computer Use reaches real interfaces

The keyword for Generation 3 is Computer Use.

Earlier tool calling relied mostly on APIs. What an Agent could do depended on what developers had connected. But many real-world apps do not have clean APIs, or their APIs are incomplete, closed, or inconsistent.

Computer Use lets models look at screens, click, and operate GUIs. The general computer interface itself becomes a tool.

Key capabilities included:

recognizing screen content;
clicking buttons, typing text, switching windows;
operating web and desktop software;
reading repositories, editing files, running tests;
inspecting terminal output and errors;
behaving more like a real engineering assistant.

This pushed Agents from “using connected tools” toward “operating software like a person.” It also made coding agents closer to real workflows: read a project, change code, run tests, and continue from errors.

But the trust boundary expanded. If AI operates a computer, it can click the wrong button, delete the wrong file, submit the wrong form, or be manipulated by webpage text, documents, and UI instructions. Prompt injection becomes a file-operation, permission, and system-safety problem.

Vibe coding debates also concentrated in this stage. Fast AI-generated projects feel exciting, but without tests, evaluation, permissions, and deployment boundaries, fast prototypes can become fast incidents.

Generation 3’s lesson: the closer an Agent gets to real operations, the more it needs sandboxing, approvals, rollback, and least privilege.

2025 to 2026: Generation 4, MCP, Skills, and persistent digital workers

Generation 4 is about persistence, connection, memory, and specialization.

The focus is not only stronger single tasks. Agents start to have long-term context, tool networks, professional skills, and a sense of time. They become less like helpers in one chat and more like digital workers that can continue working.

MCP addresses tool connection. It lets Agents connect to file systems, databases, browsers, design tools, project management tools, and enterprise systems in a more standardized way. Once the protocol stabilizes, many “tool-connection middle layer” products get compressed.

Skills address professional method. Tools tell an Agent what it can do; skills tell it how to do the work. A good skill is not just a prompt. It packages domain workflows, constraints, checks, common pitfalls, and tool-call order.

Key capabilities included:

long-term memory: storing preferences, project rules, and history;
project context: understanding repositories, docs, and work rules;
tool networks: connecting through MCP, APIs, browsers, and file systems;
professional skills: packaging task methods through Skills;
persistent execution: waiting, waking, reminding, and following up;
remote collaboration: users can return from different devices to approve and steer.

This generation starts to feel like an employee:

identity and responsibility boundaries;
long-term context;
professional work methods;
time awareness;
tool permissions;
ability to continue work without being watched.

But the more it resembles an employee, the more its risk radius resembles an employee’s. Persistent execution, local data access, secrets, tool calls, and task handling move security from the edge to the center.

One point matters especially: text is also an attack surface. If an Agent reads and follows Markdown, documentation, skill packs, or webpages, malicious text can change its behavior. Prompt injection becomes a supply-chain, permission, and execution-safety problem.

Generation 4’s lesson: persistent Agents need governance, not just capability.

After 2026: Generation 5 preview, loops, internal memory, and world models

Generation 5 is not established history yet. It is an extrapolation from the previous four years.

The first direction is more complete closed loops.

A mature Agent needs at least three loops:

execution loop: verify after each action, rollback, revise, and retry if needed;
time loop: track long-term goals across multiple wake cycles;
cognitive loop: know what is certain, what is guessed, and what is outdated.

The second direction is internal memory.

Most memory so far is outside the model: RAG, vector stores, chat logs, local files, and memory.md. If future model architectures support persistent state across sessions, Agent memory systems may be rebuilt.

The third direction is world models.

Many Agents today are still reactive: observe, respond, observe again. High-risk tasks require the model to simulate consequences. Before changing a database script, it should think about data loss, rollback failure, and compatibility issues, not learn only after an accident.

The fourth direction is embodiment.

Earlier generations mainly happened in digital space: APIs, screens, files, browsers, and enterprise tools. The next step may extend Agent action into the physical world, including robots, device control, industrial systems, and standardized physical interfaces.

Generation 5 will need to solve not only how Agents execute tasks, but how they understand consequences, manage long-term state, and stay reliable inside a larger risk radius.

Six patterns behind the timeline

First, base-model capability remains the ceiling. An Agent is not magic outside the model; it is a way to release model capability through engineering systems.

Second, engineered architecture amplifies model capability. Planning, verification, reflection, revision, evaluation, and permission control are closer to deliverable work than one-shot generation.

Third, open protocols reshape value distribution. Once MCP, Skills, and project-context standards stabilize, competition shifts from “who connected the tool first” to “who accumulated real domain capability.”

Fourth, the hidden main line of Agent evolution is expanding human-machine trust. From trusting text, to API calls, to workflows, to computer operations, to persistent execution, each generation pushes the risk radius outward.

Fifth, every generation’s accidents become the next generation’s rules. AutoGPT’s loops pushed structured orchestration; vibe coding failures pushed evaluation-driven development; production deletions pushed least privilege and sandboxing; skill poisoning pushed supply-chain safety.

Sixth, the Agent ecosystem repeatedly booms and collapses. New capabilities create temporary middle layers, and model or platform internalization later removes them. Mistaking a time window for a moat is dangerous.

The real moat

The real moat in AI Agents is not packaging a new capability first.

More reliable moats include three things.

First, vertical depth. Do you truly understand an industry’s workflow, risks, exceptions, and responsibility boundaries? General models can learn concepts, but they may not replace hard-earned domain execution experience.

Second, a data flywheel. Can you collect high-quality feedback from real usage and improve workflows, evaluation, fine-tuning, and product decisions?

Third, user trust. Will users hand you higher-value, longer-running, riskier work, or only treat you as a one-off tool?

If a platform or base model absorbs a capability, the products that still retain process, feedback, responsibility boundaries, and trust are more likely to survive. Many others are temporary bubbles.

Final note

From 2022 to 2026, AI Agent evolution was not “models getting better at chatting.” It was “humans becoming willing to hand more work to AI.”

A mature Agent is not the system most eager to execute automatically. It is the system that knows when to execute, when to verify, when to pause, and when to ask a human.

To judge whether an Agent product has long-term value, ask one question: when the next model or platform builds this capability in, what remains?

If the answer is domain workflow, real data, verifiable results, and user trust, there may be long-term value.

Gemini 3.5 Pro Leaks: Google Wants Spark Agent to Win Back the AI Coding Entry Point

Fri, 15 May 2026 23:45:34 +0800

Gemini 3.5 Pro has not been officially released yet, but leaks around it are already heating up.

The current round of information revolves around several keywords: Gemini 3.5 Pro, the codename Cappuccino, Gemini Spark, AI coding, and MCP tool integration. Together, they point in one direction: Google is not just preparing another chat model update. It wants to reconnect models, tools, Agents, and Google ecosystem entry points.

Before an official release, all of this should still be treated as leaked information. The more important signal is not one screenshot or one benchmark claim, but the gaps Google may be trying to close next.

Why Gemini 3.5 Pro Matters

Based on the exposed information, Gemini 3.5 Pro may be a jump in naming.

People were still discussing Gemini 3.2 earlier, and then Gemini 3.5 Pro appeared in leaks. If the naming is real, Google likely wants to tell a bigger version story in the next release rather than ship a routine minor update.

The leaked highlights mainly fall into three areas:

continued improvements in coding and reasoning;
stronger SVG, interactive page, animation, and 3D generation;
a new Agent product, Gemini Spark, potentially moving to the front stage.

None of these directions is surprising. Gemini has long emphasized multimodality, and Google has very strong distribution channels. The real question is whether it can catch up with OpenAI and Anthropic in developer tools and Agent workflows.

Coding Is The Lesson Google Most Needs To Catch Up On

In 2026, coding is no longer just a model benchmark item. It has become one of the most direct product entry points.

The reason is simple: AI coding tools are used frequently and generate a large amount of feedback data. Developers ask models to read code, modify code, run tests, and fix bugs every day. These interactions naturally push the next generation of models and tooling forward.

Over the past year, Claude Code has gained strong mindshare among developers, while OpenAI has kept strengthening the connection between Codex and ChatGPT. Google has products such as Antigravity, but its external presence has not been as strong.

That is why Gemini 3.5 Pro is being watched closely. If it only becomes better at chatting or answering faster, the impact is limited. If it truly improves code understanding, cross-file editing, tool calling, and long-running task execution, it may change developer workflows.

Gemini Spark May Be The Bigger Variable

More aggressive than the model itself is the rumored Gemini Spark.

According to the leaks, Spark is not positioned as a normal chat assistant, but as an always-on AI Agent. It may connect to email, calendars, web pages, tasks, account state, and personal context to help users handle multi-step workflows.

This kind of product has a large imagination space. For example:

automatically organizing an inbox;
following up on tasks for the user;
performing actions on web pages;
handling cross-application workflows;
arranging daily matters based on personal preferences.

But the risks are just as obvious. If an always-on Agent can access login state, browser data, files, location, and third-party services, it must answer several questions: when must the user confirm an action? Which operations must be blocked from automation? Will data be shared with third parties? How are remote browsers and credentials isolated?

So the real question for Spark is not just whether it can get work done. It is whether Google can make permissions, auditing, confirmation flows, and user control clear enough.

What MCP Tool Integration Suggests

The leaks also mention that the new Gemini selector may include MCP-related models or testing entries.

If this ships, it suggests Google is also pushing models from a question-answering system toward a tool operating system. The model will no longer only generate text. It will need to call external tools, access business systems, read and write files, run commands, and maintain task state across multiple steps.

This direction is consistent with OpenAI and Anthropic. Whoever makes tool calling more reliable will have an easier time embedding AI into real workflows.

But MCP integration itself is not the finish line. The hard part is stability:

can the model choose the right tool;
are the parameters reliable;
can it recover after failure;
are permission boundaries clear;
can users trace every step.

If these questions are not solved, more tools also mean a larger surface for mistakes.

Multimodality Is Still Google’s Strong Card

The place where Google has the best chance to differentiate is still multimodality.

Based on exposed SVG, interactive page, animation, and visual generation examples, Gemini may continue to strengthen its ability to generate interactive content from prompts. Compared with simply writing a piece of code, this is closer to product prototyping: the user describes an idea, and the model directly produces an operable, adjustable, previewable interface.

This path fits Google well. It can build on Gemini’s multimodal strengths and also connect with Android, Chrome, Workspace, Search, Ads, and Cloud.

If Google wants to avoid competing only on “whose coding model is stronger”, it may put more emphasis on a more complete multimodal Agent system.

The Three Companies Are Splitting Into Different Playbooks

The current model race is no longer just a leaderboard race.

OpenAI’s advantage lies in product iteration and distribution speed. Codex, ChatGPT, enterprise tools, and APIs are becoming more tightly connected.

Anthropic’s advantage lies in developer mindshare and code model quality. Claude Code has already become the default AI coding entry point for many people.

Google’s advantage is ecosystem access. Gmail, Docs, Chrome, Android, Search, YouTube, Maps, and Cloud services form a huge personal and enterprise data network. If Agents can safely connect to these entry points, Google may move from a “model chaser” to a “workflow entry point controller”.

That is why Gemini Spark is worth watching. It does not necessarily need to rank first on every benchmark. If it enters daily workflows, it may still build its own moat.

How Regular Users Should Read This

For regular users, there is no need to be pulled around by every leak in the short term.

The more practical things to watch are:

Whether Gemini 3.5 Pro’s coding ability truly improves, especially in complex repositories, long context, and tool calling.
Whether Gemini Spark is safe by default, with clear confirmation and traceable records before sensitive operations.
Whether Google gives clear pricing, quotas, and enterprise permission management, rather than only showing demos.

Pretty screenshots alone do not mean much. Whether it can reliably enter real workflows is the dividing line for this round of AI Agent products.

What It Means For Developers

Developers should care less about “which model won” and more about whether their workflow is portable.

Claude Code, Codex, Gemini, Antigravity, Cursor, Windsurf, and many other tools are all competing for the entry point. If every process is locked into one platform, future changes in cost, quota, model policy, or permission rules will make migration painful.

A safer approach is:

keep standard Git workflows for important projects;
always inspect diffs after automated edits;
use tests and CI as backstops for key tasks;
do not hand production credentials to opaque Agents;
when open protocols can connect tools, prefer replaceable options.

Models will keep getting stronger, but engineering discipline will not become obsolete.

Summary

The Gemini 3.5 Pro leaks suggest that Google is accelerating its effort to catch up in AI coding and Agent entry points. Model improvements are only one part of the story; always-on Agents such as Gemini Spark may be the larger strategic move.

But the more a system can “do things automatically” for users, the more it needs strict permission boundaries and verifiable workflows. For Google, the real challenge is not only catching up with GPT-5.5 or Claude. It is combining strong models, safety mechanisms, and ecosystem entry points into a trustworthy daily workflow.

If Google pulls that off, Gemini may not need to top every leaderboard to regain some initiative in AI entry points.

OpenHuman Quick Read: The Desktop Route for an Open-Source Personal AI Agent

Fri, 15 May 2026 14:52:31 +0800

OpenHuman is an open-source personal AI Agent project from tinyhumansai. Its goal is not to build yet another chat window, but to place a desktop app, personal memory, third-party integrations, voice, coding tools, and a local knowledge base into the same agent harness, so AI can understand your daily work context faster.

The project README positions it as “Personal AI super intelligence,” and the official site emphasizes private, simple, and extremely powerful. That claim is ambitious, but it is more useful to break it down: the part of OpenHuman that deserves attention is its attempt to make “personal context” the product core, instead of leaving model calls, plugin configuration, and document retrieval for users to assemble themselves.

At the time this article was checked, the GitHub repository had about 7.8k stars and 629 forks. The latest release was OpenHuman v0.53.43, dated May 13, 2026. The project is still in Early Beta, and the README clearly warns that it is under active development, so rough edges should be expected.

What Problem Is It Trying to Solve?

The problem with many AI assistants is not that the model is too weak, but that the context is too cold. Every time, you have to explain the project background, recent emails, calendar, code repositories, documents, tasks, and preferences again. Once you move across Gmail, Notion, GitHub, Slack, Calendar, Drive, Linear, Jira, and similar systems, the information is scattered across different tools.

OpenHuman’s approach is to connect those data sources first, then use automatic fetching, compression, summarization, and a local knowledge base to build a personal memory layer that can keep updating. The agent then remembers more than the current conversation; it can form long-term context around your workflow.

This is also the biggest difference between it and a normal chatbot. Chatbots often work around prompts; OpenHuman is closer to a desktop personal operating-system entry point, trying to prepackage connectors, memory, tools, and model routing.

Main Capabilities

Core capabilities listed in the OpenHuman README include:

A desktop-first UI and a short onboarding path, without requiring users to start from terminal configuration.
A desktop mascot with a “face” that can speak, respond to the environment, and participate in Google Meet.
118+ third-party integrations covering Gmail, Notion, GitHub, Slack, Stripe, Calendar, Drive, Linear, Jira, and other tools.
An automatic fetching mechanism: the project description mentions traversing active connections every 20 minutes and pulling new data into the memory tree.
Memory Tree: compresses connected data and activity information into Markdown blocks and stores them in local SQLite.
Obsidian-compatible vault: writes knowledge blocks as .md files so users can open, browse, and edit them with Obsidian.
Built-in search, web scraping, coding tools, file system access, git, lint, test, grep, voice input and output, and other capabilities.
Model routing: routes requests to different model types according to the task.
TokenJuice: compresses token usage before tool results, web pages, email bodies, and search results enter the LLM.
Optional Ollama support for local AI workloads.

These capabilities sound broad, but the real focus can be reduced to two points: reducing configuration and plugin assembly, and turning your personal data into memory that an agent can search, compress, and continuously update.

Installation

The project provides a website download entry point and terminal installation commands.

macOS or Linux x64:

`1`	`curl -fsSL https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.sh \| bash`

Windows:

`1`	`irm https://raw.githubusercontent.com/tinyhumansai/openhuman/main/scripts/install.ps1 \| iex`

If this is your daily primary machine, it is better to download the installer from the official site first, or at least open and inspect the install script before deciding whether to execute a remote script directly. OpenHuman touches email, documents, code repositories, calendars, and local file permissions, so installation and authorization deserve more caution than a small ordinary utility.

Open Source and Technical Stack

The OpenHuman repository uses the GPL-3.0 license. The language breakdown shows Rust as the main language, followed by TypeScript, with JavaScript, Shell, CSS, and PowerShell also present. The README’s contribution notes require Node.js 24+, pnpm 10.10.0, Rust 1.93.0, CMake, and platform-specific desktop build dependencies.

The rough local development path is:

git submodule update --init --recursive
pnpm install
pnpm dev
pnpm --filter openhuman-app dev:app

Before submitting changes, focused checks are recommended, for example:

1
2
3

pnpm typecheck
pnpm format:check
cargo check -p openhuman --lib

Judging from the repository structure, this is not a lightweight script project. It is a full product-style repository containing a desktop app, frontend, Rust backend, docs, tests, examples, and build scripts.

Why Memory Tree and the Obsidian Vault Matter

The concept most worth examining in OpenHuman is Memory Tree. The README says it standardizes connected data into Markdown chunks of up to about 3k tokens, scores them, folds them into a hierarchical summary tree, and stores them in local SQLite. The same content also enters an Obsidian-compatible vault.

This route has several advantages:

Users can directly see the agent’s knowledge base instead of only trusting black-box memory.
Markdown files are convenient for search, backup, version control, and manual revision.
SQLite is suitable for local indexing and fast queries.
Hierarchical summaries are better suited to long-term context compression than a flat pile of documents.

But it also has practical challenges: whether data sync is stable, whether summaries drop key details, whether permission boundaries are clear enough, whether deletion and undo are complete, and whether different connectors’ semantics can be handled consistently. These are not solved by one README phrase like “remembers everything”; they require long-term use and auditing.

TokenJuice: A Middle Layer for Cost and Latency

OpenHuman also emphasizes TokenJuice. Its role is to compress web pages, emails, search results, and tool-call results before they enter the model. Examples include converting HTML to Markdown, shortening long URLs, and removing some unnecessary characters. The README claims this can reduce cost and latency, with up to 80% lower token usage.

The direction is reasonable. In agent systems, the truly expensive part is often not one chat turn, but background fetching, tool calls, search, web parsing, and long-context injection. Cleaning data before handing it to the model is usually steadier than directly stuffing raw content into context.

However, a compression layer also creates new questions: it decides which information is kept and which is discarded. If you use it for contracts, bills, medical records, compliance material, or production incident logs, you cannot look only at token savings. You also need traceability, original-text review, and compression-error control.

Privacy: A Selling Point and an Audit Focus

One of OpenHuman’s selling points is privacy. The official site mentions that local AI models can handle low-level tasks, and the README emphasizes that workflow data stays on device, is encrypted locally, and is treated as yours.

This design direction is attractive because once a personal AI Agent connects to Gmail, Drive, Calendar, Slack, and GitHub, it touches the most sensitive work data. Compared with a fully cloud-based assistant, a local-first memory layer and a visible Markdown vault at least give users more sense of control.

But the full picture matters: OpenHuman also mentions one subscription, 30+ providers, model routing, ElevenLabs TTS, OAuth integrations, and other capabilities. That means it is not a purely offline tool. To evaluate privacy seriously, you need to check what each connector, each kind of model call, and each voice or search capability sends, and where it sends it.

Who Should Pay Attention?

OpenHuman is currently more suitable for three groups:

Users who want a personal AI control desk rather than a single-purpose chatbot.
Developers willing to try an Early Beta and accept changing features and rough edges.
People interested in local memory, Obsidian workflows, agent connectors, and context compression.

If you only want a stable, lightweight offline assistant with very simple privacy boundaries, it may be too heavy right now. If you want to study how the next generation of personal AI Agents might integrate desktop apps, connectors, memory, and tools, OpenHuman is an open-source sample worth tracking.

My suggestion is to first treat it as a “product-style open-source experiment”: watch release cadence, issue quality, connector permissions, data export capability, deletion mechanisms, and readability of the local vault. The key question for personal AI is not only whether it can answer questions, but whether it can carry your context for the long term in a transparent and controllable way.

References

What is Token Efficiency? DeepSeek V4, big-model planning, and small-model execution

Fri, 15 May 2026 08:59:33 +0800

The next important metric for AI coding may not be who has the strongest model, but who can complete more verifiable work with fewer tokens, lower cost, and a more stable process.

That is the value of Token Efficiency.

Many people hear Token Efficiency and think only about cheaper models, longer context, or cheaper cache hits. Those are base conditions. Real productivity comes from model division of labor, task orchestration, context budgeting, and evaluation.

In other words, Token Efficiency is not a cost-saving trick. It is an engineering method for turning tokens into output.

DeepSeek V4: productizing the split between planner and executor

The missing background in this topic is the positioning of DeepSeek V4.

DeepSeek V4 is not just another stronger model. It splits the two capabilities needed for Token Efficiency into V4 Pro and V4 Flash: V4 Pro is better suited for planning, reasoning, architecture judgment, and critical review, while V4 Flash fits high-frequency execution, batch rewriting, code completion, data organization, and ordinary agent-loop nodes.

That maps directly to two roles in AI coding:

V4 Pro: planner / consultant for requirement breakdown, technical design, complex bug analysis, architecture review, and final acceptance.
V4 Flash: executor for file scanning, simple implementation, test completion, documentation, candidate generation, and repetitive work.

DeepSeek’s API documentation shows that both V4 Flash and V4 Pro support 1M context, JSON Output, Tool Calls, Chat Prefix Completion, and FIM Completion. The pricing page also prices cache-hit input separately and notes that input cache-hit prices have been reduced to one tenth of the launch price.

Together, these are why it matters for Token Efficiency: 1M context reduces compression in complex agent tasks; low cache-hit pricing lowers the cost of repeatedly loading prompts, project docs, code, and history; the Flash / Pro split solves the problem of using a flagship model for every step or an unstable small model for every step.

DeepSeek V4 should therefore be understood in three ways:

Cheap execution layer: many agent nodes can run on V4 Flash.
Usable judgment layer: key steps can still call V4 Pro.
Long-chain friendly: 1M context and cache pricing make codebases, docs, and tool history easier to keep in the usable window.

Its significance for AI coding is not just another model option. It offers a realistic cost structure for the “consultant model + executor model + harness orchestration” pattern.

Do not let the strongest model do everything

The old approach was to pick the smartest model and let it handle requirement analysis, code, tests, and summaries end to end.

That is simple but not always efficient. Many tasks do not need frontier reasoning. Expensive models should behave more like consultants, architects, or planners that appear only at key decision points.

A better structure is:

Big models break down problems and make key decisions.
Small models execute, batch-process, and repeat edits.
Tools and harnesses manage process, state, context, and validation.
Humans define product goals, accept results, and make tradeoffs.

This prevents frontier reasoning from being wasted on mechanical execution.

Context is not always better when larger

Long context matters for coding agents because code, docs, chat history, test output, and logs all consume the window. When the window fills up, compression, forgetting, and misjudgment appear.

But long context does not mean dumping everything into the model.

Token Efficiency means each task should fit inside a clear, controlled context window:

Bring only necessary files.
Include only decision-relevant documents.
Keep only the current state from history.
Give each node clear input and output.
Compress completed work into structured summaries for the next node.

Cheap context can tempt people to include noise. Noise does not make a model smarter.

Harness matters more than a single model

Connecting Claude Code, Codex, or another coding agent to a cheap model is not enough. Small models drift in long-chain tasks unless a stronger process controls them.

A harness is a scheduling system. It decides how to split tasks, run nodes, choose models, validate results, retry failures, and pass context.

A useful orchestration system should answer:

Which tasks need planning?
Which tasks can execute directly?
Which nodes can run in parallel?
Which nodes must be serial?
Which nodes use big models or small models?
What is the context budget for each node?
What structured output does each node produce?
Who reviews and decides whether to continue?

Without this software layer, small models are merely cheap. With it, they can become leverage.

Split tasks with DAGs

A good approach is to split complex work into a directed acyclic graph.

A feature task might become:

Requirement clarification
Technical design
Task decomposition
Implementation
Test completion
Code Review
Fixes
PR submission

Each node can be an independent agent with its own role, prompt, tools, permissions, and output format. Nodes should pass structured results, not long chat transcripts.

This makes each node shorter, easier for small models, and easier to measure.

Run multiple task replicas

When tokens are cheap enough, the same task does not have to run only once.

You can run the same task with different models, prompts, or orchestrations, then pick the best result or merge useful parts. This is suitable for design proposals, copy, test cases, bug hypotheses, refactor options, and code review.

It is not suitable for tasks with external side effects, shared mutable state, or unclear acceptance criteria.

The goal is not gambling. It is collecting comparable samples that can improve orchestration, model selection, and node skills.

Build an evaluation system

Token Efficiency cannot be judged only by price. A cheap model with a high failure rate can consume more human time and become more expensive.

Start recording:

Completion rate
Human interventions
Tool-call failure rate
Test pass rate
Review findings
Token cost per task
Time per task
Rework count
Differences between model combinations

With this data, you can decide which tasks fit small models, which require big models, and which should stay human-led.

Make business workflows atomic

Most users do not need to build a full harness today. But they can start decomposing their business workflow into atomic nodes.

Content production can become topic selection, research, outline, draft, fact check, style rewrite, SEO title, translation, and publishing check.

Software development can become requirement confirmation, technical design, data structure, API change, unit tests, implementation, migration script, documentation, and review.

Each node should have clear input, output, acceptance, and context limits. When harness tools mature, these workflows can plug in directly.

Hardware is not the first priority

Many discussions of Token Efficiency jump to local deployment and GPUs. For most people, API should still be the first choice.

Before the economic model works, local hardware is only prepaid cost. A safer sequence is:

Use API to validate the workflow.
Record task evaluation and cost.
Find stable high-frequency execution nodes.
Consider which nodes should be localized.
Then calculate hardware, power, maintenance, and depreciation.

For personal productivity, API is often enough. For startups exploring inference frameworks and model boundaries, local CUDA platforms can be useful. For production workloads with clear unit economics, multi-GPU deployment becomes worth discussing.

Summary

Token Efficiency is not replacing expensive models with cheap ones. It is redesigning the AI workflow.

Big models make key judgments, small models execute in bulk, the harness schedules and validates, and humans define goals and acceptance. Only when these layers work together can tokens reliably become productivity.

Models will get cheaper, context windows will grow, and small models will improve. The future gap may not be who calls the strongest model, but who can use the same tokens to produce more real output.

Superpowers: a skills framework that pulls coding agents back into engineering process

Fri, 15 May 2026 08:53:17 +0800

obra/superpowers is both a skills framework for coding agents and a software development methodology. Its goal is not to add another universal prompt, but to make agents follow a process: clarify goals, produce a design, write a plan, implement through TDD, then review and finish.

Project: https://github.com/obra/superpowers

At the time of writing, the GitHub API shows more than 190,000 stars, an MIT license, and recent activity. The README describes it plainly: An agentic skills framework & software development methodology that works.

What problem it solves

Many AI coding tools are not weak at writing code; they are too eager to write code.

A user says something vague, the agent edits files, and the result looks finished while boundaries, tests, and architecture remain unclear. Small tasks may survive this. Complex projects turn it into rework and technical debt.

Superpowers makes the agent enter a workflow before touching code:

When the user wants to build something, ask about the goal first.
Turn the conversation into a spec and confirm it in sections.
After design approval, write an implementation plan.
After the user says “go”, begin implementation.
During implementation, emphasize TDD, YAGNI, DRY, and code review.

This is not new software engineering. It is important because fast agents need stronger guardrails.

Supported tools

Superpowers is not tied to a single agent. The README lists installation paths for Claude Code, Codex CLI, Codex App, Factory Droid, Gemini CLI, OpenCode, Cursor, and GitHub Copilot CLI.

That makes it more like a workflow layer across harnesses than a model-specific trick.

The base workflow

The base workflow has several stages.

First is brainstorming. Before implementation, the agent turns rough ideas into an executable design and confirms it with the user.

Second is using-git-worktrees. After design approval, it creates an isolated worktree and branch, then checks that install and test baselines are clean.

Third is writing-plans. It decomposes design into small tasks with paths, code scopes, and validation steps. The plan should be clear enough for someone without context to execute.

Fourth is execution. subagent-driven-development can dispatch tasks to subagents, while executing-plans runs them in batches. Each task should be reviewable and verifiable.

Fifth is test-driven-development: true RED-GREEN-REFACTOR. Write a failing test, confirm failure, implement minimally, confirm pass, refactor.

Sixth is requesting-code-review. Reviews happen between tasks; critical findings block progress.

Finally, finishing-a-development-branch validates tests and offers choices such as merge, PR, keep, or discard the worktree.

What is in the skills library

The skills library can be grouped by purpose.

Testing centers on test-driven-development.

Debugging includes systematic-debugging and verification-before-completion. They focus on reproduction, minimization, hypotheses, validation, and not claiming completion before verification.

Collaboration skills include:

brainstorming
writing-plans
executing-plans
dispatching-parallel-agents
requesting-code-review
receiving-code-review
using-git-worktrees
finishing-a-development-branch
subagent-driven-development

Meta skills include writing-skills and using-superpowers.

Together they give the agent engineering habits: when to ask, when to plan, when to test, and when to stop for review.

How it differs from a prompt

A normal prompt often piles rules into one system message: do not over-edit, think first, test, explain, be concise. As rules accumulate, complex tasks make the model forget or ignore some of them.

Superpowers splits rules into phase-specific workflow modules. Each skill is shorter and focused. The agent knows the current phase, complex processes become checkable, and teams can turn their own practices into reusable skills.

The lesson is not just “use a smarter model”. Give the model a repeatable way to work.

Who should use it

Superpowers is most useful for developers already using coding agents on real projects, especially when:

The task spans multiple files.
The agent should design before implementation.
TDD or validation matters.
Multiple branches or worktrees are common.
Subagents can help with implementation or review.
A team wants to encode its workflow as skills.

For a one-line config change, it may feel heavy. For multi-step development, the constraints are valuable.

Notes before using it

Do not treat it as full autopilot. It gives the agent process, but humans still own requirements, tradeoffs, and final acceptance.

TDD and review add upfront cost. For small tasks they may slow things down; for complex tasks they reduce rework.

Parallel subagents are not always better. They work when boundaries and write scopes are clear. If the requirement is still fuzzy, parallelism only multiplies confusion.

Teams must maintain skill quality. Outdated processes, vague instructions, and conflicting rules can also hurt agents.

Summary

Superpowers is valuable because it pulls coding agents away from “receive request, edit code” and back into software engineering process.

AI coding often lacks not generation speed, but clarification, planning, verification, review, and closure. The stronger the model becomes, the less these steps should be skipped.

If you use Codex, Claude Code, Cursor, or Gemini CLI on real projects, Superpowers is worth studying. Even if you do not install it, its skill decomposition is a good reference for designing your own agent workflow.

Codex /goal vs Claude Code /goal: running long tasks until they are done

Thu, 14 May 2026 22:25:31 +0800

/goal is becoming an important command in AI coding tools.

It is not about making the model write a few more lines of code. It solves a more practical problem: when a task has clear completion conditions, can the agent keep going until those conditions are met, instead of stopping after every turn and waiting for the user to say “continue”?

Codex CLI has already added an experimental /goal command in its official docs. Claude Code has also published its own /goal documentation, describing it as an automation capability that can keep working across multiple turns. The names are the same, but the product direction is not exactly the same.

What problem does `/goal` solve?

Ordinary AI coding conversations usually work as a one-turn-at-a-time loop:

The user describes a task.
The agent analyzes, edits code, and runs tests.
The agent reports the result.
The user decides what to do next.

That workflow is fine for short tasks. But for migrations, refactors, test fixes, or issue backlog cleanup, it gets fragmented. The agent may move forward a little, then stop and wait for you to type “continue”.

/goal changes the question from “what should you do next?” to “what final state counts as done?” For example:

`1`	`/goal 完成登录模块迁移，所有 auth 测试通过，lint 无报错`

This kind of target naturally fits long tasks because it has a clear endpoint: tests pass, the build succeeds, files are split, a queue is empty, or acceptance criteria are satisfied.

Codex `/goal`: experimental and attached to the current thread

OpenAI’s Codex CLI documentation marks /goal as experimental. It is not a stable default capability and requires features.goals to be enabled first.

There are two ways to enable it:

`1`	`/experimental`

Or add this to config.toml:

1
2

[features]
goals = true

Once enabled, you can use it like this:

`1`	`/goal Finish the migration and keep tests green`

Common commands include:

/goal
/goal pause
/goal resume
/goal clear

According to OpenAI’s docs, Codex attaches the goal to the current active thread and keeps tracking that target while a larger task continues.

One detail matters here: the official wording for Codex /goal is restrained. It emphasizes setting an experimental goal for long-running work and attaching the goal to the current thread, but it does not describe, in the same level of detail as Claude Code’s docs, an independent evaluator that automatically checks every turn and starts the next one. So for now, it is better to treat Codex /goal as an experimental long-task goal mechanism, not a fully stable unattended execution mode.

Claude Code `/goal`: multi-turn execution driven by completion conditions

Claude Code’s /goal documentation is more explicit: after the user sets a completion condition, Claude keeps working across turns until that condition is met.

Example:

`1`	`/goal all tests in test/auth pass and the lint step is clean`

Claude Code’s mechanism is roughly:

After the current turn finishes, control is not immediately returned to the user.
A small, fast model checks whether the goal condition has already been met.
If it has not been met, Claude automatically starts the next turn.
If it has been met, the goal is cleared automatically and the completion status is recorded in the transcript.

This makes Claude Code’s /goal more like “auto-continue until the completion condition is satisfied.” It does not merely pin a target to the conversation; it gives an independent evaluation step the decision of whether to continue.

Claude Code also supports checking status directly:

/goal

The status shows the goal condition, elapsed time, evaluated turn count, token usage, and the evaluator’s latest reason.

To stop early, use:

`1`	`/goal clear`

stop, off, reset, none, and cancel also work as clearing aliases. After a goal is enabled, if the session is interrupted and later resumed with --resume or --continue, an active goal can be restored. However, elapsed time, turn count, and token baselines are recalculated.

The biggest difference

Both Codex and Claude Code are pushing AI coding from single-turn answers toward long-running task execution, but their /goal commands have different positioning.

Comparison	Codex CLI `/goal`	Claude Code `/goal`
Status	experimental	documented on a dedicated official page
Enablement	requires `features.goals`	usable directly in a trusted workspace
Goal scope	current active thread	current session
Common operations	set / view / pause / resume / clear	set / view / clear
Automatic evaluation	docs emphasize attachment and tracking	docs explicitly describe evaluator checks after each turn
Auto-continuation	official wording is restrained	starts the next turn automatically when conditions are unmet
Best fit	keeping a long-term target in a Codex task	letting Claude Code keep moving toward completion conditions

In short, Codex /goal is closer to “attach an experimental long-term target to the current thread.” Claude Code /goal is closer to “set a verifiable stop condition for the current session and let it keep working until satisfied.”

How to write a good `/goal`

Whichever tool you use, /goal is not a good place for vague wishes.

Not a great goal:

`1`	`/goal 把项目优化一下`

A better goal:

`1`	`/goal 将 payment 模块迁移到新 API，npm test -- payment 退出码为 0，git diff 只包含 payment 相关文件`

A good goal usually includes three things:

A clear completed state.
An executable validation method.
Boundaries that must be respected.

If the goal is large, add a stop condition:

`1`	`/goal 修复 eslint 报错，npm run lint 退出码为 0；如果超过 20 轮仍未完成，停止并总结剩余问题`

This matters. The stronger /goal becomes, the more it needs boundaries. Otherwise, the agent may modify too many files, run too long, consume too many tokens, or keep pushing forward on a question that should have been paused for human input.

When `/goal` is a good fit

Good fits:

Test fixes: until specific tests pass.
Code migrations: until all call sites are updated and compilation succeeds.
Batch cleanup: until a class of lint or type errors is reduced to zero.
Documentation completion: until all specified modules have documentation.
Issue queue handling: until every issue under a tag is handled or clearly classified.

Poor fits:

The requirement itself is still unclear.
The task needs frequent product judgment.
It involves high-risk deletion, data migration, or permission changes.
Acceptance can only be judged subjectively.
The task spans many unrelated modules.

A practical rule: if you can write “which command to run, what result to see, and which files must not be touched,” it is a good candidate for /goal. If you can only write “make this better,” ordinary conversation, plan mode, or human review is still safer.

What this means for AI coding tools

/goal points to a clear direction: AI coding tools are moving from interactive assistants toward continuously executable work units.

In the past, using an agent often meant staying nearby. If it got stuck, you prompted it. If tests finished, you told it to continue. If errors appeared, you issued another command. /goal compresses that interaction into a completion condition and lets the agent decide what the next turn should do.

But this also raises the bar for users. Writing prompts is no longer just describing a task; it also means defining acceptance criteria, validation commands, modification boundaries, and stop rules. In other words, the user’s job shifts from “keep telling it to continue” to “define what done means.”

The fact that both Codex and Claude Code have reached /goal shows that long-running agents are no longer only for background tasks or cloud queues. Local terminal coding tools now also need stronger autonomous progress.

Summary

Codex CLI and Claude Code both have /goal, but at this stage they should not be treated as the same feature.

Codex /goal is still experimental, requires features.goals, and is better understood as a way to maintain a long-term target in the current Codex thread. Claude Code /goal more explicitly connects completion conditions with auto-continuation, using an independent evaluator to decide whether to keep going.

For everyday development, this kind of command is best for engineering tasks with clear acceptance criteria. It does not replace product judgment or code review, but it can reduce the repetitive “continue,” “run it again,” and “fix until tests pass” loop inside long tasks.

The real skill is not memorizing the command. It is learning how to write tasks as clear, verifiable, stoppable goals.

References

OpenAI Codex CLI Slash Commands: https://developers.openai.com/codex/cli/slash-commands
Claude Code Goal documentation: https://code.claude.com/docs/en/goal

Why DeepSeek Became the Cost-Saving Key in This Round of AI Coding Tools

Mon, 11 May 2026 04:59:00 +0800

In this round of AI coding tool competition, the surface battle is about model capability, plugin ecosystems, and agent automation. But once you actually use these tools, the first wall you hit is cost.

Claude Code, Codex, OpenClaw, and Superpowers are all useful, but they share one trait: once a task becomes complex, they eat tokens aggressively. They need to read the project, build a plan, call tools, summarize context, repeatedly check results, and sometimes launch multiple subtasks. The smarter the model and the more automated the workflow, the easier it is for the bill to quietly grow.

That is why DeepSeek has become important in this cycle. Not merely because it can write code, but because its long context and cache pricing happen to hit the most expensive part of AI coding tools.

Why Agent Tools Burn So Many Tokens

Traditional chat-style coding assistants usually work in question-and-answer mode. You ask how to write a function, and the model returns a code snippet. This still costs tokens, but it is relatively controllable.

Agent tools are different. They do not just answer questions. They enter the project like a temporary engineer:

scan directories and key files;
understand the requirement and existing architecture;
make a plan;
modify files;
run commands or tests;
keep fixing based on errors;
summarize what changed at the end.

During this process, the model repeatedly reads the same context. Project descriptions, code snippets, tool outputs, conversation history, plans, and error logs all get placed back into the context. Once the task is a little complex, hundreds of thousands of tokens can disappear quickly.

If you add more aggressive plugins, the cost becomes even more obvious. Some OpenCode or Claude Code enhancement tools may organize a whole agent team by default. You only wanted to change a small feature, but it may still start planning, review, execution, and retrospective steps. The task may look more “intelligent”, but the token count keeps climbing.

The Advantage of Superpowers Is On-Demand Activation

One advantage of tools like Superpowers is that they do not force a full agent workflow onto every task.

Most of the time, you can still let Claude Code, OpenCode, or Codex work in their normal mode. Only when you explicitly call a skill, such as brainstorming, planning, executing a plan, or doing a retrospective, does it enter a heavier automation flow.

That matters for cost.

AI coding should not use heavy artillery for every task. Changing one config line, checking one error, or writing a small script can be handled through ordinary conversation. Only complex refactors, cross-file changes, long-document processing, and multi-round validation deserve a full agent workflow.

The stronger the tool, the more you need to control when it triggers. Otherwise, more automation simply means more waste.

DeepSeek’s Key Advantage Is Cheap Cache Hits

One important reason DeepSeek fits these agent tools is its low cache-hit cost.

AI coding tasks contain a lot of repeated prefixes: project background, system prompts, tool instructions, file content, and earlier conversation turns often appear again in later requests. If the model service supports prompt caching, those repeated parts become much cheaper after a cache hit.

For many models, a cache hit is only somewhat cheaper than a miss, perhaps around one third of the original price. DeepSeek’s advantage is that the gap after a cache hit can be much larger. For long-context, multi-round agent workflows that repeatedly read the same project, this gap shows up directly on the bill.

In other words, DeepSeek is not necessarily the strongest answer on every single turn. But in scenarios with long tasks, many rounds, and repeated context reads, its cost structure is unusually suitable for AI coding.

Long Context Makes Claude Code More Useful

When Claude Code or similar tools are connected to DeepSeek V4, another clear advantage is long context.

AI coding tools fear insufficient context. Once context runs short, compression becomes frequent. Once compression becomes frequent, previously read details may be lost. The model may start forgetting the project structure, constraints, or why a certain file was changed, and quality declines afterward.

DeepSeek V4’s long-context capability makes it better suited for code repositories, document batch processing, subtitle translation, and site article cleanup. Especially when connected to tools like Claude Code or OpenClaw, the right configuration can delay context compression and preserve more project detail.

That is why some tasks feel “durable” when run on DeepSeek. It may not be dazzling at every step, but it can tolerate long-running, low-cost, repeated calls.

How to Split Work Between V4 Pro and V4 Flash

DeepSeek V4 Pro and V4 Flash should not be mixed casually.

For simple tasks, DeepSeek V4 Flash is usually a better fit. It is fast and cheap, and is often enough for:

subtitle translation;
document cleanup;
ordinary script generation;
small code edits;
lightweight OpenClaw tasks;
simple site content processing.

For complex tasks, consider DeepSeek V4 Pro:

large-scale refactoring;
multi-module code understanding;
complex reasoning;
long-chain agent tasks;
high-risk code changes;
engineering tasks that require stronger planning.

Many people want to attach the strongest model immediately, but that is often uneconomical. The practical way to use AI coding tools is to layer tasks: let the cheaper model handle a large amount of routine work, and reserve the expensive model for key decision points.

MiniMax, Doubao, and DeepSeek Occupy Different Positions

Among domestic models and plans, MiniMax, Doubao, Kimi, and DeepSeek each have their own place.

MiniMax’s advantage is generous quota, low price, and broad functionality. It may not be the smartest coding model, but it is cost-effective for translation, lightweight cleanup, and batch processing. For example, batch subtitle processing, format conversion, and simple proofreading are good fits for MiniMax-style plans.

Doubao’s advantage is a broader tool ecosystem: image, video, search, TTS, possible STT, and embedding can be connected together. It feels more like a comprehensive toolbox.

DeepSeek’s position is clearer: text, code, long context, and low-cost caching. It lacks a complete image generation, voice, and video ecosystem, and its weaknesses are obvious. But in AI coding and long-text agent workflows, its strengths are long enough to matter.

So this is not about one tool replacing another. It is about splitting the task and using each tool where it fits.

Saving Money Is Not Just Choosing a Cheap Model

Saving money in AI coding does not mean simply switching every request to the cheapest model.

The effective methods are:

Do not start a heavy agent for simple tasks.
Do not use Pro when Flash is enough.
Use cache as much as possible for long tasks.
Keep repeated context stable, so meaningless changes do not break cache hits.
Let a cheaper model draft and batch-process first, then use a stronger model for key reviews.
Tell the agent clearly not to repeat facts or summarize the same point again and again.

The last point matters more than it looks. AI tools are prone to verbosity, and verbosity is not only a reading problem; it is also a cost problem. Putting “describe each fact once and state each opinion once” into the prompt can improve both article quality and token consumption.

What AI Coding Workflows DeepSeek Fits Best

DeepSeek is best suited for:

reading long code repositories;
lightweight multi-file edits;
batch document cleanup;
batch subtitle translation;
Hugo article cleanup;
agent plan execution;
low-cost automation with lots of repeated context.

It is not the best fit for every task. If you need especially strong frontend taste, complex product judgment, or cross-modal creation, you may still need Claude, GPT, Gemini, Doubao, or other tools.

But whenever a task is long-text, long-context, repeated-call, and cost-sensitive, DeepSeek can easily become the first choice.

Summary

In this round of AI coding tools, DeepSeek’s value is not just that a domestic model can write code. Its real value is that it addresses the most practical pain point of agent tools: long tasks are too expensive.

Tools like Claude Code, OpenClaw, and Superpowers make the development process increasingly automated, but behind that automation are massive context reads and multi-round calls. Whoever can lower this part of the cost can make AI coding go from “fun once in a while” to “affordable every day”.

DeepSeek’s long context, low cache cost, and layered use of V4 Flash / V4 Pro put it in exactly that position.

The real cost-saving key in this cycle is not avoiding good models. It is combining good models, cheap models, cache, and agent workflows properly. Once you understand that bill, AI coding tools can become real productivity rather than a beautiful but expensive toy.

goose: An Open Source AI Agent with Desktop, CLI, and API

Fri, 08 May 2026 13:41:15 +0800

goose is an open source AI agent that runs on your own machine. It is not limited to code completion; it aims to cover code, research, writing, automation, data analysis, and other tasks. The README positions it as a desktop app, CLI, and API that can serve both normal users and custom workflows.

The project has moved from block/goose to the Agentic AI Foundation (AAIF) at the Linux Foundation. The current repository is:

`1`	`https://github.com/aaif-goose/goose`

goose is mainly written in Rust and TypeScript and uses the Apache-2.0 license. Its GitHub description says it is an open source, extensible AI agent that goes beyond code suggestions and can install, execute, edit, and test with any LLM.

What Problem It Solves

Many AI coding tools focus on suggestions or local code edits. goose takes a broader view: let an AI agent complete tasks directly on your machine.

It can be used for:

Code changes and tests.
Local automation.
Research and writing.
Data analysis.
Multi-step workflows.
Embedding through an API.
Tool extension through MCP.

If you only need IDE completion, a Copilot-style tool may be enough. goose is more useful when you want AI inside the local task execution chain.

Desktop, CLI, and API

goose has three entry points.

The desktop app supports macOS, Linux, and Windows. It is good for users who prefer a visual interface.

The CLI fits terminal workflows and local development automation.

The API lets other systems or internal tools embed goose as an agent runtime.

Personal users can start with the desktop app or CLI. Teams and workflow builders should also look at the API and custom distribution support.

Installation

The README recommends downloading the desktop app:

`1`	`https://goose-docs.ai/docs/getting-started/installation`

CLI install:

`1`	`curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh \| bash`

GitHub Releases provide builds for multiple platforms. The latest release checked here was v1.33.1, published on 2026-04-29, with macOS, Linux, Windows, deb, rpm, and Flatpak assets.

After installation, configure a provider from the official quickstart and test in a low-risk directory first. goose can execute local tasks, so avoid giving it broad permissions in a production repository from the start.

Providers

goose supports 15+ providers, including:

Anthropic
OpenAI
Google
Ollama
OpenRouter
Azure
Bedrock
other cloud or OpenAI-compatible providers

It can use API keys, and it can also use existing Claude, ChatGPT, or Gemini subscriptions through ACP.

ACP is important because many users already pay for subscriptions, but different tools cannot easily reuse them. goose uses ACP providers to bring those subscriptions into an agent workflow.

Provider policies change quickly. Check whether the access method is allowed, whether there are quotas, and whether it is suitable for company code or sensitive data.

MCP Extensions

goose supports Model Context Protocol extensions. The README mentions 70+ extensions.

MCP matters because an agent should not only chat and edit files. Through standard protocol servers, it can connect to documentation, databases, browsers, internal systems, search services, design tools, or project management tools.

For teams, MCP can become a safer integration layer: expose internal capabilities through explicit interfaces instead of letting the model touch every system directly.

Difference from a Coding Assistant

goose is not just a code completion tool. It is closer to a local agent runtime.

Common coding assistants focus on:

Code completion.
Code explanation.
Function generation.
Local editor edits.

goose emphasizes:

Local task execution.
Multi-step workflows.
Switchable providers.
Extensions.
Desktop and CLI.
Embeddable API.
Non-code tasks too.

This also means more complexity. You must think about model configuration, permissions, extensions, workspace scope, logs, and credentials.

Custom Distributions

The repository includes CUSTOM_DISTROS.md, which explains how to build a custom goose distribution with preconfigured providers, extensions, and branding.

This is useful for teams:

Preconfigure allowed model providers.
Connect internal MCP servers.
Set safety policies and logging.
Block disallowed external services.
Apply company branding and onboarding.

Members do not need to configure everything from scratch, and the risk of wrong provider or key setup is reduced.

Suggested Use

Start gradually:

Install the desktop app or CLI.
Configure one known-good provider.
Run simple tasks in a test directory.
Observe what it reads and executes.
Add MCP extensions.
Try larger repositories later.

Keep a few habits:

Commit important changes before agent work.
Do not store API keys in project files.
Use high-permission modes only in trusted workspaces.
Review company data and provider policy first.
Keep human review for automation results.

Who Should Use It

goose is a good fit if you want a desktop and CLI AI agent, multiple model providers, MCP integration, API embedding, or custom team distributions. It may be heavy if all you need is IDE code completion.

Summary

goose is an open source AI agent under AAIF/Linux Foundation. It provides desktop, CLI, and API entry points, supports 15+ providers, ACP subscription access, and 70+ MCP extensions.

Its value is not only writing code, but placing models, tools, extensions, and local execution into one agent framework. Start small, define permission and data boundaries, then expand usage.

References

24 Claude Code Tips: Plan Mode, Rewind, CLAUDE.md, Skills, Agents, and Plugins

Fri, 08 May 2026 08:54:14 +0800

Claude Code is not just a chat box. It is closer to a coding Agent that can enter a project directory, read and write files, run commands, and maintain context.

If you only throw a requirement at it and wait for code, problems appear quickly: unclear plans, repeated permission prompts, growing context, unsatisfactory output, no clear rollback path, and no persistent place for project rules.

Here is a set of common operations for developers getting started with Claude Code.

Start Inside the Project Directory

Claude Code works best when launched inside the project directory, not from a random terminal location.

Create a folder as the project directory, enter it, open a command line, and start Claude Code:

claude

When first entering a project, if Claude Code asks whether to trust the current folder, confirm before continuing. This lets it read files, create files, and run later operations around the current project.

A simple practice task is to ask it to create a photographer portfolio website. The task is visual enough to inspect, and it also lets you practice file generation, command execution, rewind, and later refactoring.

Use Plan Mode First

For more complex tasks, Claude Code may enter plan mode. Plan mode is meant to discuss requirements and break down steps before you approve execution.

After it writes a plan, you usually see options like:

Approve the plan and automatically allow future edit tools.
Approve the plan, but require manual approval for later edits.
Pause and continue discussing the plan with Claude Code.

If the task is clear, approve and continue. If it is not clear yet, ask it to refine the plan, such as page style, tech stack, directory structure, interactions, and acceptance criteria.

Plan mode reduces rework. If an Agent starts directly, it may quickly generate many files; if the direction is wrong, later changes can get messy.

Switch Modes With Shift + Tab

In Claude Code, Shift + Tab can switch between working modes. A common use is entering plan mode or switching into an auto-approve-edit mode.

Suggested habits:

New projects, new features, major changes: start in plan mode.
Small edits and clear fixes: execute directly.
Deletion, bulk replacement, dependency installation: keep manual approval.

In plan mode, Claude Code may ask project-detail questions. Use arrow keys to choose and Enter to confirm. After submitting feedback, it updates the plan.

Do Not Open All Permissions Blindly

When Claude Code runs commands, edits files, or starts programs, it may request permission.

Common choices include:

Allow only this time.
Allow this command type for the current session.
Reject or pause.

For local preview, dev server startup, or file inspection, approve as needed. But do not permanently use a mode that auto-approves all permissions just to save clicks.

Full automation is only suitable when the task is low-risk, clearly understood, and the project already has Git backups. For daily use, keep human approval for deletion, overwriting folders, dependency installation, networking, commits, and scripts.

Run Local Commands in Terminal Mode

Claude Code can enter a terminal-command mode to run local commands.

For example, after generating a page, you can open an HTML file with:

`1`	`start index.html`

start is a Windows command for opening a file, followed by the filename. This is faster than finding the file manually.

Terminal mode is useful for:

Opening generated pages.
Listing directory contents.
Starting local development servers.
Running tests or builds.

Still, be careful with high-risk commands such as recursive deletion, moving directories, bulk overwrites, and system environment changes.

Rewind When the Result Goes Wrong

If the page or code produced by Claude Code is not what you want, and each correction makes it worse, rewind early.

Rewind can return code or conversation to a previous point. Common options include:

Rewind both code and conversation.
Rewind only conversation.
Rewind only code.
Compress earlier content into a summary.
Cancel.

When the direction is clearly wrong, it is usually better to rewind both code and conversation. That returns context and files to a cleaner state together.

Note that Claude Code rewind usually only covers files it created or changed through built-in tools. Files created through external commands may not be fully rewindable. Important projects should still use Git.

Write Long Prompts in an Editor

Do not squeeze complex requirements into one input line.

If the system supports editing a long prompt in a text editor, open the editor, write the requirement clearly, save it, and then send it to Claude Code.

Long prompts should include:

The goal.
The tech stack.
What not to do.
Which files must be kept.
How to verify completion.
Page or feature acceptance criteria.

For example, if you want Claude Code to refactor a plain HTML page into a more modern stack, do not just say “refactor it.” Explain component structure, visual preservation, responsive layout, and ask it to run a build check.

Restore Sessions After Exit

If you need to quit Claude Code midway, exit normally. Later, return to the same project directory and start again:

claude

If previous records do not appear directly, use history-related commands to view and load recent sessions.

This is useful for continuing interrupted work. But do not treat session history as the only memory. Project rules, tech stack, common commands, and notes should live in project files.

Use CLAUDE.md for Project Rules

CLAUDE.md is an important memory file for Claude Code. It usually sits at the project root and tells Claude Code project rules, tech stack, directory structure, and collaboration constraints.

You can ask Claude Code to initialize it:

/init

CLAUDE.md is good for:

Project goals.
Tech stack.
Common start, test, and build commands.
Directory notes.
Code style.
Forbidden actions.
Commit and deployment rules.

During each conversation, Claude Code can use these rules as part of the context. Think of it as a project manual.

A simple test is to add a clear rule into CLAUDE.md, then ask Claude Code something. If its answer follows the rule, it has read the project memory.

Reference Files With @

Typing @ in the input box lets you select files or Agents and add them to the current context.

This is useful when you want Claude Code to:

Read a config file.
Modify a specific page.
Continue based on CLAUDE.md or another document.
Only inspect a specific file instead of guessing the whole project.

Compared with copying file contents into the input box, @ references are clearer and less error-prone.

View and Compress Context

After a long conversation, context grows. When it gets too long, the model may slow down or start ignoring earlier details.

Use:

`1`	`/context`

If context is long, compress history:

`1`	`/compact`

If the result is still poor, consider clearing the current context:

/clear

After clearing, Claude Code can still understand part of the project through files, CLAUDE.md, and the current directory, but it will not keep the full conversation history.

A practical habit: start a new chat after a task is done, write project rules into CLAUDE.md, and do not let temporary discussion grow forever in one chat.

Skills: Turn Repeated Work Into Instructions

Skills are reusable task instructions for Claude Code. They are not one-off prompts, but packaged workflows.

For example, if you often generate weekly reports, create a weekly-report Skill that defines:

Required input.
Output format.
Tone and structure.
What must be preserved.
What must not be invented.

Skills usually contain name, description, and detailed instructions. Once installed in the global Skills directory, Claude Code can recognize and load them for related tasks.

Good Skill candidates include:

Weekly reports.
Code review templates.
Document cleanup.
Image batch processing.
Fixed-format articles.
Project initialization flows.

If you repeatedly copy the same prompt, consider turning it into a Skill.

Agents: Delegate Subtasks to Independent Helpers

Agents are different from Skills.

A Skill is more like an instruction manual. An Agent is more like an independent helper that can work outside the main conversation and return results.

The value of Agents is context isolation. For code inspection, you can create a read-only Agent that only reads the project and outputs a report, without modifying files. This avoids polluting the main conversation and lowers risk.

When creating an Agent, consider:

Project-level or user-level Agent.
Whether Claude Code should generate the config.
Which tools are allowed.
Which model to use.
Whether memory should be saved.
Whether the Agent prompt is clear enough.

For code-audit Agents, give read-only permissions first. Let it output a report, then decide in the main conversation whether to change code.

Plugins: Package Skills, Agents, MCP, and Hooks

Plugins are more complete capability packages. They may include:

Skills
Agents
MCP
Hooks

Compared with installing one Skill, a plugin is better for a full capability set. For example, a frontend design plugin may package visual rules, layout habits, component preferences, and related Agents together.

When installing a plugin, you may choose:

Install to the user directory, effective for all projects.
Install to the project directory, shareable with the project.
Install to a local project directory, effective only on your computer.

Use the user directory for personal common capabilities, the project directory for team conventions, and local project install for temporary testing.

Plugins Can Improve Specific Tasks

For frontend page generation, plugins can be more stable than raw prompts.

For example, for “make a photographer portfolio website,” a plain prompt may generate an acceptable page. If you explicitly use a frontend design plugin, the structure, visual hierarchy, spacing, colors, and overall finish are often better.

This does not mean plugins replace human taste. A better workflow is to let the plugin generate a stronger first draft, then refine details manually.

A More Stable Claude Code Workflow

Putting these tips together gives a steadier workflow:

Start claude inside the project directory.
Discuss requirements in plan mode first.
Confirm tech stack and acceptance criteria before approving the plan.
Keep manual approval for high-risk actions.
Use terminal mode for local preview and tests.
Rewind early when the result goes off track.
Write project rules into CLAUDE.md.
Check and compress context during long chats.
Turn repeated workflows into Skills.
Delegate inspection, research, and analysis to read-only Agents.
Use plugins for domain-specific tasks.
Always keep Git checkpoints for important projects.

This is much more stable than simply sending one requirement and waiting for generation.

Summary

Claude Code efficiency does not come only from model capability. It also comes from workflow control.

Plan mode sets direction, permission approval controls risk, rewind reduces rework, CLAUDE.md stores project rules, /context, /compact, and /clear manage context, Skills reuse fixed workflows, Agents isolate complex subtasks, and plugins package complete capabilities.

The best way to use Claude Code is to let it move tasks forward inside clear boundaries, not to hand the entire project to it at once.

opencode, Claude Code, and Codex: What's the Difference? A Guide to Open Source AI Coding Tools

Fri, 08 May 2026 08:33:37 +0800

opencode is an open source AI Coding Agent from anomalyco. Its positioning is straightforward: give developers a programmable, extensible coding assistant in the terminal that can connect to multiple model providers.

If you compare it with Claude Code and Codex, all three solve the same broad problem: bringing AI into real codebases so it can understand context, edit files, run commands, and execute tests. But their product directions are different.

opencode emphasizes open source, multi-model support, and a terminal TUI. Claude Code emphasizes Anthropic’s model ecosystem and local engineering collaboration. Codex is OpenAI’s AI coding agent, available through the terminal, IDEs, the Codex app, and cloud tasks.

Who opencode Is For

opencode is a better fit for these kinds of developers:

People who want to complete code changes, project analysis, and engineering tasks in the terminal.
People who do not want their AI Coding Agent tied to a single model provider.
People who prefer open source tools and want to audit, extend, or build on top of them.
People already comfortable with Neovim, TUIs, and command-line workflows.
People who want to eventually drive the same coding agent remotely through a desktop app, mobile app, or other clients.

Its point is not to create another chat window, but to put AI coding capability inside the terminal and project directories developers already use.

Installation

The official README provides several installation methods.

# Direct install
curl -fsSL https://opencode.ai/install | bash

# npm
npm i -g opencode-ai@latest

# Windows
scoop install opencode
choco install opencode

# macOS and Linux
brew install anomalyco/tap/opencode
brew install opencode

# Arch Linux
sudo pacman -S opencode
paru -S opencode-bin

# Other methods
mise use -g opencode
nix run nixpkgs#opencode

The official README also recommends removing versions older than 0.1.x before installing to avoid problems caused by older remnants.

The installation script chooses the installation directory by priority:

$OPENCODE_INSTALL_DIR
$XDG_BIN_DIR
$HOME/bin
$HOME/.opencode/bin

If you need to specify a path, use:

1
2

OPENCODE_INSTALL_DIR=/usr/local/bin curl -fsSL https://opencode.ai/install | bash
XDG_BIN_DIR=$HOME/.local/bin curl -fsSL https://opencode.ai/install | bash

The Desktop App Is Still Beta

In addition to the command-line tool, opencode also provides a desktop app, currently marked as Beta. It can be downloaded from GitHub Releases or opencode.ai/download.

The desktop app covers these platforms:

Platform	File
macOS Apple Silicon	`opencode-desktop-mac-arm64.dmg`
macOS Intel	`opencode-desktop-mac-x64.dmg`
Windows	`opencode-desktop-windows-x64.exe`
Linux	`.deb`, `.rpm`, or `.AppImage`

macOS and Windows users can also install the desktop app through package managers.

# macOS
brew install --cask opencode-desktop

# Windows
scoop bucket add extras
scoop install extras/opencode-desktop

Two Built-In Agent Modes

opencode includes two built-in Agents, switchable with the Tab key.

build is the default mode. It has full development permissions and is suitable for editing code directly, running commands, and moving engineering tasks forward.

plan is read-only mode. It is better for analyzing unfamiliar codebases, understanding project structure, and planning changes. It denies file edits by default and asks before running bash commands.

opencode also includes a general subagent for complex searches and multi-step tasks. Users can invoke it by typing @general in a message.

This design is practical: use plan to understand the project before acting, then switch to build when code needs to change. For large repositories, separating read and write permissions helps reduce mistakes.

What Is Codex?

Codex is OpenAI’s AI coding agent for helping developers write code, review code, fix bugs, and ship engineering tasks.

Unlike a simple code completion tool, Codex is closer to an Agent that can operate on a codebase. It can pair with you in local tools, and it can also take delegated tasks in the cloud. OpenAI’s official materials describe Codex as available through multiple surfaces, including CLI, IDEs, the Codex app, and ChatGPT/Codex cloud workflows.

For developers, Codex has several important traits:

It can read codebases, edit files, run commands, and execute tests.
It supports multiple interfaces, including terminal, IDE, app, and cloud.
It fits bug fixing, feature work, refactoring, migrations, code review, and test generation.
It is more closely tied to OpenAI accounts, models, and the Codex product ecosystem.
Cloud tasks are useful for running multiple well-scoped engineering tasks in parallel.

If opencode is more like an open terminal agent framework, Codex is more like a full AI coding workbench from OpenAI: local pairing, cloud delegation, and longer engineering workflows for teams.

Core Differences

opencode, Claude Code, and Codex are all AI coding tools, but the choice becomes clearer if you look at these dimensions.

Tool	Core Positioning	Main Advantages	Best Fit
`opencode`	Open source AI Coding Agent	Open source, multi-model, TUI, client/server architecture	Developers who want an open toolchain, replaceable models, and a terminal-first workflow
`Claude Code`	Anthropic’s command-line coding tool	Claude model experience, code understanding, long context, engineering task collaboration	Developers already using the Claude/Anthropic ecosystem who want to work on local code tasks
`Codex`	OpenAI’s AI coding agent	CLI, IDE, Codex app, cloud tasks, multi-Agent workflows	Teams already using ChatGPT/OpenAI who want both local pairing and cloud delegation

In short, opencode is about openness and replaceability, Claude Code is about the Claude ecosystem and local engineering agents, and Codex is about the OpenAI ecosystem and multi-surface collaboration.

How It Differs From Claude Code

opencode’s official FAQ directly compares it with Claude Code. The two are similar in capability, but the main differences are these.

First, opencode is a 100% open source project, hosted on GitHub and released under the MIT license.

Second, opencode is not tied to a single model provider. It recommends models provided through OpenCode Zen, but it can also work with Claude, OpenAI, Google, or local models. For developers, this means that when model cost, capability, or availability changes, you are not locked into one platform.

Third, opencode includes optional LSP support. For code completion, navigation, diagnostics, and project understanding, LSP is a very important foundation.

Fourth, opencode emphasizes TUI. It is built by Neovim users and the creators of terminal.shop, so the product focus is clearly on the terminal experience.

Fifth, opencode uses a client/server architecture. That means opencode can run on your computer while being controlled in the future by a TUI, desktop app, mobile app, or other clients. The TUI is only one possible frontend.

When to Choose opencode, Claude Code, or Codex

If you already use Claude Code or Codex, opencode does not have to replace them immediately. A better way to think about it is that opencode provides an open, model-replaceable, terminal-first option.

Consider opencode first when:

You want your AI coding tool to be as open source as possible.
You do not want your workflow tied to one model provider.
You want to test Claude, OpenAI, Google, or local models with the same tool.
You like TUI workflows and do not want a desktop or web app to interrupt your main workflow.
You care about the remote-control potential of a client/server architecture.

Consider Claude Code first when:

You mainly use Claude models.
You care about long context, code understanding, and complex engineering task collaboration.
You want to keep moving edits, tests, and refactors forward in a local repository.
You trust Anthropic’s default Claude Code product experience.

Consider Codex first when:

You already use ChatGPT or the OpenAI account ecosystem.
You want one coding agent across terminal, IDE, desktop app, and cloud tasks.
You want to delegate well-scoped bug fixes, feature work, migrations, or test generation to the cloud in parallel.
You need code review, background tasks, team collaboration, and multi-Agent workflows.

If you care more about an official end-to-end experience, default model configuration, enterprise management, and ready-made integrations, Claude Code or Codex may be easier. If you care more about control, openness, and being provider-agnostic, opencode is worth watching.

Things to Note

opencode, Claude Code, and Codex are all moving quickly. GitHub releases, installation commands, desktop app file names, model availability, and plan access can all change. Before installing or choosing a tool, check the official README, documentation, and release pages.

Also, opencode’s desktop app is still marked as Beta, so it should not be treated as the default stable production tool. For everyday engineering tasks, the terminal version is still the main entry point.

From a tooling trend perspective, opencode represents the open-toolchain direction for AI Coding Agents: replaceable models, replaceable clients, and an open core agent capability. Codex and Claude Code are closer to model companies turning coding agents into complete product surfaces. For developers, both directions will likely coexist for a long time.

References

opencode GitHub: https://github.com/anomalyco/opencode
opencode official site: https://opencode.ai
opencode docs: https://opencode.ai/docs
opencode Releases: https://github.com/anomalyco/opencode/releases
OpenAI Codex: https://openai.com/codex/
Using Codex with your ChatGPT plan: https://help.openai.com/en/articles/11369540-codex-in-chatgpt
OpenAI Codex CLI Getting Started: https://help.openai.com/en/articles/11096431-openai-codex-ci-getting-started

Warp Open Source: From Terminal to Agentic Development Environment

Thu, 07 May 2026 20:15:08 +0800

warpdotdev/warp is the open-source client repository for Warp. Warp now describes itself as an “agentic development environment, born out of the terminal”: it starts from the terminal, but brings AI coding agents, codebase indexing, task management, and development workflows into one environment.

This is not an ordinary open-source terminal emulator repository. It is closer to an answer to a larger question: as agents such as Claude Code, Codex, and Gemini CLI become common, should the terminal itself become a development environment for scheduling, observing, and managing agents?

Warp’s answer is yes.

Current State of the Repository

As of May 7, 2026, warpdotdev/warp is a public repository. GitHub shows roughly 56k stars and 4.1k forks. The README says the Warp client code is now open source and welcomes community contributions.

The main language is Rust. GitHub’s language breakdown shows Rust at over 98%, which matches Warp’s positioning: it is not a web wrapper, but a cross-platform native development tool.

Several README details matter:

Warp is an agentic development environment, born out of the terminal.
It can use its built-in coding agent and can also connect to external CLI agents such as Claude Code, Codex, and Gemini CLI.
OpenAI is the founding sponsor of the newly open-sourced Warp repository.
The agentic management workflows in the repository are powered by GPT models.
Warp UI framework crates use the MIT license, while the rest of the code uses AGPL v3.

This shows that Warp’s open source move is not merely publishing a terminal. It is operating the project as an experiment ground for agent workflows.

Warp Is More Than a Terminal

Traditional terminals mainly do three things:

start a shell;
run commands;
display output.

Warp’s earlier differentiation was making the terminal feel more modern: command blocks, completion, history, collaboration, UI-style interactions, and cross-platform polish. Now the focus has moved further toward organizing development around AI agents.

From the README, Warp no longer only emphasizes “a better terminal.” It emphasizes:

built-in coding agents;
external CLI agent support;
issue triage;
spec writing;
PR review;
contributor coordination;
observable agent sessions.

In other words, Warp wants to turn the terminal from “where you type commands” into “where you work with multiple agents.”

Oz and Open-Source Project Management

The README mentions Oz several times.

Warp’s contribution overview shows thousands of Oz agents working on issue triage, specs, implementation, and PR review. This is interesting because it extends AI agents from “helping one person write code” to “helping manage open-source collaboration.”

The hardest part of many open-source projects is not writing code, but maintenance:

too many issues, not enough classification;
bugs and feature requests mixed together;
new contributors unsure which tasks are approachable;
PR review pressure;
maintainers struggling to follow every community thread.

Warp’s idea is to let agents take on part of the project management and collaboration work first. The README also mentions Oz for OSS, a maintainer-facing program for bringing similar agentic open-source management workflows to other repositories.

This suggests that Warp’s ambition is not only the terminal product itself, but also a new model of open-source maintenance in the AI era.

Repository Structure and Tech Stack

From the repository structure, Warp is a large Rust project.

The root contains:

app/: main application code.
crates/: core Rust crates.
assets/: resource files.
command-signatures-v2/: command signature related content.
docker/, script/, resources/, specs/, and other engineering directories.
.claude/, .warp/, .agents/skills, and other agent-related configuration.

WARP.md gives more engineering detail. It describes Warp as a Rust-based terminal emulator using an in-house UI framework called WarpUI.

The major modules can be roughly understood as:

app/: terminal emulation, shell management, AI integration, Drive, authentication, settings, workspace, and sessions.
crates/warp_core/: core utilities and platform abstraction.
crates/editor/: text editing functionality.
crates/warpui/ and crates/warpui_core/: the in-house UI framework.
crates/ipc/: inter-process communication.
crates/graphql/: GraphQL client and schema.

WARP.md also mentions architectural features such as:

an Entity-Handle system;
a modular workspace structure;
macOS, Windows, Linux, and WASM targets;
AI integration, including Agent Mode, context awareness, and codebase indexing;
Warp Drive cloud sync.

This complexity is closer to a full IDE than a lightweight traditional terminal.

Local Build Commands

The README gives a concise local build flow:

1
2
3

./script/bootstrap
./script/run
./script/presubmit

Where:

./script/bootstrap performs platform-specific initialization.
./script/run builds and runs Warp.
./script/presubmit runs formatting, clippy, tests, and other pre-submit checks.

WARP.md also lists more detailed commands:

cargo run
cargo bundle --bin warp
cargo nextest run --no-fail-fast --workspace --exclude command-signatures-v2
cargo fmt
cargo clippy --workspace --all-targets --all-features --tests -- -D warnings

If you want to contribute to Warp, ./script/presubmit is effectively required.

Contribution Flow

Warp’s contribution flow is not simply “open a PR.”

The README describes a lightweight process from issue to PR:

Search existing issues first.
If there is no duplicate, file a bug or feature request.
Maintainers review the issue and may add readiness labels.
ready-to-spec means the design can be expanded into a spec.
ready-to-implement means the design is clear enough to start an implementation PR.
Contributors can pick up labeled issues.

This process fits a large open-source project. It separates ideas, design, and implementation, reducing the risk that contributors spend time building in the wrong direction.

It also fits AI agents well. An agent can organize issues, draft specs, add tests, and then move into implementation. Warp itself uses this pattern to demonstrate agentic project management.

License: MIT + AGPL v3

Warp uses a dual license structure.

The README says:

the Warp UI framework, namely the warpui_core and warpui crates, uses the MIT license;
the rest of the repository uses AGPL v3.

This matters. AGPL v3 has stronger open-source requirements for network services and distribution. If you are learning, researching, or contributing, it is usually straightforward. But if you want to use Warp code in a commercial product or closed-source derivative, you need to read the license carefully and consult legal advice if necessary.

In short, Warp is open source, but not “take it and close-source it freely” open source.

Why It Is Worth Watching

First, Warp brings the terminal, agents, and project management together.

Many AI coding tools are still CLI tools or editor plugins. Warp starts from the terminal entry point and tries to unify agent tasks, code execution, command output, PR workflows, and team collaboration.

Second, Warp’s open-source approach is a good place to observe agent workflows.

It does not only publish code. It also exposes contribution overviews, agent sessions, issue triage, and spec workflows. For anyone studying how AI can participate in open-source collaboration, the repository itself is a sample.

Third, Warp is a complex Rust desktop application.

If you want to study Rust GUI, terminal emulation, cross-platform apps, GraphQL clients, cloud sync, and AI integration, the repository has a lot to read. But it is not a small project, so new contributors should read the docs and issue process first.

Fourth, Warp supports both a built-in agent and a “bring your own CLI agent” approach.

This is realistic. Developers will not use only one agent. Claude Code, Codex, Gemini CLI, OpenCode, OpenClaw, and similar tools are likely to coexist. If Warp can become a workbench for them, it becomes more valuable than a single-purpose terminal.

Who Should Care

If you are a normal terminal user, Warp matters because the terminal may be changing from a command-line tool into an AI workbench.

If you are a heavy AI coding agent user, Warp is worth watching because it tries to manage multiple agents rather than act as another chat entry point.

If you maintain open-source projects, the Oz for OSS direction is worth attention. It explores agent-based issue triage, PR review, community collaboration, and contributor onboarding.

If you are a Rust developer, Warp is a real large-scale desktop application worth studying for UI organization, terminal internals, cloud sync, AI integration, and cross-platform code.

If you only want a terminal that can replace your current one immediately, it is better to download the stable release first, then decide whether to study the source. Building from source is more suitable for contributors and deep users.

Short Take

The point of Warp going open source is not merely “a modern terminal became open source.”

More precisely, Warp is trying to upgrade the terminal into an agentic development environment: the terminal connects the shell, codebase, command execution, agents, issues, PRs, and collaboration flow.

As AI coding agents keep growing, the entry point of the development environment may change. In the past, the IDE dominated the developer experience while the terminal ran commands. Now the terminal may become the center of agent collaboration. The Warp repository is exploring that possibility.

GitHub repository: https://github.com/warpdotdev/warp
Warp website: https://www.warp.dev
Warp documentation: https://docs.warp.dev
Warp build overview: https://build.warp.dev
WARP.md: https://github.com/warpdotdev/warp/blob/master/WARP.md
CONTRIBUTING.md: https://github.com/warpdotdev/warp/blob/master/CONTRIBUTING.md

Hermes + Qwen3.6: A Low-Cost Local Agent Deployment

Mon, 04 May 2026 06:40:30 +0800

This article documents a local Agent deployment plan: run a Qwen3.6 GGUF model with llama.cpp inside WSL2, then connect Hermes Agent to the local OpenAI-compatible API. This gives you a long-running local AI assistant on your own computer, without paying by online service Token usage.

This setup is suitable for users who want to try local AI Agents while keeping data private and controllable over the long term. It can be used for daily Q&A, writing, coding assistance, document organization, and simple automation tasks. The larger the model, the higher the VRAM requirement. The original example uses Qwen3.6-27B, and 24GB VRAM is more stable. If your VRAM is smaller, choose a smaller model or a lower quantization.

Architecture

The overall chain is simple:

Install WSL2 and Ubuntu 24.04 on Windows.
Install CUDA Toolkit inside WSL2 and compile llama.cpp.
Download the Qwen3.6 GGUF model.
Start a local model service with llama-server.
Install Hermes Agent and configure it to http://localhost:8080/v1.
Optional: write a startup script so the model service starts automatically when WSL2 opens.

Hermes provides the Agent capability, while Qwen3.6 provides the local LLM capability. Together, they turn the computer into a private local AI assistant.

Install WSL2 and Ubuntu

Run in an administrator Windows PowerShell window:

1
2

wsl --install
wsl --set-default-version 2

After rebooting, install Ubuntu 24.04:

`1`	`wsl --install -d Ubuntu-24.04`

After installation, Ubuntu prompts you to set a username and password. Once inside Ubuntu, first check whether the NVIDIA GPU is visible in WSL2:

`1`	`nvidia-smi`

If the GPU cannot be detected, update the NVIDIA driver on Windows first. WSL2 inherits the Windows driver, but CUDA Toolkit still needs to be installed separately inside WSL2.

Install Python and Basic Tools

`1`	`sudo apt update && sudo apt install -y python3-pip python3-venv`

You also need build tools, Git, and CMake:

`1`	`sudo apt install -y cmake build-essential git`

Compile llama.cpp

Clone the repository:

1
2

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

If CUDA is already available in WSL2, compile directly:

1
2

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

CMAKE_CUDA_ARCHITECTURES=89 is suitable for Ada GPUs, such as RTX 40 series cards. Adjust it according to your actual GPU architecture.

If compilation reports that CUDA Toolkit is missing, install CUDA Toolkit inside WSL2 first:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-8

Configure environment variables:

export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

Then rebuild:

cd ~/llama.cpp
rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build -j$(nproc)

Download the Qwen3.6 GGUF Model

The example uses Qwen3.6-27B-UD-Q4_K_XL.gguf from unsloth/Qwen3.6-27B-GGUF:

1
2
3

hf download unsloth/Qwen3.6-27B-GGUF \
Qwen3.6-27B-UD-Q4_K_XL.gguf \
--local-dir ~/models/

The file is about 17GB. If Hugging Face is slow, use a mirror such as ModelScope. Do not force a 27B model if your VRAM is insufficient; use a smaller model or lower quantization.

Start the Local Model Service

Start llama-server with your own model file name:

~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--port 8080

After startup, open this in a Windows browser:

`1`	`http://localhost:8080`

For Hermes Agent or other OpenAI-compatible clients, the API endpoint is usually:

`1`	`http://localhost:8080/v1`

Thinking Mode Tradeoff

Qwen3.6 may enable Thinking mode by default. It is suitable for complex reasoning, complicated coding problems, and multi-step analysis, but it is slower.

To disable Thinking mode, stop the service and add --chat-template-kwargs:

~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--chat-template-kwargs '{"enable_thinking":false}' \
--port 8080

After disabling Thinking, simple Q&A, writing, code completion, and code explanation become faster. For complex algorithm design, difficult debugging, and architecture analysis, Thinking mode is still recommended.

Install Hermes Agent

Keep llama-server running, then open a new WSL2 terminal and install Hermes Agent:

`1`	`curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh \| bash`

The installer handles dependencies such as Python, Node.js, ripgrep, and ffmpeg. When configuring the model endpoint, choose a custom endpoint:

1
2
3

URL: http://localhost:8080/v1
API Key: 12345678
Model: auto-detect

For a local llama-server, the API Key can be any placeholder value. After configuration, you can connect Telegram, WeChat, QQ, Discord, and other chat tools, allowing Hermes Agent to call the local model and execute tasks from those entry points.

Auto-Start the Model Service

You can write a startup script so the model service starts automatically when a WSL2 terminal opens.

Create the script:

cat > ~/start-llm.sh << 'EOF'
#!/bin/bash
echo "Starting Qwen3.6-27B llama-server..."
~/llama.cpp/build/bin/llama-server \
--model ~/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 65536 \
--flash-attn on \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--presence-penalty 1.5 \
--port 8080 \
--host 0.0.0.0 &
echo "llama-server started, PID: $!"
echo "API: http://localhost:8080/v1"
echo "Chat UI: http://localhost:8080"
EOF
chmod +x ~/start-llm.sh

Write it into .bashrc:

echo '# Auto-start llama-server' >> ~/.bashrc
echo 'if ! pgrep -f "llama-server" > /dev/null 2>&1; then' >> ~/.bashrc
echo '    ~/start-llm.sh' >> ~/.bashrc
echo 'fi' >> ~/.bashrc

Each time you open a WSL2 terminal, it will start llama-server if it is not already running. If it is running, it skips startup and avoids duplicate processes.

Notes

27B models require substantial VRAM; 24GB VRAM is more stable. Use a smaller model if VRAM is limited.
--ctx-size 65536 significantly increases VRAM and RAM pressure. If unstable, reduce it to 32768 or lower.
Both CUDA Toolkit in WSL2 and the Windows GPU driver must work properly. Either side can cause CUDA compilation or runtime failures.
Hermes Agent calls the local service through an OpenAI-compatible API. The key is that http://localhost:8080/v1 responds correctly.
If accessing from a phone or another device, handle Windows Firewall, LAN addresses, and security isolation. Do not expose the local model service directly to the public internet.

Original article: Hermes + Qwen3.6：本地最强 Agent 组合！零成本、无限 Token，太香了！
llama.cpp: ggerganov/llama.cpp
Hermes Agent: NousResearch/hermes-agent
Qwen3.6 GGUF example: unsloth/Qwen3.6-27B-GGUF

How to Use DeepSeek V4 Pro in Cline

Fri, 01 May 2026 20:59:06 +0800

Cline already supports the OpenAI Compatible Provider. DeepSeek API is also compatible with OpenAI SDK-style calls, so connecting deepseek-v4-pro to Cline is not complicated: choose OpenAI Compatible, then fill in DeepSeek’s Base URL, API Key, and model name.

The steps below cover both the VS Code extension UI and Cline CLI.

Prepare a DeepSeek API Key

First, create an API Key on the DeepSeek platform.

You need three values:

Item	Value
Provider	`OpenAI Compatible`
Base URL	`https://api.deepseek.com`
Model ID	`deepseek-v4-pro`

DeepSeek’s official documentation states that the V4 series uses the existing OpenAI-compatible interface. Keep base_url as https://api.deepseek.com, and set model to deepseek-v4-pro or deepseek-v4-flash when calling it.

Configure It in the Cline Extension

If you use the Cline extension in VS Code, configure it this way:

Open Cline from the VS Code sidebar.
Go to Cline settings or model configuration.
Select OpenAI Compatible as the provider.
Enter your DeepSeek API Key.
Set Base URL to:

`1`	`https://api.deepseek.com`

Set Model ID to:

`1`	`deepseek-v4-pro`

Save the configuration and run a simple test in Cline.

Start with a low-risk read-only task:

`1`	`Please read the current project directory structure and summarize what type of project this is. Do not modify any files.`

If Cline can read and answer normally, the model connection is working.

Configure It in Cline CLI

If you use Cline CLI, run cline provider configure openai-compatible to enter interactive configuration.

Example:

`1`	`cline provider configure openai-compatible`

Fill in:

1
2
3

API Key: sk-...
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-pro

After configuration, test it with a read-only task:

`1`	`cline "Summarize this repository structure without changing files."`

If you want to lower cost first, you can temporarily change Model ID to:

`1`	`deepseek-v4-flash`

Then switch back to deepseek-v4-pro for complex planning, fact checking, multi-tool collaboration, or high-risk code changes.

Recommended Model Split

DeepSeek V4 Pro and Flash are better used with a clear split.

Model	Best for
`deepseek-v4-flash`	Routine code reading, small batch fixes, script generation, context summarization, low-risk frontend changes
`deepseek-v4-pro`	Architecture planning, complex bugs, cross-file refactors, fact checking, multi-tool calls, high-risk changes

For Agent tools like Cline, cost mainly comes from long context, repeated file reads, plan generation, and multi-round tool calls. If the task is light, use Flash for volume; if the task needs stronger judgment, switch to Pro.

How to Set Context Length

DeepSeek V4 Pro and Flash both support long context. If Cline requires a manual context window value, you can understand it according to the 1M context listed on DeepSeek’s official model page.

In practice, do not put every file into context at the beginning. Cline reads files according to the task, and a better workflow is usually:

first ask it to inspect the directory structure;
then ask it to locate relevant files;
finally let it modify only the target files.

This saves tokens and keeps the task boundary clearer.

Common Issues

1. Model Not Found

First check that Model ID is exactly:

`1`	`deepseek-v4-pro`

Do not write DeepSeek V4 Pro, deepseek-v4, or another display name.

2. 401 or Authentication Failed

Check the API Key:

whether it was copied completely;
whether it contains extra spaces;
whether it was entered into the provider configuration Cline is currently using;
whether the DeepSeek account has available balance.

3. Connection Failed

Check the Base URL:

`1`	`https://api.deepseek.com`

Do not append /v1/chat/completions at the end. Cline’s OpenAI Compatible Provider will construct compatible interface requests itself.

4. Cline Calls Are Too Expensive

You can switch routine tasks to deepseek-v4-flash and use deepseek-v4-pro only for complex tasks.

Also, make the task description as clear as possible:

`1`	`Only modify files related to the login page. Do not refactor unrelated modules. First provide a plan, and modify code only after confirmation.`

Agent tasks are most expensive when boundaries are unclear. The clearer the boundary, the fewer files it reads, the fewer tool calls it makes, and the more controllable the cost becomes.

5. Error: reasoning_content must be passed back

If you see an error like this:

{
  "message": "400 The `reasoning_content` in the thinking mode must be passed back to the API.",
  "code": "invalid_request_error",
  "modelId": "deepseek-v4-pro"
}

This is usually not a Key, quota, or Base URL problem. It means DeepSeek V4 Pro’s thinking mode and the current client’s multi-round tool-call history are not aligned.

DeepSeek’s official documentation states:

thinking mode is enabled by default;
thinking mode returns reasoning_content;
if a tool call happens in one round, subsequent requests must pass back the reasoning_content from that assistant message;
if the client does not pass it back correctly, the API returns 400.

When Cline connects through the OpenAI Compatible Provider, this error may appear in the second round or after tool calls if the current version does not fully preserve and return DeepSeek’s reasoning_content.

Try this order:

Upgrade Cline to the latest version;
confirm you are using OpenAI Compatible, not the normal OpenAI provider;
if Cline supports a custom request body, try disabling thinking mode:

{
  "thinking": {
    "type": "disabled"
  }
}

if Cline does not support extra body parameters, temporarily use another model or a compatible proxy service;
switch back to deepseek-v4-pro after Cline supports passing back DeepSeek V4 reasoning_content.

Note that disabling thinking mode may reduce complex reasoning ability, but it can work around client compatibility issues where reasoning_content is not passed back.

Copyable Configuration

Provider: OpenAI Compatible
API Key: sk-your DeepSeek API Key
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-pro

For low-cost mode:

Provider: OpenAI Compatible
API Key: sk-your DeepSeek API Key
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-flash

Summary

There are only three key steps to calling DeepSeek V4 Pro in Cline:

choose OpenAI Compatible as the provider;
set Base URL to https://api.deepseek.com;
set Model ID to deepseek-v4-pro.

After configuration, test with a read-only task before giving it real code changes. If you often run Agent tasks, split Flash and Pro: Flash handles high-frequency lightweight work, while Pro handles complex judgment and fallback tasks.

References:

How DeepSeek V4 Price Cuts Rewrite the Cost Model for AI Agents

Fri, 01 May 2026 19:47:47 +0800

DeepSeek V4 did not arrive with an especially loud launch. There was no major event, nor a benchmark story that instantly crushed every competitor. But a few days later, the part that truly affects the industry became visible: repeated price cuts.

The point of this change is not that “the model got a little stronger”, but that “usage cost has been pushed into another tier”. When token prices become low enough that an ordinary Agent task can finish for a few cents or a couple of yuan, the business logic behind many Coding Plans and Token Plans needs to be reconsidered.

Launch Day Was Not Explosive

The first wave of feedback to DeepSeek V4 was not especially heated. Many people expected it to deliver the kind of shock R1 did: across-the-board benchmark leadership, validation of domestic compute, and simultaneous breakthroughs in multimodal and Agent capabilities. After the actual release, however, it looked more like a steady upgrade.

V4 Pro is indeed a strong model, especially in coding, math, long context, and agentic coding. But it is not the kind of product that instantly makes every peer model look outdated. So on launch day, the discussion felt a little awkward: people wanted to praise it, but it was hard to find a sufficiently explosive angle.

The real turning point was not launch day, but the price adjustments that followed.

Successive Price Cuts Are the Key

After DeepSeek V4 was released, prices started to move downward. According to DeepSeek’s official pricing page and the information summarized in the source article, the rough prices at that time were:

DeepSeek V4 Flash: about 1 yuan per 1 million input tokens; about 0.02 yuan per 1 million tokens after a cache hit;
DeepSeek V4 Pro: about 3 yuan per 1 million input tokens; about 0.025 yuan per 1 million tokens after a cache hit;
the cache-hit input price across the model family dropped to one tenth of the launch price;
V4 Pro was once in a 75% discount period, extended until May 31, 2026 at 23:59.

The API prices in US dollars make the difference easier to see:

Model	Cached input	Non-cached input	Output	Context
`deepseek-v4-flash`	$0.0028 / 1M tokens	$0.14 / 1M tokens	$0.28 / 1M tokens	1M
`deepseek-v4-pro` promotional price	$0.003625 / 1M tokens	$0.435 / 1M tokens	$0.87 / 1M tokens	1M
`deepseek-v4-pro` regular price	$0.0145 / 1M tokens	$1.74 / 1M tokens	$3.48 / 1M tokens	1M

Two details matter here.

First, V4 Pro’s $0.435 / $0.87 is a promotional price, not the long-term regular price. In DeepSeek’s official notes, this 75% discount was extended until May 31, 2026 at 15:59 UTC.

Second, cache-hit pricing is the key variable in the Agent cost model. Flash’s cached input price is as low as $0.0028 / 1M tokens, while Pro’s promotional cached input price is $0.003625 / 1M tokens. That means repeated project context, tool definitions, system prompts, and historical summaries no longer need to be charged at the full input price.

The most important thing about this pricing is that it makes the token cost of many tasks “insensitive”. In the past, developers worried that one Agent task would consume a large amount of context, repeatedly read and write code, and call tools frequently. Now, as long as the cache hit rate is high enough, the cost can be pushed very low.

Price Comparison With GPT and Claude

DeepSeek’s own prices alone do not fully convey the gap. The contrast becomes much clearer when placed next to common closed-source models from the same period.

Model	Input	Cached input	Output	Best fit
`deepseek-v4-flash`	$0.14 / M	$0.0028 / M	$0.28 / M	High-frequency Agents, routine coding, batch tasks
`deepseek-v4-pro` promotional price	$0.435 / M	$0.003625 / M	$0.87 / M	Complex coding, planning, fact checking
`deepseek-v4-pro` regular price	$1.74 / M	$0.0145 / M	$3.48 / M	Pro cost baseline after the promotion
GPT-5.5	$5 / M	$0.50 / M	$30 / M	High-quality complex tasks, general reasoning
GPT-5.4	$2.50 / M	$0.25 / M	$15 / M	Mid-range choice for programming and professional tasks
GPT-5.4 mini	$0.75 / M	$0.075 / M	$4.50 / M	Lower-cost general and subtask model
Claude Opus 4.7	$5 / M	$0.50 / M	$25 / M	High-quality writing, complex reasoning, long tasks
Claude Sonnet 4.6	$3 / M	$0.30 / M	$15 / M	Programming, Agents, general work
Claude Haiku 4.5	$1 / M	$0.10 / M	$5 / M	Lightweight tasks, summarization, classification

The most striking number in this table is output price. Agents do not only read context; they also keep generating plans, patches, explanations, logs, and next actions. If there is a lot of output, DeepSeek V4 Pro’s promotional $0.87 / M becomes dramatically cheaper than GPT-5.5’s $30 / M or Claude Sonnet 4.6’s $15 / M.

Even at V4 Pro’s regular output price of $3.48 / M, it is still clearly below GPT-5.4, GPT-5.5, and Claude Sonnet / Opus. If the task can be handled by Flash, the output price drops further to $0.28 / M.

The cached input gap is even more extreme. DeepSeek V4 Flash’s cached input price is $0.0028 / M, while GPT-5.5 and Claude Opus 4.7 are both $0.50 / M. These are not in the same order of magnitude. For Agents that repeatedly read the same code repository, this gap matters more than it does in ordinary chat.

Why Agent Tasks Are Especially Affected

AI Agents are different from ordinary chat. Ordinary chat is usually a question-and-answer flow with relatively limited input context. Agent tasks repeatedly read project files, generate plans, call tools, inspect results, and then modify code again.

These tasks have two traits:

large token consumption;
lots of repeated context.

The second point is crucial. In a code project, the model repeatedly reads the same files, directory structure, error logs, and modification results. If the platform supports cache hits, the cost of repeated input drops sharply.

The source article mentioned a real experience: connecting DeepSeek V4 Pro and Flash to a Claude Code-like tool, asking it to pull a prompt repository and turn it into a local search site. The task was completed, with a total cost of roughly a little over 0.8 yuan, and Pro reached a cache hit rate of 98.7%.

This example illustrates a practical issue: the more an Agent task resembles “repeated work around the same project”, the more valuable cache hits become. If generating a website, fixing a bug, or changing a frontend costs only a few cents to a few yuan, subscription plans become less attractive.

We can estimate the gap with a simplified task. Assume one coding agent task includes:

500,000 input tokens, of which 80% can hit cache;
50,000 output tokens;
no tool calls, search costs, or platform markup included, only model token cost.

The rough costs are:

Model	Estimated cost
DeepSeek V4 Flash	about $0.03
DeepSeek V4 Pro promotional price	about $0.09
DeepSeek V4 Pro regular price	about $0.36
GPT-5.4 mini	about $0.30
GPT-5.4	about $1.01
GPT-5.5	about $1.75
Claude Sonnet 4.6	about $1.11
Claude Opus 4.7	about $1.65

This estimate does not mean DeepSeek is better for every task. Model quality, tool-call stability, long-context retrieval ability, coding style, and factual reliability all need separate evaluation. But from a cost perspective, DeepSeek V4 pushes the marginal cost of “letting the Agent run a few more rounds” very low. That will encourage developers to design longer workflows, more frequent self-checks, and more candidate solutions instead of worrying about the token bill every time.

The Difference Between Coding Plans and Token Plans

Many AI products now offer two types of plans: Coding Plans and Token Plans.

The rough difference is:

Coding Plans are usually mainly for programming;
Token Plans usually cover more capabilities, such as STT, TTS, image generation, search, embedding, and RAG;
STT means speech to text;
TTS means text to speech;
Coding Plans often restrict users to programming scenarios, while other capabilities still require separate purchases.

From a business perspective, a Coding Plan is more like a buffet. Users pay a fixed fee in advance, while the vendor bets that most people will not use up the quota. Some users consume more, others consume less, and the platform can still make money on average.

But if pay-as-you-go token prices are low enough, users start calculating: why do I have to buy a plan? If the real monthly usage cost is only a few yuan or a dozen yuan, a 40-yuan or 200-yuan plan may no longer be worthwhile.

Why Price Cuts Challenge the Subscription Model

Subscription plans rely on one premise: users feel that each individual use is expensive, or they do not want to calculate the cost of every call. When token prices are high, a plan feels reassuring. When token prices are almost negligible, pay-as-you-go becomes more natural.

DeepSeek V4’s price cut effectively reveals the underlying cost:

Agent tasks can be very cheap;
long context is not necessarily too expensive to use;
cache hits can reduce cost significantly;
ordinary developers do not necessarily need a fixed subscription;
the model entry point can shift from a “plan platform” to a “low-cost API”.

This will make platforms built around Coding Plans uncomfortable. If users find pay-as-you-go calls cheaper and freer, they have less reason to be locked into one platform’s subscription.

How to Choose Between Flash and Pro

A practical way to use DeepSeek V4 is to split work between Flash and Pro.

Flash is suitable for high-frequency, lightweight, repeatable tasks:

fixing bugs;
writing frontend code;
writing scripts;
routine code understanding;
processing ordinary information in long context;
running large numbers of subtasks.

Flash is cheap, fast, and also supports very long context. For everyday coding agents, many tasks do not need Pro from the start.

Pro is better for complex judgment and fallback work:

multi-round planning;
complex Agent workflows;
multiple function calls;
fact checking;
financial research;
content production that requires stronger knowledge and judgment;
high-risk code changes.

A reasonable setup is: Flash handles volume, Pro handles fallback. Start ordinary tasks with Flash, then switch to Pro for long-horizon planning, complex judgment, fact checking, or multi-tool collaboration. This keeps cost under control while preserving model quality.

Why DeepSeek Can Price This Way

DeepSeek has a different business structure from many large platforms. It does not have e-commerce, social networking, short video, cloud computing, phones, cars, office suites, operating systems, browsers, or a large enterprise SaaS ecosystem.

That means it does not need to lock users into a complete platform. It can simply sell text model capability: use cheap text models here, and call any other capability elsewhere.

Large platforms usually think differently. If you buy their Coding Plan or Token Plan, you are pulled into their cloud, search, image generation, voice, database, and developer-tool ecosystem. The plan is not merely selling the model; it is competing for the user entry point.

DeepSeek’s approach is more direct: push text model prices down and try to become the default model entry point for Agents. Once the default entry point is occupied, many developers and toolchains will naturally adapt around it.

Open Models and the Default Entry Point

If DeepSeek V4 keeps an open model route, third-party cloud vendors and platforms may deploy it themselves and provide services. For DeepSeek, that is both distribution and potential diversion.

This is where a low-price official API matters. If the official price is already low enough, other platforms will struggle to offer an obvious price advantage even if they can deploy the model. Users will tend to use the default, cheap, stable entry point directly.

This is especially true for Agent tools. Agent tasks depend on long context, caching, tool calls, and stable throughput. Once a model is cheap enough in these scenarios, it has a chance to become the default option.

Coding Plans Are Still Not Useless

This does not mean Coding Plans will disappear immediately. They still fit some users.

If some users are truly heavy users who max out their quota every day, a fixed subscription may still be economical. Just like a buffet, if nobody could ever eat enough to get their money’s worth, users would not buy it.

The problem is that most users are not that kind of extremely high-frequency user. Low-frequency users, lightweight developers, and people who occasionally write scripts or modify projects are better suited to pay-as-you-go. After DeepSeek lowers pay-as-you-go costs, the appeal of plans weakens.

The future is more likely to become a layered choice:

heavy high-frequency users keep buying Coding Plans;
ordinary users move to low-cost APIs;
Agent tools automatically choose Flash / Pro according to the task;
platform plans need to provide more non-model value, such as workflows, IDE integration, deployment, team management, and security auditing.

Summary

DeepSeek V4 did not create its biggest impact through benchmarks. What truly changed industry expectations was the price reduction that followed.

When input tokens and cache-hit pricing are pushed very low, the cost of using AI Agents changes. Long context, code-project analysis, and multi-round tool calls that used to look expensive may now become everyday costs of a few cents to a few yuan.

This directly challenges the business logic of Coding Plans and Token Plans. If users can pay by usage, freely combine models and tools, and keep costs low enough, they may not want to be tied to a specific platform plan.

What DeepSeek V4 truly touches this time is not only the ranking of model capability, but the cost structure of AI Agents and the battle for the default entry point.

References:

NVIDIA Releases Nemotron 3 Nano Omni: An Open Omnimodal Reasoning Model for Agents

Fri, 01 May 2026 12:07:15 +0800

NVIDIA has released Nemotron 3 Nano Omni, an open omnimodal reasoning model designed for agent workflows. Its focus is not simply text question answering, but putting language, vision, and audio into the same reasoning framework so the model can handle inputs that are closer to real work.

In positioning, Nemotron 3 Nano Omni looks more like a foundation model prepared for AI Agents. It can understand information from screens, documents, images, speech, and video, then turn that information into actionable reasoning results. This kind of capability fits computer operation, document intelligence, video understanding, voice interaction, customer service, education, and enterprise process automation.

Model Specs

Nemotron 3 Nano Omni uses a MoE architecture. The key specs NVIDIA lists are:

Item	Information
Model name	`Nemotron 3 Nano Omni`
Architecture	MoE
Parameter scale	30B total / 3B active
Modalities	Text, image, audio, video
Context length	256K tokens
License	Apache 2.0
Main deployment direction	AI Agents, multimodal reasoning, enterprise agents

The most notable point here is 30B-A3B. It means the model has about 30B total parameters, but only activates about 3B parameters during each inference step. This is a tradeoff between capability and inference cost: the model keeps a larger expert capacity while using only part of it at runtime.

That said, MoE active params does not mean VRAM can be estimated as if this were only a 3B model. A full deployment still needs to account for expert weights, KV cache, vision and audio encoder modules, context length, and inference framework overhead.

It Is Not Solving a Single-Modality Problem

Traditional large language models mainly process text. Multimodal models add image understanding. Nemotron 3 Nano Omni has a broader target: it emphasizes omnimodal input, meaning text, images, audio, and video are all brought into a unified reasoning process.

This matters a lot for agents. Real agent tasks are often not “take a piece of text and generate another piece of text”; they are more like:

reading buttons, tables, and windows on a screen;
parsing PDFs, screenshots, charts, and webpages;
listening to spoken instructions or meeting recordings;
understanding actions, scenes, and timing in video;
combining those signals into the next operation.

If a model can only handle one modality, an Agent needs extra glue between multiple specialized models. The value of an omnimodal model is reducing that integration cost and letting the same model directly process more complex environmental inputs.

Built for Computer Operation and Document Intelligence

NVIDIA specifically notes that Nemotron 3 Nano Omni can be used for computer-operation tasks. These tasks usually require the model to understand user interfaces:

what controls are on the screen;
what state the current window is in;
which button or menu is the next target;
what the content in tables, dialogs, and input boxes means.

This is also one of the hard-to-avoid capabilities when AI Agents move into real deployment. If an agent is going to help people operate office software, browsers, enterprise backends, or developer tools, it has to understand the interface, not just read API docs.

Document intelligence follows a similar logic. Enterprise materials often mix text, tables, images, scanned pages, and charts. An omnimodal model can put all of that content into the same context for understanding, making it suitable for contract review, report analysis, invoice processing, knowledge-base QA, and process automation.

Audio and Video Bring Agents Closer to Real Scenarios

Audio and video inputs can noticeably expand the range of agent applications.

Audio scenarios include:

meeting recording summaries;
customer service call analysis;
voice command understanding;
education and training content organization.

Video scenarios include:

instructional video understanding;
security and industrial inspection;
screen recording analysis;
operation workflow review;
temporal reasoning in multi-step tasks.

If these tasks rely only on text transcription, a lot of visual and timing information is lost. An omnimodal model can directly combine voice, frames, and textual clues, giving Agents a more complete sense of their environment.

Deployment and Ecosystem

NVIDIA is placing Nemotron 3 Nano Omni inside an open ecosystem, and the model uses the Apache 2.0 license. That matters for developers and enterprises because it lowers the licensing barrier for experimentation, integration, and secondary development.

From NVIDIA’s introduction, this model is also closely tied to its inference ecosystem. For enterprise users, real deployment usually raises questions like:

whether it can run efficiently on NVIDIA GPUs;
whether it supports long context and multimodal input;
whether it can connect to existing Agent frameworks;
whether it can process internal documents, audio/video, and UI screenshots;
whether it can be deployed in private environments.

NVIDIA emphasizes that the model has a clear throughput advantage and says it can reach up to 9x the throughput of comparable open omnimodal reasoning models. The real value of that number still depends on the specific hardware, context length, input modalities, and inference framework. But the direction is clear: NVIDIA wants to bring open multimodal models and its inference infrastructure together into enterprise Agent scenarios.

Suitable Use Cases

Nemotron 3 Nano Omni is better suited to tasks such as:

Agents that need to understand text, images, audio, and video at the same time;
enterprise document intelligence and knowledge-base QA;
computer operation based on screenshots or web interfaces;
multimodal analysis of meetings, customer service, and teaching content;
video understanding, workflow review, and temporal reasoning;
teams that require open licensing and private deployment.

It is not necessarily a fit for every regular user. If the task is local chat, code completion, or simple QA, a single-modality language model may be lighter, faster, and more resource-efficient. The value of Nemotron 3 Nano Omni mainly appears in complex input and multimodal Agent workflows.

What This Means for AI Agents

For AI Agents to truly enter work scenarios, they cannot only write text. They need to understand interfaces, speech, documents, and changes in video, then turn that information into the next action.

That is where Nemotron 3 Nano Omni matters. It is not simply making the model larger; it is unifying the many kinds of input Agents face into one reasoning model. This can make it easier for developers to build agents for real tasks instead of building only around chat windows.

From this angle, the point of NVIDIA’s release is not just “another multimodal model”. It is part of a continuing effort to connect open models, GPU inference, enterprise Agents, and private deployment. What will be worth watching next is how it performs in concrete Agent frameworks, enterprise workflows, and local deployments.

References:

NVIDIA Technical Blog: NVIDIA Nemotron 3 Nano Omni

FinceptTerminal: An Open-Source Financial Terminal, Quant Research, and AI Agent Workbench

Fri, 01 May 2026 03:47:18 +0800

FinceptTerminal is an open-source financial terminal project from Fincept Corporation.

Based on the README, it is not a simple market quote panel. It is a comprehensive desktop platform for financial analysis, quant research, trading workflows, and AI Agents. Version 4 is built with C++20 and Qt6 as a native desktop application, while embedding the Python ecosystem for analytics, scripting, machine learning, and financial modeling.

If we need a comparison, it is closer to an open-source financial research workbench: connecting data sources on one side, and handling charts, portfolios, quant research, trading, intelligence analysis, and automated workflows on the other.

One thing should be made clear first: tools like this can be used for research, analysis, education, and internal tool building, but no output should be treated directly as investment advice. Financial markets are risky, and data, models, strategies, and execution all require independent verification.

What problem does it solve?

Financial research is often scattered across many tools:

Market data lives in one application
Research code lives in Jupyter
Charts live in another tool
Portfolio analysis lives in spreadsheets
Trading records live in brokerage systems
News and intelligence live in the browser
AI analysis lives in a chat window

This approach works, but collaboration and reproducibility are difficult.

FinceptTerminal tries to integrate these capabilities into one desktop terminal, so users can complete data access, analysis, modeling, visualization, Agent collaboration, and trading-related workflows in the same environment.

Its goal is not to replace every professional system, but to provide an extensible open-source foundation for a financial terminal.

Technical architecture

The README mentions that v4 uses C++20 and Qt6.

This means it is not a pure web panel, but a native desktop application. For a financial terminal, native applications have several advantages:

More stable UI responsiveness
Better fit for complex windows and multi-panel layouts
Easier access to local files and system resources
Ability to embed high-performance components
Better suited for long-running desktop workflows

At the same time, the project also embeds Python.

This is important. In financial research and quant analysis, Python is one of the de facto mainstream languages. Data analysis, machine learning, statistics, backtesting, charting, and financial modeling all rely heavily on the Python ecosystem. C++/Qt handles the application framework and desktop experience, while Python handles research and extensibility. That is a very practical combination.

Data connectors

The README says the project provides 100+ data connectors.

The value of a financial terminal depends heavily on data access. Without data, even the best UI and models are just an empty shell.

These connectors can usually cover different sources:

Market quotes
Macroeconomic data
Company financials
News and intelligence
Exchange data
Crypto asset data
Research data sources
Internal or custom APIs

For users, data connectors reduce the workflow of “download CSV, clean it manually, then import it again”, making analysis closer to real-time and automation.

That said, the quality, licensing, latency, coverage, and cost of financial data are all critical. Before using any data source, its license and usage boundaries need to be confirmed.

AI Agents module

The project emphasizes AI Agents, which is also where it differs from traditional financial terminals.

Traditional terminals are mostly human-operated interfaces: people look at data and make judgments. With AI Agents, the tool can take on more assistant-style work:

Summarize market information
Explain financial reports and announcements
Generate research summaries
Help filter data
Assist with analysis scripts
Organize trading or research workflows
Pass context across modules

This does not mean AI can replace analysts or traders.

A more reasonable position is this: AI Agents help reduce repetitive organization work and provide preliminary analysis and interactive queries, but important conclusions still require data validation, model validation, and human judgment.

Quant research capabilities

FinceptTerminal is also aimed at quant research.

Quant research usually includes:

Data cleaning
Factor construction
Strategy hypotheses
Backtesting
Risk assessment
Portfolio optimization
Trading cost estimation
Result visualization

If a terminal can integrate data connections, Python analysis, charts, and workflows, it can be very useful for quant research. Researchers can move step by step from data to strategy validation in one environment.

However, the biggest danger in quant research is something that “looks effective.” If a strategy does not strictly handle out-of-sample validation, trading costs, slippage, survivorship bias, overfitting, and data leakage, even a beautiful backtest is unreliable.

So this kind of tool should be treated as a research platform, not an automatic money-making machine.

QuantLib and financial modeling

The README mentions QuantLib-related capabilities.

QuantLib is a common open-source library in financial engineering. It is often used for interest rates, bonds, options, derivatives pricing, curve construction, risk calculation, and related areas.

This means FinceptTerminal is not only about viewing stock quotes. It also tries to cover more professional financial modeling scenarios.

These capabilities are suitable for:

Learning financial engineering
Experiments in derivatives pricing
Curve and risk metric calculation
Portfolio risk analysis
Research model prototyping

But financial modeling itself has a high barrier. Model parameters, market assumptions, data sources, and pricing logic all affect the results. A tool can reduce operating costs, but it cannot replace professional judgment.

Node workflows

The README also mentions node-based workflows.

Node workflows are suitable for breaking complex tasks into visual processes:

Read data
Clean data
Run models
Generate charts
Trigger AI analysis
Output reports
Send notifications

For financial scenarios, this approach has two advantages.

First, the process becomes visible. Complex analysis is no longer hidden only inside a pile of scripts, and users can see how data flows.

Second, it is suitable for automation. Repetitive research processes can be saved, reused, and adjusted.

If these workflows can be combined with Python scripts, data connectors, Agents, and reporting systems, this kind of node workflow can become a very valuable module inside a financial terminal.

Trading and portfolio management

The project also mentions trading and portfolio-related capabilities.

This is the area that requires the most caution.

Portfolio management can help users understand asset exposure, returns, drawdowns, volatility, correlation, and risk concentration. Trading modules may involve orders, accounts, execution, and records.

But whenever real trading is involved, the following must be considered:

Data latency
Order execution risk
API permissions
Trading costs
Slippage
Liquidity
Risk control limits
Auditing and logs
Accidental strategy triggers

Trading features in development and research environments should not be equated with production-grade trading systems. Before connecting to live trading, strict testing, permission isolation, risk control mechanisms, and manual review are required.

How is it different from Bloomberg Terminal?

Many financial terminal projects are compared with Bloomberg Terminal.

But the positioning is different.

The value of Bloomberg Terminal is not only its software interface. It also includes:

Data coverage
Data licensing
News network
Trading ecosystem
Customer support
Financial institution workflows
Long-accumulated industry trust

FinceptTerminal is more like an open-source financial terminal framework and research platform. Its strengths are extensibility, customization, localization, and integration with Python and AI workflows.

It should not be understood simply as a free replacement for Bloomberg.

A more reasonable view is this: if you want to study how financial terminals are built, or if you want to build your own financial analysis workbench, FinceptTerminal provides an open-source starting point.

Licensing and commercial boundaries

The README mentions that the project uses AGPL and a commercial licensing model.

AGPL has explicit requirements for network services and derivative works. If you only use it for learning, research, or personal experiments, it is usually not a big issue. But if you plan to turn it into a commercial product, internal platform, or external service, you need to read the license carefully.

Financial tools often enter internal enterprise systems. In that case, open-source licenses, commercial licenses, data licenses, and model licenses all need to be reviewed together, instead of only asking whether the code can run.

Who should pay attention?

FinceptTerminal is suitable for:

Developers interested in financial terminal architecture
People doing quant research or financial engineering experiments
People who want to embed Python analysis into desktop tools
People exploring AI Agent + finance workflows
Teams building internal financial analysis platforms
People learning C++/Qt financial application development

If you only want to watch quotes for a few stocks, ordinary market software may be simpler.

If you want to understand how a financial terminal integrates data, charts, models, Agents, trading, and workflows, this project is more worth studying.

Things to watch when using it

First, distinguish research from trading.

Research environments can tolerate experiments and failure. Trading environments cannot. Do not connect a research tool to real accounts before it has been verified.

Second, take data licensing seriously.

Financial data cannot simply be scraped and used commercially. Different data sources have different licensing terms, especially market data, news, financial statements, and exchange data.

Third, do not blindly trust AI Agents.

AI can help organize information, but financial conclusions must return to data, models, risk, and factual validation.

Fourth, pay attention to security.

If a tool connects to accounts, API keys, trading interfaces, or internal data, key management, permission isolation, logs, and network boundaries must be handled properly.

Fifth, understand the open-source license.

AGPL has important implications for commercial use and service deployment. Before productization, licensing issues should be handled first.

Reference

Fincept-Corporation/FinceptTerminal

Final thought

What makes FinceptTerminal worth watching is that it puts financial terminals, Python quant research, AI Agents, data connectors, and node workflows into the same open-source desktop platform concept.

It is better suited as a starting point for financial technology research and internal tool building than as a finished product that can directly replace professional financial terminals or live trading systems.

mattpocock/skills: A Practical Skill Collection for AI Coding Agents

Fri, 01 May 2026 03:43:20 +0800

mattpocock/skills is a public collection of AI coding agent skills from Matt Pocock.

It is not a full application, nor a new chat client. It is a set of working skills that can be used by AI coding assistants. The idea is practical: break common AI coding problems into small skills that an Agent can call in the right task, instead of relying on one huge prompt every time.

If you often use Claude Code, Codex, Cursor, or similar AI coding tools, this kind of skills collection is worth watching. What really affects the AI coding experience is often not whether the model can write code, but whether it can move through the task in your preferred working style.

What Problem It Solves

AI coding assistants are powerful, but they can easily go wrong.

Common situations include:

Starting code changes before understanding the requirement
Modifying too many files at once
Producing lots of explanation but little useful action
Blindly trying things after errors
Not running tests or checks in time
Ignoring existing project patterns
Introducing unnecessary abstractions to finish a task
Writing code without truly reviewing risks afterward

These problems are not always caused by weak model capability. Often, the workflow is not constrained well enough.

The value of mattpocock/skills is that it turns these common failure modes into reusable operating methods, making the Agent behave more like an experienced engineering collaborator in different scenarios.

What Are Skills

In the AI Agent context, a skill can be understood as a reusable task instruction, working method, or professional workflow.

It does not have to be a code plugin, and it does not always need to call an external service. In many cases, a skill is simply a clear set of rules:

When to use it
What to do first
What not to do
What output is required
How to judge task completion

This is somewhat like a normal prompt template, but the granularity is closer to a task capability.

Normal prompt templates are usually copied and pasted manually by the user. Skills are better as part of an agent toolbox, allowing the Agent to choose the right workflow for the task.

Why Small and Composable Matters

The README emphasizes that these skills are small and composable.

This direction matters.

If one skill tries to handle everything, it quickly becomes a new giant prompt: long, vague, and hard to maintain. The advantage of small skills is clear boundaries.

For example, one skill can focus on:

Planning first
Fixing TypeScript errors
Running tests and fixing based on results
Doing code review
Summarizing project conventions
Improving prompts
Removing unnecessary abstractions

These skills can be combined according to the task. A simple task may need only one skill, while a complex task can chain several together.

This is closer to real engineering work. You do not use the same workflow for every problem; you choose tools according to the situation.

Keeping the Engineer in Control

One important direction of this repository is keeping the engineer in control.

AI coding can easily slide into two extremes.

The first is fully manual. AI only helps write a few lines of code, while all context, planning, and verification still depend on you.

The second is fully hands-off. You throw a task to an Agent, let it change a lot of things, and then face a diff that is hard to review.

Skills help find a more stable middle position.

They let AI take on more repetitive workflow, while still constraining it with rules:

Understand the task before acting
Read relevant files before editing
Keep the modification scope controlled
Report uncertainty
Verify after changes
Do not refactor unrelated code just to show off

This does not weaken AI. It makes AI actions easier for humans to review and take over.

Alignment Problems

The first kind of AI coding failure is often alignment failure.

The user wants a very specific change, but the Agent may understand it as a larger refactor. The user only wants a bug fixed, but it changes styles along the way. The user wants existing architecture to be followed, but it introduces a new pattern.

Skills can help the Agent do several things at the start of a task:

Restate the goal
Identify the impact scope
Recognize existing implementation patterns
Provide a plan
Clarify what will not be done

This step is like an engineer’s self-check before starting work.

If the Agent cannot clearly state the task boundary and starts writing code directly, it is easy for the task to drift.

Feedback Loop Problems

AI should not write code through one-shot generation alone.

In real development, feedback loops matter:

Change a small piece
Run tests or type checks
Read the errors
Fix them
Verify again

Many Agents fail because they skip the middle feedback. They change many things at once and then summarize from intuition that “it should work.”

Skills can make the feedback loop explicit. For example, they can require the Agent to:

Run relevant checks after modification
Read error messages first if checks fail
Avoid blindly changing unrelated files
Re-verify after each round of fixes
Report final verification results

This makes AI coding more like real debugging and less like one-shot writing.

Architecture Control Problems

AI is good at generating abstractions, and also good at over-generating abstractions.

To complete a small requirement, it may create a service layer, helper functions, configuration objects, type wrappers, and adapters, making the code much more complex than the requirement itself.

This is especially dangerous in large projects. AI-generated abstractions often look “professional,” but they may not match existing project style and may increase maintenance cost.

Good skills remind the Agent to:

Prefer existing patterns
Avoid unnecessary new abstractions
Avoid refactoring unrelated areas
Match the change to the size of the task
Understand the code before designing structure

This reduces output that looks engineered but is actually harder to maintain.

Why Review Skills Matter

Writing code and reviewing code are different states.

When an Agent writes code, it usually tends to prove that its implementation works. It may explain why the change should work, but it does not always actively look for risks.

The purpose of a review skill is to switch the Agent’s role:

Find potential bugs
Find behavior regressions
Find missing tests
Find edge cases
Find increased complexity
Find inconsistencies with existing conventions

This matters for AI coding because AI generates code quickly. Without review, users can easily be overwhelmed by large diffs.

A good review output should list issues first, not praise the implementation first. It should help the engineer decide whether the change can be merged.

Difference from Normal Rules Files

Many AI coding tools support rules, instructions, or memory.

These files usually record long-term rules, such as:

Project tech stack
Naming conventions
Test commands
Directories not to modify
Answer style preferences

Skills are more focused on task workflow.

Rules tell the Agent “how to behave in the long term,” while skills tell the Agent “how to execute this kind of task.”

The two work best together.

For example, rules can say the project uses pnpm test, while a review skill requires checking test coverage after changes. Then the Agent knows not only the command, but also when to use it.

Suitable Scenarios

Repositories like mattpocock/skills are suitable for:

Frequent use of AI coding tools
Agents working on real codebases
Reducing out-of-scope AI edits
Making the Agent verify results more actively
Turning your engineering habits into skills
Learning how others design agent workflows
Turning temporary prompts into a maintainable skill collection

If you only occasionally ask AI to write a small function, you may not need to maintain skills.

But if you already treat AI as a long-term development partner, skills become increasingly important. They are like a reusable working method for the Agent.

How to Learn from This Repository

Even if you do not use every skill directly, you can learn several things from this repository.

First, write down failure modes.

Do not only complain when AI makes a mistake. Turn the patterns it often gets wrong into rules, so a skill can prevent them next time.

Second, keep skills short.

One skill should solve one clear problem. The shorter it is, the easier it is to call correctly and maintain.

Third, make output format clear.

If you want the Agent to list a plan first, execute next, and summarize verification results at the end, write that structure clearly. Vague requirements usually produce vague results.

Fourth, keep human handoff points.

A good skill should not let AI run too far alone. When there is uncertainty, expanded impact scope, failing tests, or a product decision, it should stop and explain the situation.

Notes for Use

First, do not turn everything into a skill.

Too many skills make the system complex, and the Agent may not know which one to choose. Start with the highest-frequency and most painful scenarios.

Second, skills need iteration.

The first version of a skill may not be good. Watch how AI actually executes it, then gradually delete, add, and rewrite.

Third, do not let skills replace engineering judgment.

Skills can improve workflow, but they cannot guarantee correct implementation. Tests, review, build checks, and human judgment still matter.

Fourth, pay attention to differences between Agents.

Claude Code, Codex, Cursor, and Copilot support instructions, skills, and rules differently. The same idea can be reused, but the specific format should be adjusted for each tool.

Reference

mattpocock/skills

Final Thought

What makes mattpocock/skills worth watching is not one magic prompt inside it, but the practical AI coding idea it demonstrates: break engineering experience into small skills, then let the Agent combine them by scenario.

As AI coding moves from occasional assistance into daily workflow, skills become important tools for constraining Agents, keeping engineers in control, and improving feedback quality.

free-claude-code: Connecting Claude Code to OpenRouter, DeepSeek, and Local Models Through a Proxy

Fri, 01 May 2026 03:41:49 +0800

free-claude-code is an Anthropic-compatible proxy for Claude Code.

Its idea is not to crack Claude Code, nor to provide an official free Claude service. Instead, it starts a local proxy service that looks like an Anthropic API, then forwards requests from Claude Code to other model backends. The README mentions backends such as NVIDIA NIM, OpenRouter, DeepSeek, LM Studio, llama.cpp, and Ollama.

In simple terms, it solves this problem: you like the terminal experience of Claude Code, but want to send model requests to another provider or a local model.

What Problem It Solves

Claude Code has an interaction model that works well for development tasks.

It can read code, edit files, run commands, and move tasks forward based on project context inside the terminal. But many users may not always want to use the same model backend:

They want to try different models on OpenRouter
They want to use models such as DeepSeek to reduce cost
They want to route requests to local Ollama
They want to run local models through LM Studio or llama.cpp
They want one proxy entry point in the development environment
They want to compare different models inside the Claude Code workflow

free-claude-code is positioned as a compatibility layer between Claude Code and these model services.

Claude Code still sends requests in an Anthropic-like style, while the proxy adapts those requests to different backends.

How It Works

You can think of it as three layers:

The frontend is Claude Code
The middle layer is the free-claude-code proxy
The backend is OpenRouter, DeepSeek, a local model, or another model service

Claude Code believes it is accessing an Anthropic-compatible API.

After the proxy receives a request, it selects a target provider according to configuration, transforms the necessary fields, and returns the response to Claude Code.

The benefit of this structure is that you do not need to modify Claude Code itself, and you do not need every model service to natively support Claude Code. As long as the proxy can align the interfaces, more models can be connected to the same workflow.

Supported Backends

The README lists these directions:

NVIDIA NIM
OpenRouter
DeepSeek
LM Studio
llama.cpp
Ollama

These backends represent different usage styles.

OpenRouter is more like a model aggregation entry point, useful for testing different commercial and open-source models.

DeepSeek is suitable for people who care about Chinese ability, coding ability, and cost.

LM Studio, llama.cpp, and Ollama are more local-model oriented. They are suitable for running models on your own machine or inside an intranet, reducing dependence on external APIs and making offline experiments easier.

NVIDIA NIM is more oriented toward enterprise and GPU inference deployment scenarios.

Why an Anthropic-Compatible Proxy

Claude Code was originally designed around Anthropic interfaces and model conventions.

If you want to connect it to other models, the most direct problem is interface mismatch:

Request fields differ
Model names differ
Streaming formats differ
Tool use is represented differently
Error response formats differ
Token and context limits differ

This is where the proxy layer is useful.

It keeps the interface seen by Claude Code close to the Anthropic shape, then adapts to the backend. For users, after configuring the proxy once, they can test different models inside the same Claude Code workflow.

Suitable Scenarios

free-claude-code is suitable for:

Using the Claude Code terminal workflow
Testing non-Anthropic models in Claude Code
Reducing model calling costs
Connecting Claude Code to OpenRouter
Connecting to compatible model services such as DeepSeek
Running local models through Ollama, LM Studio, or llama.cpp
Giving a team one unified model proxy entry point

If you only use official Claude Code normally and have no special needs around providers, cost, or local deployment, you may not need this type of proxy.

But if you often compare models, or want Claude Code to connect to local and third-party models, this type of tool is useful.

Difference from Directly Using OpenRouter or Ollama

Using OpenRouter, Ollama, or LM Studio directly usually means chatting with a model or calling it through an API.

The point of free-claude-code is not to replace those services, but to connect them to the Claude Code development workflow.

The difference is:

You still use the Claude Code terminal experience
AI can execute tasks around a code repository
The model backend can be changed to another provider
Local models can enter the Claude Code workflow
Configuration is centralized in the proxy layer instead of changed in each tool

So it is more like a bridge than a new chat client.

Notes About Local Models

Connecting Claude Code to local models is attractive, but there are real limitations.

First, model capability differs.

Claude Code tasks are usually not just chat. They include understanding code, planning modifications, editing files, and handling command output. Smaller local models may not complete these tasks reliably.

Second, context window matters.

Code tasks need a lot of context. If the model context is too small, it may fail to read full files, miss constraints, or lose background across multi-turn tasks.

Third, tool use compatibility matters.

Claude Code workflows depend on tool calls and structured behavior. Even if a backend model can chat, it may not follow tool-use protocols well.

Fourth, speed and hardware matter.

Local model speed depends on machine configuration, quantization, and model size. If code tasks respond too slowly, the experience drops noticeably.

So local models are better for experiments, low-risk tasks, and specific scenarios. For truly complex coding tasks, choose carefully according to model capability.

Usage Boundaries

Projects like this are easy to misunderstand from the title, so the boundaries should be clear.

First, it is not an official free Claude Code quota.

It only forwards Claude Code requests to other model backends. When using OpenRouter, DeepSeek, NVIDIA NIM, or other APIs, you still need to follow the pricing, quotas, and terms of the corresponding services.

Second, it is not a tool for bypassing authorization.

When using any proxy tool, you should follow the licenses and terms of Claude Code, model providers, and the project itself. Do not interpret it as a way to avoid official restrictions.

Third, the proxy handles your request content.

Code, command output, and project context may pass through the proxy and backend services. When deploying, consider logs, keys, network boundaries, and privacy. For company code or sensitive projects, use a controlled environment.

Fourth, model performance varies greatly.

The same Claude Code operation may behave very differently after switching models. Do not assume every model can replace Claude.

Relationship with Proxies Such as LiteLLM

Conceptually, free-claude-code belongs to the category of compatible interface proxies.

The shared goal of such tools is to reduce coupling between upper-level applications and lower-level model services. The upper-level application faces a relatively unified interface, while backend providers can be switched by configuration.

Different projects focus on different areas. Some are general model gateways, some focus on OpenAI-compatible APIs, and some specifically adapt tools such as Claude Code.

What makes free-claude-code worth noting is that it puts Claude Code directly at the center, rather than building a generic chat proxy.

Suitable Users

It is better suited to users who are comfortable tinkering:

Familiar with Claude Code
Know how to configure API keys and model providers
Understand proxy service startup and environment variables
Can troubleshoot network, port, model name, and streaming issues
Want to compare different models on coding tasks

If you only want something that works out of the box, the official configuration is usually simpler.

If you are willing to set up a proxy, switch models, tune parameters, and let Claude Code enter more model environments, this project is worth studying.

Reference

Alishahryar1/free-claude-code

Final Thought

The value of free-claude-code is not in the word “free,” but in the bridge it builds between Claude Code and more model backends.

When you want to keep the Claude Code development experience while testing OpenRouter, DeepSeek, local models, or enterprise inference services, an Anthropic-compatible proxy like this becomes useful.

Compound Engineering Plugin: Turning AI Coding into a Plan, Execute, Review Engineering Loop

Fri, 01 May 2026 03:15:39 +0800

Compound Engineering Plugin is an open-source AI coding workflow plugin from Every Inc.

It is not focused on “making AI write a piece of code faster.” Instead, it places AI coding inside a loop that looks more like an engineering team: plan first, implement next, review afterward, then preserve what was learned. For people who frequently use tools such as Claude Code, Codex, Cursor, and Copilot, this kind of plugin solves a workflow problem, not just a prompt problem.

AI coding tools are becoming stronger, but in real projects the hardest part is often not generating code. It is making the AI continuously follow project rules, understand task boundaries, avoid repeating mistakes, and accumulate context across multiple iterations.

What Problem It Solves

Many people use AI coding assistants in a flow like this:

Describe the requirement directly
Ask AI to modify the code
Check whether the result runs
Add more explanation after errors appear
Explain the background again in the next task

This can work for small tasks, but it easily breaks down in complex projects:

Requirements are not clarified before AI starts editing
There is no systematic review after code changes
Project conventions depend on repeated user reminders
Similar mistakes happen again next time
Multiple Agent tools lack a shared working method
Experience is not turned into reusable rules

Compound Engineering Plugin is designed for this class of problems. It splits AI coding into multiple stages, so an Agent is not only executing commands but participating in a more complete engineering process.

What Is Compound Engineering

From the project README, Compound Engineering can be understood as a method for AI-assisted software development.

It emphasizes a loop:

Plan: understand the goal, split the task, confirm the path
Execute: modify code according to the plan, run commands, handle problems
Review: check implementation quality, risks, and test coverage
Learn: preserve experience as reusable rules for future work

This loop resembles how real engineering teams work.

A reliable engineer does not receive a requirement and immediately make random changes, nor does he finish edits and hand them off without checking. He first judges the impact scope, then implements, then checks risks and test results, and finally records the traps he stepped into. AI Agents need similar constraints.

Why a Plugin Is Needed

A prompt can tell AI, “Please plan before executing,” but prompts themselves are not always stable.

Once a conversation becomes long and context becomes complex, the model may skip planning, ignore rules, or become overconfident in order to finish the task. The value of a plugin is that it fixes the workflow so different Agent environments can follow similar methods.

This kind of plugin usually breaks a workflow into commands, rules, templates, or subflows. The user does not need to manually write the full prompt every time. Instead, a fixed entry point triggers a specific stage.

For example:

Ask the Agent to generate a plan first
Implement step by step according to the plan
Trigger review after edits
Return to fixing after problems are found
Write useful experience into memory or rules

This makes AI coding feel more like controlled collaboration instead of one-off chat.

Supported Agent Environments

The README mentions support for multiple AI coding environments, including:

Claude Code
Codex
Cursor
GitHub Copilot
Amp
Factory
Qwen Code

This is worth noting.

Many workflow tools are tied to one client. Once you switch tools, the rules cannot be reused. Compound Engineering Plugin is more like a cross-Agent engineering method, bringing similar planning, execution, and review workflows to different tools.

If you use multiple AI coding assistants at the same time, this unified workflow becomes more valuable. Different tools have different capabilities, but project conventions, review habits, and task decomposition methods should remain as consistent as possible.

Why the Planning Stage Matters

The value of the planning stage is to stop AI from acting too early.

In complex tasks, the truly important questions are usually:

Which files need to change?
Which modules may be affected?
What existing pattern should be followed?
Are there tests?
Where are the risks?
Should documents be read first?
Can the task be split into smaller steps?

If an Agent starts writing code before thinking through these questions, it can easily produce an implementation that looks finished but deviates from the project structure.

A plan does not need to be long. A good plan should be short, specific, and executable. Its purpose is not to create documentation, but to give the following implementation clear boundaries.

What to Avoid in Execution

When AI executes coding tasks, several problems appear easily:

Refactoring unrelated code
Overwriting existing user changes
Only handling the happy path
Ignoring error handling
Not following the existing project style
Not running necessary verification
Blindly trying things after errors

A workflow plugin cannot guarantee these problems will disappear, but it can reduce their probability through rules and staged constraints.

For example, the execution stage can require the Agent to proceed according to the plan. When it discovers something outside the plan, it should explain the risk first. When modifying shared modules, it should add tests or at least run related verification.

This is especially important in large codebases. The faster AI writes code, the more process is needed to constrain its momentum.

Why Review Matters

Many AI coding failures are not caused by code that cannot run at all. They come from detail problems:

Edge cases are not handled
State updates are inconsistent
API contracts are changed quietly
Tests do not cover key paths
Error messages are unclear
Performance or security risks are not mentioned

The review stage switches the Agent from “author mode” to “reviewer mode.”

Author mode tends to justify its own implementation. Reviewer mode should actively look for holes, regression risks, and missing tests. Separating these two stages is more reliable than asking the same response to both implement and self-review.

For users, review output is also more valuable. It helps you quickly judge whether the change is ready to merge or still needs rework.

The Meaning of Learning and Memory

The word “Compound” in the project name suggests an important idea: engineering experience should compound.

If AI fixes a mistake only for the current task and then repeats the same mistake next time, the productivity gain is limited. A better approach is to preserve useful experience:

Directory conventions in this project
Debugging methods for a class of errors
Test commands and notes
Generated files that should not be touched
Code style preferences
Common implementation patterns

These experiences can become rules, memories, documents, or templates. In later tasks, the Agent reads these accumulated notes before starting work.

This is the key to moving AI coding from “one-off Q&A” toward “long-term collaboration.”

Suitable Scenarios

Compound Engineering Plugin is suitable for:

Long-term use of AI Agents for coding
Projects that receive many rounds of modifications
Teams that want AI to plan before implementing
Users who want review thinking after changes
Teams that want a unified AI coding workflow
People who use Claude Code, Codex, Cursor, and other tools at the same time
Teams that want to turn project experience into reusable rules

If you only occasionally ask AI to write a small script, the full workflow may feel heavy.

But if you treat AI coding assistants as daily development partners, the plan, execute, review, learn loop becomes clearly useful.

Difference from Normal Prompt Templates

Normal prompt templates usually solve “how to state the task clearly.”

For example:

Please think step by step
Please read the files first
Please keep code style consistent
Please run tests
Please summarize the changes

These prompts are useful, but they still rely on the user using them correctly every time.

Compound Engineering Plugin operates more at the workflow layer. It organizes these requirements into a repeatable process and adapts them to different Agent tools. You are not writing prompts from scratch every time; you are moving tasks through a workflow.

Simply put, a prompt template is like a reminder, while a workflow plugin is like a system.

Notes for Use

First, do not let the process become a burden.

Small tasks do not always need a full plan and long review. A good workflow should adapt to task complexity: handle simple problems quickly and use the full loop for complex ones.

Second, review cannot replace tests.

Agent review can find many problems, but it can still miss real runtime errors. Final judgment still depends on tests, type checks, build results, and human review.

Third, rules need continuous cleanup.

Preserving experience is important, but rules can become noise as they accumulate. Outdated rules, duplicate rules, and temporary experience that only applied to one task should be cleaned up regularly.

Fourth, cross-tool consistency does not mean everything is identical.

Claude Code, Codex, Cursor, Copilot, and other tools have different capabilities and interaction models. What should be unified is the working method, not necessarily every command or configuration detail.

Suitable Teams

If a team already allows AI Agents to modify real code, it is not enough to discuss only “which model is stronger.”

The more important questions are:

Does AI understand the task before editing?
Does AI follow project boundaries during editing?
Does AI actively review risks after editing?
Can AI learn from historical mistakes?
Does the team have unified Agent usage conventions?

This is where projects such as Compound Engineering Plugin matter. They move AI coding one step away from personal tricks and toward reusable team workflow.

Reference

EveryInc/compound-engineering-plugin

Final Thought

What makes Compound Engineering Plugin worth watching is not that it adds another AI coding command, but that it organizes AI coding into an engineering workflow that can improve over time.

When AI Agents start participating in real projects, planning, execution, review, and experience preservation become more important than one-off code generation.

TradingAgents-CN: A Multi-Agent Financial Trading Research Framework for Chinese Users

Fri, 01 May 2026 03:14:15 +0800

TradingAgents-CN is a multi-agent financial trading research framework for Chinese users.

Its goal is not to give a simple answer such as “which stock should I buy.” Instead, it uses multiple AI Agents to simulate a more complete financial analysis team: one role looks at fundamentals, another looks at technicals, another follows news and sentiment, while others handle risk and final decisions. For people studying LLM + Agent + financial analysis, this kind of project is a good experimental entry point.

One thing should be clear first: tools like this are suitable for learning, research, and auxiliary analysis. They should not be treated as real trading advice. Financial markets involve risk, and model outputs can be wrong, delayed, or overconfident.

What Problem It Solves

Normal chat models can also analyze stocks.

You can directly ask, “Help me analyze whether a company is worth buying.” The model may return an answer that looks complete. But this approach has several problems:

The analysis chain is not transparent
Different dimensions are easily mixed together
There is no clear role division
There is little collision between positive and negative views
Risk warnings may become formulaic
It is hard to reproduce the same analysis workflow

TradingAgents-CN breaks financial analysis into multiple roles. Different Agents are responsible for different perspectives, and the final analysis is formed through collaboration, discussion, and summarization.

This is closer to a real investment research workflow. An investment judgment usually does not rely on one news item or one technical indicator. It needs company fundamentals, market environment, price movement, capital sentiment, policy risk, and position control.

What Multi-Agent Analysis Means

Multi-agent analysis is not simply asking several models to speak in turn.

The more valuable approach is assigning clear responsibilities to different Agents. For example:

Market analysis Agent: focuses on market trends, price changes, and the market environment
Fundamental analysis Agent: focuses on business, financial data, and long-term value
News analysis Agent: focuses on announcements, news, public sentiment, and event impact
Technical analysis Agent: focuses on trends, indicators, support and resistance, and trading signals
Risk management Agent: focuses on volatility, drawdown, positions, and uncertainty
Decision Agent: combines different views and forms a final judgment

This structure reduces the problem of a single model trying to “say everything in one breath.”

When different roles analyze the same target, the system can present multi-dimensional judgments more easily and expose disagreements more naturally. For learners, this is more useful than reading only a summary.

Why a Chinese Version Is Needed

Financial analysis is deeply connected to language and market context.

Chinese users care about different data sources, market habits, stock names, trading systems, news expressions, and financial terms compared with English environments. Using an English framework directly often creates several problems:

Chinese stock names and codes are not handled smoothly
A-share, Hong Kong stock, and US stock contexts are mixed
Chinese financial news is not understood stably
Domestic data sources are inconvenient to access
Output style does not match Chinese users’ reading habits

The value of TradingAgents-CN is that it adapts the multi-agent financial analysis workflow for Chinese users. It makes it easier for Chinese users to set up, run, and understand the entire trading analysis experiment process.

What It Can Be Used For

This project is more suitable for research and auxiliary analysis than for automatic order execution.

Suitable uses include:

Learning how multi-agent systems collaborate
Studying LLM performance in financial analysis
Organizing stock information from multiple perspectives
Comparing different models on investment research tasks
Building your own financial analysis Agent prototype
Reviewing historical information and risk points for a target
Practicing how to break investment research workflows into executable tasks

If you are learning quantitative trading, financial engineering, AI Agent systems, or LLM application development, this kind of project can help you understand the engineering structure behind an “AI investment research assistant.”

What It Is Not Suitable For

It is not suitable as a guaranteed profit tool.

It is especially not suitable for:

Buying or selling with full position based directly on output
Replacing your own risk judgment with model conclusions
Treating short-term price predictions as certain results
Ignoring transaction costs, slippage, and liquidity
Connecting to a real account without backtesting
Replacing a long-term investment strategy with one analysis result

LLMs are good at organizing information, generating explanations, and simulating reasoning workflows, but they do not naturally have stable market prediction ability. Financial markets contain strong noise, sudden events, and behavioral games. Model output can only be one reference material.

Difference from Normal Quant Frameworks

Traditional quantitative frameworks focus more on data, factors, backtesting, portfolio optimization, and trading execution.

For example, you may define strategy rules such as:

Moving average breakout
Momentum factor
Value factor
Volatility filter
Stop loss and take profit
Position management

Then you use historical data to backtest strategy performance.

TradingAgents-CN is more of an “agent analysis framework.” It focuses on how multiple LLM Agents collaborate around financial tasks, how to simulate investment research discussions, and how to organize news, fundamentals, technicals, and risk judgment.

The two are not replacements for each other.

A more realistic usage is: traditional quant systems handle verifiable rules and backtesting, while Agent systems handle information organization, report generation, viewpoint comparison, and decision support. Whether it can enter real trading still requires rigorous backtesting, risk control, and human review.

Difference from Directly Asking ChatGPT

Directly asking a model has the lowest barrier, but the process is loose.

You ask once, it answers once. Change the wording, and the conclusion may change. It is hard to ensure that it analyzes from the same dimensions every time, and hard to make it consistently play multiple mutually checking roles.

The value of TradingAgents-CN is that it structures the analysis process:

Roles are clearer
Steps are more reproducible
Information sources are easier to organize
Viewpoint collision is more natural
Risk checks can be handled separately
Output looks more like the result of an investment research workflow

This is useful for learning and research. You can observe how different Agents affect the final conclusion, replace models, adjust prompts, modify role division, and compare how results change.

Risks to Watch

First, data quality.

Financial analysis depends heavily on data. If market data, financial reports, news, or announcements are incomplete or delayed, even a fluent Agent analysis may be built on the wrong foundation.

Second, model hallucination.

LLMs may fabricate facts, misunderstand data meaning, or treat old information as new. When specific stocks are involved, you must verify against data sources.

Third, over-explanation.

Models are good at giving explanations that sound reasonable, but market price changes may not actually be caused by the reasons listed. Do not mistake post-hoc explanation for causal proof.

Fourth, the gap between backtesting and live trading.

Even if a strategy performs well on historical data, real trading still involves slippage, fees, liquidity, suspensions, limit-up and limit-down rules, and extreme market conditions.

Fifth, license and commercial boundaries.

The README mentions that the project uses a mixed license. Personal learning, research, and commercial use may have different conditions. If you plan to put it into a commercial product or service, read the project license carefully first.

Who Should Study It

TradingAgents-CN is suitable for:

Developers who want to learn AI Agent architecture
People studying LLM financial analysis capability
Quant traders who want to add natural-language analysis
Teams building investment research support tools
People interested in how multi-role collaboration affects decisions
Users who want to experiment with trading Agents in a Chinese environment

If your goal is only to get a simple buy/sell suggestion, this project is not the best way to use it. What is more worth studying is its workflow, roles, collaboration, and risk control, not the conclusion of one output.

Possible Extensions

Frameworks like this have many possible extension directions:

Connect more reliable data sources
Add local model support
Add a backtesting module
Refine rules for A-shares, Hong Kong stocks, and US stocks
Add industry analysis Agents
Add portfolio management and position control
Improve report citations and data traceability
Combine Agent conclusions with traditional quant signals

A truly valuable financial AI system usually does not let the model decide everything alone. It embeds the model into a workflow that is verifiable, traceable, and risk-controlled.

Reference

hsliuping/TradingAgents-CN

Final Thought

What makes TradingAgents-CN worth watching is not whether it can predict the next candlestick, but that it breaks financial analysis into a multi-agent collaboration workflow.

It is more reasonable to treat it as a learning and research tool than as an automatic money-making machine.

qmd: Local Markdown Document Search for AI Agents

Fri, 01 May 2026 03:12:57 +0800

qmd is a search tool for local Markdown documents, with AI Agents as its main target users.

It solves a specific problem: when a project contains many .md documents, AI coding assistants often do not know which file to read, which section to cite, or which instructions are current. Full-text grep can find keywords, but it does not understand meaning well. Putting all documentation into the context wastes window space and easily introduces irrelevant content.

The idea behind qmd is to index Markdown documents first, then return the most relevant snippets through a search interface for AI to use. It can be used as a command-line tool, integrated through an SDK, or exposed as an MCP Server for clients that support MCP.

What Problem It Solves

Real projects usually have more than one or two README files.

You may have:

Architecture notes
API documentation
Development conventions
Deployment procedures
Architecture decision records
Troubleshooting notes
Requirement documents
AI usage instructions
Toolchain notes and reminders

Humans can browse documents through directories, but AI Agents need a clear retrieval entry point. Otherwise, they may:

Read the wrong document
Miss key constraints
Use outdated instructions
Put irrelevant content into context
Invent rules in answers based on experience

This is where qmd is useful. It turns local Markdown documents into a searchable knowledge source, so AI can search first when it needs context, then answer or act based on matched snippets.

Search Approach

The README says qmd combines several retrieval methods:

BM25 keyword search
Vector search
LLM reranking

BM25 is good for clear keywords. If you search for a function name, configuration key, error code, or file name, it is usually direct and effective.

Vector search is better for semantic questions. For example, if you ask “how does this project handle permission validation,” the documentation may not contain that exact phrase, but it may contain related descriptions about authentication, access control, and role checks.

LLM reranking is used to reorder candidate results. The first two steps find potentially relevant content, and the model then judges which snippets best match the current question.

This combination is more suitable for AI Agents than plain keyword search, because Agent questions are often task intentions rather than fixed keywords.

Why Markdown

Markdown is the most common documentation format in development projects.

It is simple enough to store in Git and structured enough to include headings, lists, code blocks, links, and tables. For AI, Markdown is also easier to parse than PDFs, web snapshots, or screenshots.

Because qmd focuses on Markdown, it can process developer documentation more directly:

Split content by headings and paragraphs
Preserve code blocks
Preserve document paths
Return snippets suitable for citation
Let the Agent know which document an answer comes from

This is more stable than asking AI to randomly scan a repository, and it saves more context than putting every document into a prompt at once.

Three Entry Points

qmd provides three entry points: CLI, SDK, and MCP Server.

1. CLI

The CLI is suitable for direct terminal use and for scripts.

You can index a documentation directory and then search related content with commands. For developers, the CLI is the easiest way to validate the tool: first see whether it can find the correct documents, then consider integrating it into more complex workflows.

This kind of tool is useful inside local projects. For example, before changing code you can search design documents; before debugging, search troubleshooting notes; before writing an API, search API conventions.

2. SDK

The SDK is suitable for integrating qmd into your own tools.

If you are building an internal development assistant, documentation Q&A system, code review bot, or project knowledge base, you can call the search capability through the SDK instead of asking users to run commands directly.

The SDK gives more control over:

Search directories
Query content
Number of returned results
Result format
Whether to pass results to a model for summarization

This fits scenarios that need deeper integration.

3. MCP Server

MCP is the most valuable entry point for AI Agents.

Through MCP Server, clients that support MCP can call qmd as a document search tool. This lets an Agent search local Markdown documents before acting, instead of guessing project rules.

A typical workflow could be:

The user asks AI to modify a feature
AI calls qmd to search related design documents
qmd returns the most relevant Markdown snippets
AI modifies code based on those document constraints

This is more natural than manually pasting all rules into a new session, and it is better suited to long-term projects.

Suitable Scenarios

qmd is suitable for:

Projects with many Markdown documents
AI Agents that often need to look up project rules
Teams that want AI answers to cite local documents
Documentation spread across multiple directories
Reusing the same retrieval capability across CLI, SDK, and MCP
Reducing AI coding assistants’ tendency to guess project conventions
Connecting local knowledge bases to Claude Desktop, Claude Code, or other MCP clients

If your project only has one short README, directly asking AI to read the file is enough.

But if the documentation has grown to dozens or hundreds of files, or if you want the Agent to search documents before acting, this type of indexing tool becomes meaningful.

Difference from grep

Tools such as grep and rg are excellent for exact search.

If you know you need DATABASE_URL, authMiddleware, 404, or docker compose, keyword search is usually the fastest.

qmd is better when you do not know the exact words.

For example, you may ask:

What is the release process for this project?
What conventions apply when adding a new API?
Was the caching strategy documented before?
Which documents should AI read before changing code?
Where is the design background for a module?

These questions usually require semantic retrieval rather than matching one word. The BM25 + vector + reranking combination in qmd is intended to make these questions find the right context more easily.

Relationship with RAG

qmd can be seen as a lightweight RAG component for Markdown documents.

It does not try to build a full Q&A system for you. It focuses on one step: finding relevant document snippets. How those snippets are used afterward can be handled by CLI, SDK, an MCP client, or your own Agent workflow.

This positioning is practical. Many projects do not need a large knowledge base system; they only need AI to search local documents more accurately and quickly, then bring the results back into the current task.

Notes for Use

First, documentation quality still matters.

A retrieval tool can only find existing content. If the documents are outdated, duplicated, or contradictory, AI may still receive wrong context. Before connecting qmd to an Agent, clean up the key documents first.

Second, do not make the index scope too broad.

Indexing every Markdown file in the repository is not always better. Dependency documentation, temporary notes, and old draft solutions can pollute results. A better approach is to define which directories are trusted documentation sources.

Third, search results should preserve sources.

When AI uses document snippets, it should know which file and section they came from. This makes human review traceable and reduces the risk of “this looks like a document conclusion, but it is only a model summary.”

Fourth, do not replace human judgment completely.

qmd can improve context recall quality, but it is not a replacement for the source of truth. Important changes still require current code, test results, and the latest requirements.

Suitable Teams

If your team has already started putting AI Agents into daily development workflows, tools like qmd can be valuable.

They are especially suitable for teams that:

Write a lot of documentation
Have a long project history
Need both new people and AI to quickly understand context
Maintain architecture decision records
Have many Markdown convention documents
Want AI to check rules before modifying code

Its goal is not to make AI all-knowing. It is to make AI guess less and look things up more.

Reference

tobi/qmd

Final Thought

The value of qmd is that it turns local Markdown documents into a search entry point that AI Agents can reliably call.

When project documentation moves from “instructions for humans” to “a context source searchable by both humans and AI,” AI coding assistants can follow project rules more easily.

Claude Code Hooks Mastery: An Introduction to 13 Hook Lifecycle Events and Automation Control

Fri, 01 May 2026 03:11:27 +0800

claude-code-hooks-mastery is a learning project focused on Claude Code Hooks.

It is not just a collection of scattered scripts. It explains the Claude Code hook lifecycle, configuration methods, script patterns, and common automation scenarios in one place. For people who want Claude Code to be more controllable and more like an engineering assistant, this kind of material is worth reading.

Claude Code can already read code, edit files, and run commands by default. But if you want it to automatically check permissions, block risky operations, inject project rules, run tests, or remind it of team conventions at specific moments, chat instructions alone are not stable enough. The value of hooks is that they turn “rules I need to remind the AI about every time” into executable workflow.

What Problems Hooks Solve

After using Claude Code for a while, common pain points include:

Every new session needs the same project rules repeated
You worry that it may run commands it should not run
You want checks before and after file edits
You want formatting, tests, or security scans before committing
You want team conventions as fixed workflow instead of verbal reminders
You want context before and after tool calls for logging or blocking
You want complex tasks to trigger subagents or dedicated scripts

Hooks are designed for these “automatic actions at fixed moments.”

You can think of them as event hooks in the Claude Code workflow. When a session starts, a user submits a prompt, the model is about to call a tool, a tool call finishes, or an agent is about to stop, Claude Code can run the scripts you configured.

The 13 Hook Lifecycle Events

One of the main points in the project README is that it systematically covers the 13 Claude Code hook events.

These events span multiple stages, from session startup to tool calls, and from user input to agent termination. By purpose, they can be roughly grouped as:

Session startup: initialize environment and inject project context
User input: inspect prompts, add rules, and perform auditing
Before tool calls: permission checks, command blocking, and security validation
After tool calls: log results, trigger formatting, and run verification
Task ending: summarize, clean up, notify, or save state

This lifecycle design means you do not need to put every rule into one very long prompt.

For example, permission control should happen before tool calls. Formatting checks are better after file edits. Project rule injection is better at session startup or after user input. Putting rules at the right hook point is usually more reliable than stuffing everything into a system prompt.

Where Configuration Lives

Claude Code hooks are usually configured through settings files.

Common locations include:

User-level configuration: ~/.claude/settings.json
Project-level configuration: .claude/settings.json

User-level configuration is good for personal preferences, such as general security rules, command blocking, and log paths.

Project-level configuration is better for repository-specific rules, such as which tests must run, which directories cannot be edited, how generated files are handled, and which checks are required before commit.

If you use Claude Code in a team, it is better to put project-level configuration into the repository. That way everyone opens the project with the same AI collaboration constraints instead of relying on personal memory.

Why Single-File Scripts Matter

The project emphasizes UV single-file scripts.

The benefit is simple deployment. A single Python file can declare dependencies and run without maintaining a complex environment for one hook. This fits hooks well because many hooks only do one small thing:

Check whether a command is allowed
Determine whether a file path is safe
Read project rules and return them to Claude
Scan output for sensitive information
Run formatting or tests after edits
Write events to logs

The smaller a hook script is, the easier it is to maintain, and the less likely it is to become a new complicated system.

What Automation Can Hooks Do

claude-code-hooks-mastery shows many directions. In real work, the most common ones are below.

1. Permission and Security Control

This is the most direct use of hooks.

Before Claude Code executes a command, a hook can inspect the command content. If it contains high-risk actions such as deletion, reset, cleanup, or overwrite, it can block execution or require manual confirmation.

Similar rules can apply to file paths:

Do not modify production configuration
Do not write to secret files
Do not delete migration scripts
Do not touch specific directories
Do not run unapproved network commands

Putting this protection before tool calls is more reliable than writing “do not perform dangerous operations” in a prompt.

2. Context Injection

Many projects have fixed background information:

Tech stack
Coding conventions
Test commands
Branching strategy
Directory structure
Prohibited actions
Rules for generated files

Telling Claude Code this manually every time is annoying and easy to forget. Hooks can automatically inject necessary context at session startup or after the user submits a prompt.

This is like giving Claude Code a project-level work manual. It does not replace the README or development documentation, but it helps AI enter the correct state before executing a task.

3. Verification After Edits

After Claude Code modifies files, hooks can automatically trigger checks.

Common actions include:

Run formatting
Run lint
Run unit tests
Check type errors
Scan generated files
Validate Markdown or JSON format

This helps reduce low-level mistakes. When AI edits multiple files, a lightweight verification pass after modification can reveal problems earlier.

However, hooks should not run heavy tasks by default. Running the full test suite after every file change can make the experience slow. A better approach is to choose checks based on file type, directory, and task risk.

4. Team Rule Validation

If a team already has clear conventions, some of them can be placed in hooks.

For example:

Commit message format
Code style rules
Do not directly edit certain generated files
Documentation must be updated together
API changes must update tests
Certain directories can only be generated by specific tools

This makes Claude Code more like part of the team workflow rather than an unconstrained external assistant.

Of course, hooks should not replace CI. They are better for local reminders and early blocking. Final validation should still belong to CI, review, and test systems.

5. Subagents and Dedicated Tasks

The README also mentions subagent-related content.

This type of usage is suitable for sending complex tasks into more specialized workflows. For example, the main conversation can understand the requirement, while a hook or configuration triggers dedicated checking, auditing, summarizing, or documentation tasks.

For individual users, the first useful step is not complex agent orchestration. It is better to hand repetitive, clear, low-risk actions to hooks first. More complex automation can come after the rules become stable.

Statusline and Output Styles

The project also covers statusline and output styles.

This may look like a small experience detail, but it matters for long-term Claude Code usage. A statusline can show current context, task state, environment information, or hints. Output styles can make Claude Code answers fit your working habits better.

If you collaborate with AI in the same terminal every day, these details affect efficiency. Good status hints reduce mistakes and help you quickly determine whether the current session is in the right project, branch, and environment.

Do Not Make Hooks Too Heavy

Hooks are powerful, but they are not the place to put everything.

Good rules are:

High-frequency actions should be fast
Security blocking should be clear
Output should be short
Failure reasons should be readable
Scripts should have a single responsibility
Heavy checks should be explicit commands or CI tasks

If a hook takes more than ten seconds every time, users will soon want to disable it. If a hook has vague blocking rules, both Claude Code and the user will struggle to understand what to do next.

Hooks are best for tasks with clear boundaries: allow or reject, add context, log events, run lightweight checks, and suggest the next step.

Who Should Use It

If you only occasionally ask Claude Code to edit a small piece of code, you may not need to study hooks deeply yet.

But this project is useful if you:

Use Claude Code frequently
Often let AI modify real project code
Worry about AI running dangerous commands
Want to automatically inject team rules into AI workflows
Want checks to run automatically after edits
Want to turn repeated reminders into configuration
Are building a more stable AI coding workflow

Hooks are especially meaningful in collaborative projects. They can turn part of team experience into scripts instead of relying on every person to remind AI manually.

Notes for Use

First, start with security hooks.

Compared with complex automation, command blocking, path protection, and sensitive file checks are easier to implement and immediately reduce risk.

Second, commit project-level rules carefully.

.claude/settings.json affects everyone who uses the repository. Before committing rules, make sure they do not over-restrict normal development or depend on paths that only exist on your machine.

Third, keep hook output concise.

Claude Code consumes this output. If it is too long, it pollutes the context. If it is too vague, it does not guide the next step. It is best to return only the necessary judgment and next recommendation.

Fourth, keep hooks debuggable.

When hooks increase in number, problems can come from configuration, scripts, permissions, paths, dependencies, or Claude Code itself. Clear logs make later debugging much easier.

Reference

disler/claude-code-hooks-mastery

Final Thought

The value of Claude Code Hooks is turning “rules I hope AI remembers every time” into workflows that actually execute.

If you already use Claude Code in real projects, hooks are a key step from “a coding assistant that can chat” toward “a constrained engineering collaborator.”

Claude-Mem: Adding Cross-Session Long-Term Memory to Claude Code

Fri, 01 May 2026 03:01:02 +0800

Claude-Mem is a persistent memory system for Claude Code.

It tries to solve a very specific problem: every time an AI coding assistant starts a new session, it often forgets earlier architecture decisions, past pitfalls, project preferences, and implementation context.
If a project lasts for a long time, repeatedly explaining the same background becomes a waste of time.

The idea behind Claude-Mem is to compress Claude Code conversations into memories, store them in a local database and vector store, and then retrieve them later through a search tool.

What Problem Does It Solve?

Claude Code is good at code tasks, but session context is still limited.

Common pain points include:

A new session does not know what previous sessions did
Project design decisions need to be explained repeatedly
Problems that were already debugged are easy to repeat
Long-running tasks lack continuity
Project knowledge is hard to accumulate across conversations

Claude-Mem is designed around these problems.

It is not simply saving chat logs. Instead, it compresses conversations into memory fragments that are easier to retrieve. When needed later, semantic search can bring the relevant context back.

How It Works

From the README design, Claude-Mem mainly consists of several parts.

The first part is hooks.

It integrates with the Claude Code session flow and captures conversation data at the right time.

The second part is a background worker.

The worker processes raw conversation content into shorter, more searchable memories.

The third part is local storage.

The project uses SQLite for structured metadata and Chroma for vector indexing. This preserves basic session information while supporting semantic retrieval.

The fourth part is mem-search.

This is the query entry point for Claude Code. When old context is needed, it can search relevant memories through this tool.

The overall flow can be understood like this:

Claude Code sessions generate content
Hooks capture session data
The worker asynchronously compresses and organizes it
Memories are written to SQLite and Chroma
Later sessions retrieve them through mem-search

When Is It Useful?

Claude-Mem is suitable for long-running projects, not one-off small tasks.

For example:

A repository is developed over many days
The code structure is complex and has a lot of background
Project conventions, naming habits, and architecture choices need to be remembered
Claude Code is often used for bug fixes, features, and documentation
You want the AI to remember why something was changed earlier

If you only ask Claude Code to make a one-line change, long-term memory is not very meaningful.
But if you treat Claude Code as a long-term collaborator, it becomes useful.

Installation and Startup

The README gives a direct installation flow:

1
2

npm install -g claude-mem
claude-mem install

Start it with:

`1`	`claude-mem start`

Check status:

`1`	`claude-mem status`

Stop it when needed:

`1`	`claude-mem stop`

The goal behind these commands is to connect the memory system as a long-running local service to the Claude Code workflow.

How to Use `mem-search`

mem-search is the key entry point for retrieving memory.

It is not meant to replace ordinary search. It lets Claude Code query past conversations by meaning.

For example, Claude Code can search for:

Why a module was designed in a certain way
How a bug was debugged earlier
Naming rules agreed on in the project
Technical trade-offs discussed before
The background behind a refactor

This is different from simple keyword search.
If memory compression and vector indexing work well, you can retrieve semantically related content even if you do not remember the exact wording.

How Is It Different from Project Documentation?

Project documentation is good for stable conclusions.

For example:

Architecture notes
Deployment procedures
API conventions
Database structure
Development rules

Claude-Mem is better for context created during conversations.

For example:

Why a plan was rejected
How a temporary issue was worked around
The discussion behind an implementation
Project preferences not yet written into docs
Task background accumulated across multiple conversations

The two are not replacements for each other.
A good workflow is to write stable knowledge into project docs and use the memory system to help retrieve conversational context.

Things to Watch Out For

First, more long-term memory is not always better.

If every conversation is saved without distinction, later retrieval can become noisy. The most valuable memories are project decisions, implementation background, debugging history, and long-term preferences.

Second, memory cannot replace code and documentation.

Old context found by AI is only a reference. Final judgment still depends on the current code, test results, and latest requirements.

Third, pay attention to privacy and local data.

Since it stores conversation content, you should know which projects are suitable for it and which sensitive information should not enter the conversation.

Fourth, memory systems need maintenance.

As a project moves forward, old memories may become outdated. If outdated context is reused incorrectly, it can mislead later tasks.

Why This Kind of Tool Matters

AI coding tools are moving from one-off Q&A toward long-term collaboration.

In one-off Q&A, the model only needs to answer the current question.
In long-term collaboration, it needs to know project history, earlier decisions, team preferences, and pitfalls that have already been found.

This is where tools like Claude-Mem matter: they turn “remembering context” from a temporary chat capability into a local system that can be installed, run, and searched.

For real engineering projects, this is more practical than simply making the model context window longer.
Much information does not need to be stuffed into context all at once; it needs to be retrieved at the right time.

Who Should Try It?

You may want to try it if:

You use Claude Code frequently
You often work on the same project across multiple days
The project context is complex
You repeatedly explain the same background to AI
You want to preserve experience from conversations

If you only use Claude Code occasionally, or the project is small, you may not need this kind of system yet.

Reference

thedotmack/claude-mem

Final Thought

The point of Claude-Mem is not “saving chat logs.” It is helping Claude Code retrieve useful context in later tasks.

As AI coding moves from one-off tasks to long-running project collaboration, memory systems will become increasingly important.
They cannot replace documentation and tests, but they can reduce repeated explanations and make the AI feel more like an assistant that understands project history.

Claude.md Is Not Better When It Is Longer: How to Write Global Memory Files for AI Coding

Wed, 29 Apr 2026 21:07:37 +0800

I recently saw a discussion about global memory files for AI coding: after projects add files such as Claude.md or AGENTS.md, the results do not necessarily improve. In some cases, success rates may even drop while reasoning cost rises.

At first, this feels counterintuitive. We usually assume that if we give AI more project background, more rules, and more explanation, it should write code more accurately.
The real issue is that Claude.md is not an ordinary document. It is a global memory file that gets injected into the context on every conversation. The more it contains, the more the model has to read every time; the vaguer it is, the more judgment the model has to make; and if it contains workflows that should not always run, the model may trigger unnecessary actions in unrelated tasks.

So the hard part of writing Claude.md is not making it complete. It is deciding which pieces of information deserve to occupy context permanently.

What Claude.md Is

In AI coding tools, files such as Claude.md and AGENTS.md are essentially global memory files.

Normal conversation enters the context, but context length is limited. Once the conversation becomes long, historical content is compressed and some details are lost. A global memory file fixes important rules in place so the model can see them at the beginning of every task.

This means two things:

Content written there is harder to forget
Content written there also costs something on every task

It is not like a README that is read only when needed. It is more like a long-lived set of working constraints. Once something is placed there, it affects the model’s judgment by default.

Therefore, Claude.md is not a project introduction, not a collection of tips, and not a place to dump every development process. It should only store rules that the model is likely to violate repeatedly if it does not know them.

Why It Can Make Things Worse

A poorly written global memory file usually causes three kinds of problems.

First, it consumes context.

If Claude.md has one thousand lines, those lines stay in the model context for a long time. Code, error messages, and requirements that are actually relevant to the current task may get squeezed. Context is not free space. The larger the global rule file, the easier it is to dilute the current task.

Second, it can trigger unnecessary behavior.

For example, a global file might say:

1
2

Before every task, fully read the project directory.
After every change, run a complete end-to-end test.

These lines look responsible, but in a global memory file they become “do this for every task.” Even if the task is only changing one line of copy, the model may perform unnecessary exploration and tests because of these rules. The result is slower work, higher cost, and sometimes more interference.

Third, it increases the burden of judgment.

Statements like “keep code elegant, concise, maintainable, and extensible” sound correct, but they are weak constraints. Every time the model generates code, it has to decide what elegant or extensible means, without receiving a clear boundary.

A better approach is to write concrete prohibitions or counterexamples instead of abstract virtues. For example:

1
2
3

Do not add a generic abstraction for a single call site.
Do not change shared parsing logic without test coverage.
Do not put temporary scripts in the application source directory.

These rules are more specific and easier to follow.

What Should Go In

You can use a simple standard to decide whether something belongs in Claude.md:

If the AI will repeatedly make the same mistake without it, then it is worth writing down.

Content suitable for a global memory file usually has these traits:

It is durable
It is strongly tied to the current repository
It cannot be naturally inferred from the code structure
It clearly changes model behavior
It is preferably a constraint, prohibition, path rule, or fixed command

For example:

For all Hugo posts, only edit index.zh-cn.md and do not automatically generate other language versions.
Article front matter must include title/date/draft/tags/categories/slug/description.
Do not modify generated artifacts under public/.
On PowerShell, use scripts/deploy.ps1 for deployment.

These are not vague suggestions. They are tied to how the repository actually works. If the model does not know them, it may make mistakes; once it knows them, it can avoid real missteps.

What Should Stay Out

Many people turn Claude.md into a project manual. That is usually unnecessary.

Content that generally does not belong there includes:

Project vision and background
Large directory structure descriptions
Temporary task plans
One-off debugging steps
Abstract code quality slogans
Long workflows that are only needed in a few situations

For example, a description like “this is an e-commerce project with product, order, and user modules” helps very little with a concrete coding task. During real development, the model should rely on the current requirement, specification, code structure, and tests, not on a rough project introduction in global memory.

The same applies to directory structure. Unless a directory has a special convention, such as “shared components must be imported from this directory,” there is no need to write the entire tree into the file. The model can read the project directory itself. A static directory description is easy to become stale.

Workflows Belong in Skills or Commands

If a section says “first do this, then do that, then do the third thing,” it may not belong in Claude.md.

Long-lived workflows can be turned into skills, scripts, or commands. The benefit is that the global memory only needs to keep the name and trigger condition, while the detailed steps are loaded only when needed.

For example:

1
2

When the user asks to translate a Hugo post, use the post-translate skill.
When the user asks to deploy the site, run the hugo-rsync-deploy workflow.

This is lighter than putting the full translation and deployment processes into Claude.md. Global memory stays short, and detailed workflows live in triggerable tools.

Claude’s newer initialization flow is also moving in this direction. It does not only generate a Claude.md; it also tries to split reusable workflows into skills and fixed events into hooks. The underlying idea is clear: global memory should be an entry point, while details should be loaded on demand.

Claude.md Needs Iteration

Claude.md should not be written once and then ignored.

A better approach is to keep it short at first and let real tasks expose problems. If an error happens once, handle it manually. If the same kind of error appears two or more times, it may deserve to become a global rule.

This kind of iteration is more useful than writing a huge set of rules at the beginning. Early on, you do not know which rules are truly useful or which lines will become noise. As the project grows, collaboration increases, and the model’s behavior becomes clearer, you can gradually add the high-frequency problems.

There is also an important trend: the stronger the model, the shorter the global memory file should become.

Many requirements that once had to be written into prompts are now handled naturally by the model. Continuing to put those basic requirements into Claude.md only increases context load. Global memory should shrink as model capability improves, keeping only what is unique to this repository and cannot be inferred automatically.

A More Practical Way to Write It

When writing Claude.md, think in this order:

What special conventions does this repository have?
Which mistakes has the model made more than once?
Which directories, files, or commands must never be misused?
Which workflows should become skills, scripts, or commands instead of permanent context?
Which parts are merely introductions and can be deleted?

The final file may be only a few dozen lines. It does not need to fully explain the project. It needs to constrain behavior precisely.

A good Claude.md might look like this:

# Working Rules

- Only edit files related to the current task.
- Do not modify generated artifact directories such as public/ or resources/.
- Hugo post rewrites only process index.zh-cn.md and do not generate other language versions.
- If deployment is involved, run the Hugo build first, then execute the existing rsync script.
- When there are existing user changes, do not revert them. Continue from the current state.

It is short, but every line affects real behavior. That is the kind of content worth keeping in context permanently.

Final Thought

The value of Claude.md is not to make AI “know more.” It is to make AI “avoid fixed mistakes.”

It is not a knowledge base or project encyclopedia. It is a long-lived constraint file for AI coding.
The more specific, shorter, and closer to real mistakes it is, the more useful it becomes. The more generic, longer, and more like a project introduction it is, the more likely it is to slow the model down or even make results worse.

Treat global memory as a scarce resource, not an unlimited scratchpad. That may be the most important principle for writing a good Claude.md.

Codex Is Starting to Control the Computer. What Does That Mean for the Future?

Wed, 29 Apr 2026 11:28:25 +0800

The most important part of this Codex update is not that it added another ordinary button. It is that Codex is starting to move toward “controlling the computer.”

In the past, using AI usually meant asking questions in a chat box, copying, pasting, and then manually operating software.
Now that boundary is expanding: AI does not just answer you. It can operate desktop applications according to your goal.

In the short term, this is a new feature. In the long term, it may change how many people use computers.

What This Feature Is

Simply put, Codex’s computer use capability lets it access and operate the desktop environment.

It can do things such as:

select and control an application
receive tasks in natural language
open browsers, AI tools, local files, or other software
enter text, click buttons, and wait for results
connect multiple steps into one task
keep running in the background without requiring the user to follow every step manually

Its role is not just to write a piece of text for you, but to complete an operation flow for you.

That is the key difference between an Agent and an ordinary chatbot:
a chatbot mainly gives answers; an Agent is closer to “receiving a goal and then executing it.”

Why This Matters

In the past, much automation required you to know how to write scripts.

For example, suppose you want to complete a cross-software workflow:

open a web page
find information
copy content
pass it to another AI tool
save a file
open the local directory and check the result

To automate this traditionally, you might need browser scripts, APIs, local programs, and even window automation.

But many ordinary users do not know how to write these things.
Even if they do, it may not be worth writing a script for a temporary task.

This is where computer use matters: it pushes “script-like capability” toward natural language.

You do not necessarily need to tell it exactly where to click.
You can tell it what result you want and let it try to complete the task.

Workflows It May Change

I think the first workflows to change will not be extremely serious or high-risk work, but the tasks that are annoying, fragmented, repetitive, and not worth writing a dedicated program for.

1. Moving Information Across Software

The most typical case is moving information between applications.

Previously, you might switch back and forth between a browser, a document, a chat window, and a local folder.
In the future, you can hand this kind of task to an Agent:

find a certain kind of information
summarize it into a document
save it to a specified directory
open the result for you to review

This work is not hard, but it consumes attention.
The value of an Agent is that it absorbs these small operations.

2. Coordination Between Multiple AI Tools

Many people’s real workflow is no longer based on a single AI tool.

It may look like this:

one tool writes code
one tool researches information
one tool generates images
one tool organizes documents

Previously, these tools were connected by manual copy and paste.
In the future, an Agent can become the middle layer: it opens tools, passes context, waits for output, and organizes results.

This can turn “multiple AI tools working together” from a manual process into a semi-automated process.

3. Office Software Automation

Spreadsheets, presentations, documents, and email share one trait: they are powerful, but many operations are fragmented.

If Agents can reliably control this software, the barrier to office automation will drop noticeably.

You do not need to remember where a menu is or learn complicated shortcuts.
You only need to describe the goal, such as:

turn this spreadsheet into a monthly report
make a one-page summary from this document
combine these materials into a clearly structured explanation

The tedious button operations will gradually be hidden behind natural language.

What It Means for Ordinary Users

For ordinary users, this kind of feature may have a more direct impact than “the model got a bit smarter.”

Because it lowers the operation barrier, not just the knowledge barrier.

Many people can describe what they want, but they do not know where to click or how to combine features inside software.
If Agents can take over this part, using a computer may become:

1
2
3

I describe the goal
Agent operates the software
I check the result

That is closer to real productivity than simple chat.

Its Impact on Software

If this kind of Agent capability continues to mature, software itself will also be affected.

In the past, software design mainly served human clicking.
In the future, software may also need to serve Agent operation.

This means:

interface elements need to be clearer
operation feedback needs to be more stable
local permissions need to be more granular
software may provide interfaces better suited for Agent calls
users may care more about whether software can be operated smoothly by AI

In the long run, the boundaries between applications may become thinner.
Users may care less about “which app should I open” and more about “what task do I want to complete.”

Do Not Overhype It Yet

Of course, it is not time to fully let go yet.

This kind of capability still has several clear limitations:

stability still needs observation
complex tasks may fail in the middle
permission boundaries must be handled carefully
account, payment, and file deletion operations should not be delegated casually
quota consumption is not something you can completely ignore

So at this stage, the best use case is not letting it take over the whole computer, but letting it handle low-risk, reviewable, step-heavy tasks.

For example:

organizing materials
generating drafts
moving content across tools
opening and checking files
running semi-automated workflows that can be reviewed by a human

One Last Line

The real importance of this Codex update is that it pushes AI from “answering questions” toward “operating the environment.”

In the short term, it is a computer use feature.
In the long term, it may mark a shift in how personal computers are used.

In the future, we may spend less time remembering buttons, finding menus, and switching windows.
More often, we will describe the goal, let an Agent execute it, and then let humans make the final judgment.

Why Does a Codex Skill Exist in the Directory but Still Not Show Up?

Wed, 29 Apr 2026 11:18:00 +0800

This problem was easy to miss: several skills were already placed under ~/.codex/skills, but after opening a new Codex thread, the sidebar still showed only a small subset of them.

At first, it looked like a cache or indexing issue. The real cause was more specific: several SKILL.md files started with a UTF-8 BOM. Codex 0.111.0’s skill loader did not skip that byte sequence, so it misjudged the files as having no valid YAML front matter.

Symptom

The local directory contained these skills:

~/.codex/skills/git-commit-push/SKILL.md
~/.codex/skills/hugo-rsync-deploy/SKILL.md
~/.codex/skills/bilibili-speech-transcriber/SKILL.md
~/.codex/skills/product-cutout-normalize/SKILL.md

But after opening a new thread, the actually exposed skills were only:

1
2

bilibili-speech-transcriber
product-cutout-normalize

In other words, a file existing on disk does not mean the current session can load it successfully. Codex parses the front matter of each SKILL.md first. If parsing fails, that skill is excluded directly.

Investigation

Starting a fresh session with codex exec showed a more direct error. In VS Code or other IDEs, these logs may not be visible:

1
2

failed to load skill C:\Users\knightli\.codex\skills\git-commit-push\SKILL.md: missing YAML frontmatter delimited by ---
failed to load skill C:\Users\knightli\.codex\skills\hugo-rsync-deploy\SKILL.md: missing YAML frontmatter delimited by ---

Visually, these files seemed to have a normal header:

---
name: post-rewrite
description: ...
---

The real problem was at the byte level.

The beginning of a failing file was:

`1`	`EF-BB-BF-2D-2D-2D`

The beginning of a file that loaded correctly was:

`1`	`2D-2D-2D`

2D-2D-2D is ---. The preceding EF-BB-BF is the UTF-8 BOM.

Cause

In Codex 0.111.0, the skill loader expects the first byte of SKILL.md to be the first - in ---.

If the file starts with a UTF-8 BOM, the actual beginning becomes:

`1`	`BOM + ---`

So the loader thinks the file does not start with the front matter delimiter and reports:

`1`	`missing YAML frontmatter delimited by ---`

The skill content was not wrong, and the directory was not wrong either. A small encoding detail prevented the parser from recognizing the file.

Fix

Convert the affected SKILL.md files to UTF-8 without BOM.

In PowerShell, this can be done like this:

$paths = @(
  'C:\Users\knightli\.codex\skills\git-commit-push\SKILL.md',
  'C:\Users\knightli\.codex\skills\hugo-rsync-deploy\SKILL.md',
)

$utf8NoBom = New-Object System.Text.UTF8Encoding($false)

foreach ($p in $paths) {
  $text = [IO.File]::ReadAllText($p, [Text.Encoding]::UTF8)
  [IO.File]::WriteAllText($p, $text, $utf8NoBom)
}

After processing, the file header should change from:

`1`	`EF-BB-BF-2D-2D-2D`

to:

`1`	`2D-2D-2D`

Verification

After restarting a Codex session, the visible skills were restored to:

git-commit-push-zh
hugo-rsync-deploy
bilibili-speech-transcriber
product-cutout-normalize

If the sidebar still shows the old list, close the current Codex sidebar or window and reopen the project. The skill list is usually loaded when the session starts, so changes made in the middle of a session may not refresh immediately.

One Last Line

This kind of issue is easy to mistake for “Codex did not re-index” or “the skill was not installed correctly.”

When troubleshooting, check these three things first:

whether SKILL.md is really in the correct directory
whether the file has valid --- front matter at the top
whether the file is UTF-8 without BOM

The key in this case was the third point: the file looked fine, but its first byte was not -, so Codex did not treat it as a valid skill.

What Is the Difference Between ~/.codex/skills and Project .codex/skills in Codex

Wed, 29 Apr 2026 11:08:00 +0800

When organizing Codex skills, people most often get stuck on two questions:

What is the difference between ~/.codex/skills and project/.codex/skills?
Why does a skill exist in the directory but not appear in the current session?

Here is the short version.

The Difference

The simplest way to remember it:

~/.codex/skills is your global skill library
project/.codex/skills is the local skill library for that repository

`~/.codex/skills`

Use it for:

skills you personally reuse across projects
general workflows that are not tied to a specific repository
workflows that clearly belong to your own habits

For example:

post-rewrite
post-translate
git-commit-push
hugo-rsync-deploy
bilibili-speech-transcriber

The key trait of this kind of skill is: it still makes sense outside the current project.

`project/.codex/skills`

Use it for:

workflows that only apply to this repository
rules tightly coupled to the current project structure, scripts, or templates
skills that should be shared by the team

For example:

a publishing workflow specific to this repository
a generation template that only works in this project
automation steps tightly bound to private project scripts

The key trait of this kind of skill is: it stops being meaningful once it leaves this repository.

When to Use Global and When to Use Project Skills

This rule of thumb is enough:

If it is about your personal habits, put it in ~/.codex/skills
If it is about repository rules, put it in project/.codex/skills
If it can be reused across projects, prefer global
If it should be shared by multiple people and evolve with the repository, prefer project-level

The Current Repository

Based on the current state:

your machine has ~/.codex/skills
this repository does not have .codex/skills

So right now, you mainly rely on global skills.

That means workflows such as post-rewrite, post-translate, and git-commit-push are currently more like part of your personal workflow, not something explicitly bundled with this repository.

Why a Skill Exists on Disk but May Not Appear in the Current Session

There are two different things here:

Existing on disk: the skill file exists in a local directory
Exposed to the session: the current session registered it into the available skill list

These are not the same thing.

So this can happen:

a skill already exists under ~/.codex/skills
but it does not appear in the list after /

This usually does not mean the skill is broken. More often, it means: the current session has not re-indexed it.

How to Make a Skill Available in the Current Session

The practical checklist is short.

1. Put It in the Right Directory

Global:

`1`	`~/.codex/skills/<skill-name>/SKILL.md`

Project-level:

`1`	`project/.codex/skills/<skill-name>/SKILL.md`

2. Make the `SKILL.md` Header Recognizable

At minimum, it needs:

---
name: your-skill-name
description: What this skill does
---

3. Open a New Session After Creating or Editing It

In many cases, a skill does not appear because the current session already fixed its available skill list when it started.

So if you create a skill in the middle of a session, it may already exist on disk, but this session may not recognize it.

The most reliable workflow is:

Put the skill in place
End the current session
Re-enter the project
Open a new session
Check whether it appears under /

4. Put Project Skills in Place Before Starting

If you want project/.codex/skills to be recognized more reliably, put those skills into the project before entering the repository and starting the session.

One Last Line

The shortest conclusion is:

~/.codex/skills is your personal skill library
project/.codex/skills is the repository’s local rule library
a skill existing in the directory does not mean the current session will always show it
the most common fix is to put it in the right directory, write a valid SKILL.md, and then start a new session

Ralph and Multi-Agent Collaboration: How to Keep AI Working Reliably Over Long Tasks

Mon, 27 Apr 2026 08:19:02 +0800

If you have been using coding agents lately, you quickly run into a very practical question: AI can work, sure, but how do you keep it working for hours without drifting, forgetting requirements, or redoing the same work?

That is the real question behind many discussions around Ralph and multi-agent collaboration. The point is not simply to compare which model is stronger. The more useful question is this: how do you design a workflow that lets AI stay stable during long tasks?

If you break the problem down, there are usually two main routes:

The Ralph approach: keep starting fresh sessions and connect context through the filesystem
The multi-agent approach: let a lead agent coordinate while worker agents split the execution

Put more simply, the question is not “which model is more powerful,” but “how do you organize AI so it behaves more like a small team that can keep delivering?”

01 Why Long Tasks Go Off the Rails

In short tasks, many problems stay hidden. You give an instruction, the model reads a few files, changes a few lines, and the job is done.

Once the task gets longer, the common failure modes start to pile up:

Conversations grow longer and context starts to bloat
Earlier requirements get squeezed out by newer information
One agent has to plan, implement, and test at the same time
Without a clear acceptance step, “it is done” often just means “it says it is done”

So when AI runs for a long time, the real challenge is often not single-shot model quality. It is task slicing, state handoff, role separation, and feedback loops.

02 The Ralph Approach: Break Long Tasks into Short Rounds

Ralph is a good fit when the main problem is dirty, overloaded context.

Its core pattern is straightforward:

Keep launching new agent sessions in a loop
Let each round handle only one small enough task
Store cross-round state in files instead of forcing everything into one conversation

The benefit is immediate: every round starts with fresh context, so the session stays more focused and is less likely to get dragged down by old history.

If you have already looked at Ralph-style projects, the structure will feel familiar:

Current tasks live in structured files
Intermediate learnings go into progress files
Code changes stay in git history

In other words, Ralph does not try to make one agent remember everything forever. It externalizes memory on purpose so the session itself can stay lighter.

This kind of setup works especially well when:

The work can already be split into small stories
Each story can fit inside one context window
The project already has tests, typecheck, or other checks

It is a solution to the problem of how to keep AI moving forward one round at a time.

03 The Multi-Agent Approach: Split the Work One Agent Cannot Handle Alone

The other route is multi-agent collaboration.

In this kind of workflow design, the more promising pattern is usually this: the lead agent should not do all the work directly. Instead, it coordinates while other agents handle development, testing, checking, and acceptance.

That differs from Ralph in an important way:

Ralph feels more like serial iteration
Multi-agent work feels more like parallel division of labor

When the task naturally contains different roles, multi-agent collaboration becomes easier to use. For example:

One agent breaks down the task and writes the execution plan
One agent implements the actual change
One agent tests and validates the result
One agent checks whether the result still matches the original goal

The point is not to open more windows for the sake of it. The real value is role separation. Tasks that used to be piled onto one agent can now be split into clearer stages.

Once the role boundaries are clear, several problems become lighter:

The person writing does not have to be the same one reviewing
The testing side does not have to reconstruct the full requirement every time
The lead agent is less likely to drown in implementation detail

This is a solution to the problem of how to make AI cooperate more like a small team.

04 The Real Key Is Not Parallelism, but Task Design

Whether you choose Ralph or multi-agent collaboration, the easiest thing to underestimate is this: workflow design matters more than opening more agents.

If the task split is wrong, adding more agents only parallelizes the confusion.

A more stable breakdown usually has a few traits:

One task maps to one clear objective
One role owns one category of output
Every round has a clear done condition
The output of one round can be consumed directly by the next

For example, instead of giving AI one giant instruction like “build the whole feature,” a steadier structure is often:

Break out requirements and boundaries first
Then split implementation
Then split testing
Then make acceptance its own step

The advantage is that when something goes wrong, it becomes easier to tell whether the problem sits in understanding, implementation, testing, or delivery criteria.

05 Why Acceptance Matters So Much

Many AI workflows fail not because nothing happened earlier, but because the last step lacked a genuinely independent confirmation pass.

In long tasks, there is often a wide gap between “a result was produced” and “the result is actually usable.”

So one especially important direction is to separate development from acceptance. Even without a complex process, it is worth asking at least these questions:

Did it really complete the original task?
Did it only patch the surface without fixing the root cause?
Did testing cover only the happiest path?
Did the upstream requirement get silently changed along the way?

Without that layer, AI can easily keep declaring success inside a long workflow.

06 How to Choose Between the Two

If you want a fast rule of thumb:

If your main pain is context bloat and long-session drift, start with Ralph
If your main pain is one agent wearing too many hats, start with multi-agent collaboration

More specifically:

Ralph fits work that is clear, granular, and easy to move forward round by round
Multi-agent collaboration fits work with strong role boundaries and a need for parallelism and cross-checking

In practice, these two approaches are not always competitors. A mature setup often combines them:

Use a Ralph-style outer loop to push the larger task forward
Use multi-agent collaboration inside each round for research, implementation, testing, and acceptance

That gives you both better control over long context and better collaboration inside a single round.

07 One-Sentence Summary

What makes these approaches worth studying is not that they recommend Ralph or multi-agent collaboration in isolation. It is that they make one practical truth very clear: keeping AI stable over long tasks depends less on the model itself and more on whether you designed context, tasks, roles, and acceptance well.

If you are already asking Claude Code, Codex, or other coding agents to handle longer real-world tasks, this kind of workflow thinking is often more valuable than simply switching to a stronger model.

What Ralph Is: Turning Claude Code and Amp into a Repeatable Autonomous Development Loop

Mon, 27 Apr 2026 08:08:55 +0800

If you have been paying attention to long-running coding agent workflows lately, snarktank/ralph is a project worth a close look. It is not another model wrapper or another chat UI. Instead, it organizes Claude Code or Amp into an autonomous loop that keeps running through stories in a PRD until everything is done.

Its core idea is simple: do not force the same agent to keep working inside an increasingly long and messy context. Start a brand-new AI coding session for every iteration instead. That keeps context from bloating and makes task boundaries much clearer.

01 What Ralph Is

Ralph describes itself very clearly: it is an autonomous AI agent loop that repeatedly runs an AI coding tool until the items in a PRD are complete.

The repository currently supports two tools:

Amp CLI
Claude Code

Each iteration starts a fresh instance. In other words, it does not depend on one endlessly extended conversation. Instead, it keeps memory in external state:

git history
progress.txt
prd.json

That detail matters a lot. When people let an agent run on large tasks, the main problem is often not that the model cannot code. It is that the session becomes heavier over time, starts losing context, forgets requirements, and repeats work. Ralph is designed almost entirely around that problem.

02 How It Works

Ralph’s workflow has three steps.

1. Write a PRD first

The README suggests starting with the bundled prd skill to generate a requirements document and break the feature into smaller stories.

2. Convert the PRD into `prd.json`

Then the ralph skill converts the Markdown PRD into a structured prd.json. That file stores the user stories and whether each one has passed.

3. Run the loop script

The actual execution is handled by ralph.sh. The commands look like this:

1
2

./scripts/ralph/ralph.sh [max_iterations]
./scripts/ralph/ralph.sh --tool claude [max_iterations]

The default is 10 iterations. In each round, Ralph roughly does the following:

Create a branch from branchName
Pick the highest-priority story where passes: false
Implement only that story
Run quality checks such as typecheck and tests
Commit if the checks pass
Update prd.json
Append learnings to progress.txt
Continue to the next round

So Ralph is not trying to finish everything in one go. It compresses work into many small loops that can fit inside a single context window.

03 What Makes Ralph Interesting

1. Every round uses fresh context

This is Ralph’s defining design choice. The README emphasizes that every iteration is a brand-new AI instance, and cross-iteration memory lives only in git, progress.txt, and prd.json.

That is very different from the common pattern of keeping Claude Code or another tool inside one long conversation. Once tasks get larger, that approach often slows down under its own history and gradually loses focus. Ralph accepts that no single round should remember everything, then moves memory into files instead.

2. It forces tasks to stay small

The docs explicitly say that each PRD item must be small enough to finish within one context window. Tasks like adding a filter, updating a server action, or adding a database column are about the right size. Tasks like rebuilding the whole API or creating an entire dashboard are too large.

That constraint is practical. Many autonomous agent loops fail not because the loop is bad, but because the task slicing is too coarse and each round carries too much at once.

3. It preserves learnings, not just code

Beyond progress.txt, the README also stresses updating AGENTS.md. The reason is straightforward: future iterations and future developers will read those notes, so patterns, gotchas, and conventions discovered in each round should be written down in the project itself.

Put differently, Ralph is not only trying to keep an agent coding continuously. It is also trying to help the agent build working memory about the codebase over time.

04 When It Fits Best

Ralph is a good fit when your task looks like this:

It can already be broken into a clear set of user stories
The codebase has reliable feedback loops such as tests, typecheck, or CI
You want the agent to keep moving forward without putting everything into one long conversation
You are fine with iterative progress instead of demanding a one-shot completion

On the other hand, if the requirement is still vague, or the work depends on frequent discussion and constant changes of direction, Ralph may not be the first thing to reach for. It fits better once the requirements are already shaped and execution needs to be steady.

05 How It Differs from Normal Claude Code Usage

With plain Claude Code, the usual pattern is simple: open a session and let it keep reading code, editing files, and running commands. That works very well for small and medium tasks, but larger tasks often hit two problems:

Context keeps growing
Intermediate decisions are harder to preserve in a structured way

Ralph turns Claude Code or Amp into something closer to a batch executor:

The task source is prd.json, not ad hoc chat instructions
Each iteration recognizes only one story
Completion state is written back to files
Learnings go into progress.txt
Code changes are preserved in git

So in practice, it feels less like a new AI assistant and more like an iteration controller added on top of a coding agent.

06 One Important Requirement

Whether Ralph works well depends less on the loop itself and more on the quality of your feedback loops. The README says this very directly: without typecheck, tests, and CI, errors will compound across later iterations.

For frontend tasks, the repository even recommends adding browser verification to the acceptance criteria. Without real verification, an agent can easily confuse “it looks done” with “it actually works.”

That point is important. Ralph is not magical automation. It is more like a force multiplier for the engineering discipline you already have. If your project already has clear task breakdowns and reliable checks, Ralph becomes much more useful. If those foundations are missing, the loop will only repeat the confusion.

07 One-Sentence Summary

What makes Ralph worth studying is not that it introduces a huge amount of new infrastructure. It takes a simple but useful idea and turns it into a practical workflow: let Claude Code or Amp handle one small story per round, keep focus with fresh context, and preserve continuity through git, prd.json, and progress.txt.

If you are already using coding agents in real projects and keep getting stuck on how to push long tasks forward reliably, Ralph’s approach is well worth borrowing.

References

GitHub repository: https://github.com/snarktank/ralph
Interactive flowchart: https://snarktank.github.io

nuwa-skill: Turning "distilling a person" from an idea into an executable workflow

Wed, 22 Apr 2026 16:20:00 +0800

[alchaincyf/nuwa-skill](https://github.com/alchaincyf/nuwa-skill) can easily make people think of one thing first: using AI to answer in a famous person’s voice. But what makes it genuinely interesting is not whether it sounds convincing. The key is that it tries to turn “distilling how a person thinks” into a repeatable workflow.

If that works, the value goes far beyond a few entertaining character prompts. It means taking someone’s judgment framework, priorities, common heuristics, and communication habits, and turning them into a skill that can be called again and again. What you want is not a sentence that sounds like something a person might say, but something closer to a working interface for “if this person analyzed the issue, what would they look at first, how would they trade things off, and what would they question?”

It solves modeling, not imitation

Many so-called persona prompts are basically just style overlays.

They usually ask the model to:

speak in someone’s tone
quote their signature lines more often
imitate the phrasing they use in public

That looks great in demos, but it often falls apart in real work. The reason is simple: tone is surface-level, while judgment structure is the core. A person is memorable not because they like a few certain words, but because they reliably approach problems in certain ways.

The direction of nuwa-skill is closer to extracting those stable methods. In other words, it cares less about “how to sound like them” and more about “how to think like them.”

A more complete workflow

From the repository description, nuwa-skill aims to build an end-to-end flow: enter a person’s name, then automatically do the research, extraction, and validation, and finally organize the result into a skill that can be used inside Claude Code.

There are several important shifts behind that idea.

First, it assumes the person being distilled does not have to be your coworker. Many people first encounter this kind of idea in the form of “capture how a strong teammate works.” That is valuable, but it is also limited: the sample pool is small, and it usually only covers internal team experience. nuwa-skill expands the target set to a much broader range of people, such as founders, investors, scientists, product managers, and writers.

Second, it emphasizes automation rather than asking the user to handcraft prompts. What really makes this kind of capability practical is not beautiful prompt wording, but whether you can consistently do source gathering, viewpoint synthesis, pattern extraction, and result validation. As soon as any one of those steps depends entirely on manual work, the reuse cost rises quickly.

Third, it tries to make the output a skill rather than a one-off conversation. The former can be reused, combined, and iterated on. The latter usually only works in the current context and falls apart after a few turns.

Why this direction matters

If you treat AI as a question-answering machine, the natural use case is “give me an answer.” But if you treat AI as a workbench, the question becomes “give me a way to look at this problem.”

That is where the value of nuwa-skill leans.

For example, when facing a product decision, what you want may not be one standard answer. You may want several sharply different analytical frames:

one person starts with long-term compounding
one starts with resource constraints
one starts with consistency of user experience
one starts with timing of market entry

If those frames can be packaged reliably, AI stops being “something that writes a paragraph for you” and becomes “something that helps you switch perspectives quickly.” That is much more useful than simply imitating famous quotes, because it directly affects decision quality.

Its most compelling part: turning tacit knowledge into callable assets

Many high-value capabilities are hard to write down as SOPs in the first place.

Why someone consistently judges better than others is often not because they know more explicit rules, but because they have built a tacit filtering system through years of practice:

which signals deserve attention first
which noise should be ignored immediately
which questions should be broken apart
which questions should be inverted
which conclusions must wait for more evidence

This kind of ability is hard to preserve because people cannot always explain it clearly themselves. That is exactly why structured extraction is so valuable. What makes nuwa-skill appealing is that it is not trying to move around surface knowledge. It is trying to reorganize cognitive habits.

Where it fits best

I think this kind of skill is especially useful in a few scenarios.

1. Multi-perspective review before a decision

If you already have a plan but worry that you are only thinking along the path you already know, switching into different “persona perspectives” to review the same issue is more valuable than asking the model to keep expanding your original wording.

2. Learning the judgment framework of a certain kind of expert

Many people learn from experts by collecting quotes, watching interviews, and copying summaries. In the end, they often only remember a few nice lines. Once a thinking pattern becomes a skill, learning becomes much closer to “repeatedly invoking it with real questions” rather than “making a pile of static notes.”

What teams truly lack is often not just documentation, but a shared answer to “how do we usually think when we hit a problem?” If this workflow matures further, it could also be used in reverse to preserve the methods of strong internal operators. It is just clear that the project does not want to limit the idea to internal use cases.

The hard part of projects like this

Of course, an attractive direction does not mean the hard problems are already solved.

The real challenge is never simply installing a skill. It is things like:

whether the sources are reliable enough
whether the extracted patterns are stable rather than illusions from scattered text
whether the model is actually using a person’s framework or merely repeating common impressions
whether the boundaries between different personas will blur inside the model

In other words, the key question is not “can it generate something that sounds plausible?” It is “can the cognitive framework produced by this skill survive reuse across many tasks?” If the project keeps going deeper on validation, its credibility will improve a lot.

Why it goes beyond a prompt template library

In the past, many projects handled this kind of capability as a prompt template library: one persona, one prompt, and the user copies it into a chat. The problem is that a template library is still basically a static asset. It updates slowly, validation is weak, and it is hard to turn it into a complete production workflow.

What nuwa-skill pushes further is that it turns “persona distillation” from a template problem into a workflow problem.

Once the center of gravity shifts from “write a prompt” to “systematically generate, validate, and iterate on a persona skill,” the whole thing starts to look more like engineering than inspiration. For anyone who wants to use it over the long term, that is the more important shift.

Closing

nuwa-skill is interesting not because it turns AI into a celebrity impression show, but because it pushes “how to learn how someone thinks” one step closer to something executable, reusable, and iterable.

If many persona prompts solve “how to talk like someone,” what this project wants to solve is “how to look at problems the way someone does.” The former is great for demos. The latter is much closer to a real productivity tool.

References

GitHub repository: https://github.com/alchaincyf/nuwa-skill
Project README: https://github.com/alchaincyf/nuwa-skill/blob/main/README.md
Skill definition: https://github.com/alchaincyf/nuwa-skill/blob/main/SKILL.md

RAGFlow Project Notes: Features and Usage of an Open-Source RAG Engine

Wed, 15 Apr 2026 22:09:25 +0800

RAGFlow is an open-source RAG engine from infiniflow. Its goal is not merely to provide a thin “upload documents and ask questions” shell, but to bring document parsing, chunking, retrieval, reranking, citation tracing, model configuration, agent capabilities, and API integration into one complete workflow.

If you are building an enterprise knowledge base, document Q&A, a support assistant, internal information retrieval, or you want to give an LLM a more reliable context layer, RAGFlow is one of the open-source options worth serious attention.

01 What Problem RAGFlow Solves

Most RAG systems run into three common issues:

Document parsing is unstable, especially for PDFs, scanned files, tables, images, and complex layouts.
Chunking strategy is opaque, so retrieval may look correct while the actual context is incomplete.
Answers lack trustworthy citations, making it hard for users to verify where the response came from.

RAGFlow focuses on exactly these problems. The project README emphasizes Deep document understanding, template-based chunking, chunk visualization, citation grounding, and multi-path retrieval with reranking. In other words, it cares more about “high-quality input leads to high-quality answers” than simply wiring a vector database to a chat UI.

02 Core Features

1. Deep Document Understanding

RAGFlow can extract knowledge from complex unstructured data. The README lists formats such as Word, PPT, Excel, TXT, images, scanned documents, structured data, and web pages.

This matters a lot for enterprise knowledge bases. Real-world material is rarely clean Markdown. It is usually a mix of contracts, reports, tables, scanned PDFs, product manuals, screenshots, and web content. If parsing quality is weak, retrieval and LLM answers will both suffer.

2. Template-Based Chunking

RAGFlow provides template-based chunking. The value here is that chunking is not a black box; different document types can use different strategies.

For example, articles, papers, tables, Q&A documents, image explanations, and contract clauses all need different chunk boundaries and granularity. Template-based chunking helps reduce problems like broken sentences, lost table context, and separated headings and body text.

3. Traceable Citations

RAGFlow emphasizes grounded citations, meaning answers can be traced back to source passages. It also offers chunk visualization, making it easier for people to inspect and adjust parsing and chunking results.

This is especially important in production. Internal enterprise Q&A is not only about producing something that “looks right”; it also has to be verifiable. For policy, compliance, finance, technical documents, and customer support content, citations and traceability are close to mandatory.

4. Automated RAG Workflow

RAGFlow turns the RAG lifecycle into a more complete workflow:

Create a knowledge base
Upload or sync data
Parse documents
Review and adjust chunks
Configure LLM and embedding models
Run multi-path retrieval and reranking
Build chat assistants
Integrate through APIs into business systems

That makes it closer to a RAG platform than a single library. For teams, both the UI and the API matter: non-engineers can maintain the knowledge base, while engineers can integrate the capability into existing systems.

5. Agent, MCP, and Workflow Extensions

Recent RAGFlow updates already include Agentic workflow, MCP, Agent Memory, and code execution components. That suggests it is no longer limited to traditional knowledge-base Q&A and is also moving toward agent-oriented scenarios.

A typical pattern is that an agent can use RAGFlow as a reliable enterprise knowledge layer: retrieve from the knowledge base when it needs context, generate answers with citations, and combine that with tools or workflow steps when necessary.

03 Basic Usage Flow

According to the official quickstart documentation, the common usage path for RAGFlow can be summarized in the following steps.

1. Prepare the Environment

The basic requirements listed in the official README are:

CPU >= 4 cores
RAM >= 16 GB
Disk >= 50 GB
Docker >= 24.0.0
Docker Compose >= v2.26.1

If you want to use the sandbox for the code executor, you also need gVisor. Another practical note is that the official Docker images mainly target x86 platforms. For ARM64, the project documentation recommends building the image yourself.

2. Clone the Project

1
2

git clone https://github.com/infiniflow/ragflow.git
cd ragflow/docker

3. Check `vm.max_map_count`

RAGFlow deployment depends on components such as Elasticsearch or OpenSearch, so on Linux you usually need to verify:

`1`	`sysctl vm.max_map_count`

If the value is below 262144, you can set it temporarily:

`1`	`sudo sysctl -w vm.max_map_count=262144`

If you want the change to persist after reboot, add it to /etc/sysctl.conf.

4. Start with Docker Compose

You can start the CPU mode directly:

`1`	`docker compose -f docker-compose.yml up -d`

If you want GPU acceleration for DeepDoc tasks, the README shows enabling DEVICE=gpu in .env before startup:

1
2

sed -i '1i DEVICE=gpu' .env
docker compose -f docker-compose.yml up -d

Then inspect the logs:

`1`	`docker logs -f docker-ragflow-cpu-1`

Once the services are ready, open the machine address in your browser. Under the default configuration, that is typically:

`1`	`http://IP_OF_YOUR_MACHINE`

5. Configure Model API Keys

RAGFlow needs LLM and embedding model configuration. The README mentions choosing the default LLM factory in service_conf.yaml.template and updating the corresponding API_KEY.

In practice, you need to configure models according to your provider:

Chat model
Embedding model
Rerank model
Multimodal model, if you want to understand images inside PDFs or DOCX files

6. Create the Knowledge Base and Upload Documents

After the service starts, the typical workflow is:

Log in to the Web UI.
Create a dataset or knowledge base.
Upload documents or configure a data source sync.
Wait for parsing to finish.
Inspect chunk results and adjust them when necessary.
Create a chat assistant and attach the knowledge base.
Test answer quality and citation sources.

If you need to integrate with a business system, you can continue with the RAGFlow API or SDK and connect retrieval and chat capabilities to your own application.

04 Suitable Scenarios

RAGFlow fits these kinds of needs:

Enterprise internal knowledge-base Q&A
Product manuals, technical documentation, and FAQ retrieval
Customer support and pre-sales assistants
Traceable Q&A over contracts, reports, and policy documents
Unified handling of multi-format materials
Teams that want both UI-based maintenance and API integration
Systems that want to use RAG as the context layer for agents

It is especially suitable when document formats are complex, citations matter, and people want to inspect or intervene in parsing results.

05 What to Watch Out For

First, RAGFlow is not a lightweight script. It has real infrastructure requirements. The official recommendation is at least 4 CPU cores, 16 GB RAM, and 50 GB disk. If you only want Q&A over a small amount of Markdown, a full platform may be unnecessary.

Second, document quality still matters. RAGFlow can improve parsing and chunking, but it cannot magically make low-quality, outdated, or contradictory source material reliable. Knowledge-base governance still matters before production.

Third, model selection directly affects quality. Embedding, rerank, chat, and multimodal model choices all influence retrieval and answer quality. RAGFlow gives you the workflow, but the final result still depends on data, models, and tuning.

Fourth, production deployments need careful attention to permissions and data security. Enterprise knowledge bases often contain internal documents, so deployment model, access control, logs, API keys, and model-provider data policy all need to be designed in advance.

06 Quick Take

RAGFlow’s strength is that it turns the hardest parts of RAG into platform capabilities: complex document parsing, explainable chunking, citation grounding, multi-path retrieval, reranking, model configuration, Web UI, API access, and agent extensions.

If what you need is a verifiable, maintainable enterprise knowledge base that can connect to business systems, RAGFlow is more complete than a “vector database plus a simple chat UI” setup. On the other hand, if you only need small-scale personal Q&A over simple data, a lighter RAG framework may be more resource-efficient.

GitHub project: https://github.com/infiniflow/ragflow
Official docs: https://ragflow.io/docs/dev/
Online demo: https://cloud.ragflow.io

Firecrawl Project Notes: Web Search, Scraping, and Interaction APIs for AI Agents

Wed, 15 Apr 2026 13:45:03 +0800

Firecrawl has a clear purpose: turning web pages into data that AI agents can consume more easily. It is not just a crawler script. It wraps search, single-page scraping, site crawling, page interaction, structured extraction, and agent workflows into APIs, so models and automation systems can spend less effort dealing with web noise.

01 What It Solves

Many AI applications need to read web pages, but real websites are messy: JavaScript-rendered content, pop-ups, pagination, login state, anti-bot defenses, PDFs or DOCX files, and plenty of navigation, ads, scripts, and styling that have nothing to do with the main content.

Firecrawl tries to solve this middle-layer problem. The application asks for data from a page, a site, or a topic; Firecrawl handles opening, scraping, cleaning, and returning output in formats that are easier for LLMs to use, such as Markdown, HTML, screenshots, or JSON.

The value of this kind of tool is not merely whether it can request a URL. The real question is whether it can reliably turn complex pages into usable data. For RAG, AI search, competitive research, automated information gathering, and web content monitoring, this layer often becomes the unpleasant plumbing in the system.

02 Core Features

The Firecrawl README groups its capabilities into several areas:

Search: Search the web and return full page content from the results.
Scrape: Convert a single URL into Markdown, HTML, screenshots, or structured JSON.
Interact: Scrape a page, then use prompts or code to click, scroll, type, wait, and perform other actions.
Agent: Describe what you want, and let the agent search, navigate, and return the result.
Crawl: Scrape multiple pages under a website.
Map: Quickly discover URLs on a website.
Batch Scrape: Asynchronously scrape large batches of URLs.

At first glance, it looks like a scraping service. But as a full set of features, it is closer to a data entry point for AI applications: search discovers sources, scraping cleans content, interaction handles dynamic pages, and Agent pushes the whole “find information” task further toward automation.

03 Why It Fits AI Agents

Traditional crawlers usually assume that you already know the URL and understand the page structure. Agent workflows are often different. A user might simply ask, “Find the differences between the latest pricing plans on a company’s pricing page.” The system then has to search, open pages, compare content, and return sources.

Firecrawl’s Agent endpoint is designed for this kind of task. It can accept only a natural-language prompt, or it can be constrained to specific URLs. If structured results are needed, it can also work with a schema to return fixed fields.

This gives the application layer two benefits:

You do not have to write a separate parser for every website.
The returned result is easier to send into an LLM, a database, or a downstream automation flow.

Of course, this does not mean it replaces every custom crawler. For highly constrained, high-frequency, large-scale tasks with very stable fields, writing dedicated parsing logic may still be cheaper and easier to control. Firecrawl is a better fit when sources are scattered, page structures change often, and you want to connect web data to an AI workflow quickly.

04 MCP, CLI, and Integrations

Firecrawl is also clearly moving toward the agent tooling ecosystem. The README provides MCP Server setup, along with Skill/CLI initialization commands for AI coding agents.

This means it is not only intended for backend API calls. It also wants to plug directly into Claude Code, OpenCode, Antigravity, MCP clients, and similar workflows. For people who frequently ask agents to research, scrape, and organize web content, this kind of integration is lighter than hand-writing API calls.

It also lists integrations with platforms such as Zapier, n8n, and Lovable. That direction is practical: web data does not always go into code. It may flow into automation tables, low-code workflows, content systems, or internal knowledge bases.

05 Open Source, Self-Hosting, and Licensing

Firecrawl is open source. The main repository is primarily licensed under AGPL-3.0; the README also notes that SDKs and some UI components use the MIT license, with details depending on the LICENSE files in each directory.

This matters. If you only use the cloud service, the main concerns are API cost, reliability, and compliance boundaries. If you plan to self-host it and provide a service to others, the obligations of AGPL-3.0 need careful review.

The README also reminds users to respect website policies, privacy policies, and terms of use, and says that Firecrawl respects robots.txt by default. The stronger this type of tool becomes, the more important it is to design compliance and scraping boundaries into the system instead of patching them in after launch.

06 Suitable Use Cases

I would consider Firecrawl first in these scenarios:

Scraping web content for a RAG system and wanting clean Markdown directly.
Building AI search or research assistants that need to read full pages after search.
Scraping JavaScript-heavy sites without maintaining a browser cluster yourself.
Monitoring public information such as competitors, pricing, documentation, news, and job pages.
Giving MCP clients or AI coding agents real-time web reading ability.
Quickly validating a web-data product before building crawler infrastructure.

The less suitable cases are also clear:

The target site has very few fields, a stable structure, and can be handled by a simple script.
The scraping volume is huge, and cost sensitivity matters more than development and maintenance cost.
The business needs very fine control over sources, retry strategy, anti-bot behavior, and audit trails.
Licensing or compliance requirements do not allow AGPL components or external cloud services.

07 Quick Take

Firecrawl’s core value is productizing the messy path from “web page” to “AI-usable data.” It puts search, scraping, cleaning, interaction, batch processing, and agent-style research into one interface, which is convenient for AI application developers.

If your project often needs models to read real web pages, especially when sources are scattered, structures are unstable, and MCP or agent workflows are involved, Firecrawl is worth keeping in the toolbox. If the task is just low-cost bulk collection from fixed websites, a traditional crawler or dedicated parser may still be the better choice.

GitHub project: https://github.com/firecrawl/firecrawl

What Is OpenHarness: What This Open Source Agent Harness Can Do

Sun, 12 Apr 2026 23:45:00 +0800

If you have been following open source AI agent tools lately, HKUDS/OpenHarness is a project worth watching. It is not just another chat wrapper. Instead, it pulls the infrastructure layer for a runnable, extensible, and governable agent into a standalone open source Agent Harness.

According to the official README, OpenHarness provides a lightweight but fairly complete set of agent capabilities, including tool calling, skill loading, memory, permission governance, and multi-agent coordination. The bundled ohmo is the personal AI assistant application built on top of that foundation.

01 What Is OpenHarness

You can think of OpenHarness as the runtime layer that gives a foundation model hands, memory, and boundaries.

A model may already be good at reasoning and generation, but if you want it to function as a long-running agent, it usually still needs these surrounding capabilities:

Calling tools instead of only producing text
Reading and writing files, executing commands, and using search and web access
Preserving context and memory across long sessions
Applying permission controls to risky actions
Splitting larger tasks across multiple sub-agents in parallel

The goal of OpenHarness is to turn that engineering layer around the model into a clear, open source, inspectable Python implementation. It is closer to an agent operating substrate than to a single model experience or a single chat interface.

02 The Project’s Basic Functions

Based on the current GitHub homepage and README, OpenHarness centers on the following capability areas.

1. Agent Loop

This is the core execution loop that lets an agent keep working over multiple steps. The official highlights include:

Streaming tool-calling loops
API retries with exponential backoff
Parallel tool execution
Token accounting and cost tracking

The practical point is that the agent is not limited to a one-shot response. It can observe, reason, call tools, read results, and continue iterating within the same task.

2. Tools, Skills, and Plugins

OpenHarness puts serious effort into the tool layer. The project page says it already includes built-in tools for files, Shell, search, web access, and MCP, and it supports on-demand loading of Markdown skill files.

Its value is not only that it has many tools, but that the composition model is fairly open:

You can use built-in tools directly
You can load skills for a specific task
You can extend hooks, skills, and agents through plugins
It is compatible with the anthropics/skills ecosystem and related plugins

If you want to turn repeated workflows into reusable capabilities rather than re-describing them in prompts every time, this layer is especially useful.

3. Context and Memory

This is one of the more important differentiators in OpenHarness. The official keywords include:

CLAUDE.md discovery and injection
Automatic context compression
Persistent memory through MEMORY.md
Session recovery and history continuation

That means it is not only reacting to the current input. It is designed to preserve project conventions, historical tasks, and long-term preferences, making the agent better suited for ongoing work instead of always starting from scratch.

4. Permission Governance and Safety Boundaries

Once an agent starts interacting with the filesystem, terminal, and network, governance becomes critical. OpenHarness provides:

Multiple permission modes
Rule controls based on paths and commands
PreToolUse / PostToolUse hooks
Interactive approval prompts

In other words, it is not only about enabling the agent to do things. It also defines which things can be done directly and which ones should require confirmation first.

5. Multi-Agent Coordination

OpenHarness also supports delegating work to sub-agents. The currently public materials mention capabilities such as:

Sub-agent creation and delegation
Team registration and task management
Background task lifecycle management

For more complex work, this means it can move beyond a single serial agent and attempt parallel collaboration.

6. Multi-Provider Workflows

OpenHarness does not treat providers as mere API labels. It abstracts them as workflow + profile combinations. According to the README, current directions include:

Claude / Anthropic-compatible
OpenAI-compatible
Codex Subscription
GitHub Copilot
Compatible backends such as Moonshot(Kimi), GLM, and MiniMax

That makes it feel more like a multi-model, multi-entry agent runtime framework rather than something tied to a single vendor.

7. React TUI and Non-Interactive Mode

OpenHarness ships with a terminal UI. Running oh opens a React/Ink TUI, and the official README says it supports:

A command picker
Permission confirmation
Model switching
Provider switching
Session recovery

If you do not want to enter an interactive interface, you can also use non-interactive mode to run a task once and return the result as standard output, JSON, or streaming JSON, which is helpful for scripting and automation.

03 What Is `ohmo`

If OpenHarness is the infrastructure layer, ohmo is the personal agent application built on top of it.

The project homepage is very clear about its positioning: it is not just a generic chatbot, but a personal assistant that can keep working across long conversations. The official description says it can interact with you through channels such as Feishu, Slack, Telegram, and Discord, and carry out tasks like:

forking a branch
writing code
running tests
opening a PR

The README also highlights that ohmo can run on top of your existing Claude Code or Codex subscription, so it does not necessarily require you to provision a new API key. For people already using those subscriptions, that lowers the barrier considerably.

04 What Scenarios It Fits

From the currently public capabilities, OpenHarness is a strong fit for people who:

Want to study what a production-grade agent is actually made of
Want to build an extensible open source agent runtime of their own
Want tools, skills, memory, permissions, and multi-agent coordination in one framework
Do not want to be locked into a single model vendor or client form factor
Want to build vertical agents or personal assistants on top of an existing architecture

If your goal is simply to find a finished assistant that can chat right away, OpenHarness itself may not be the lightest option. But if you care more about agent infrastructure, engineering control, and long-term extensibility, it is a very worthwhile project to study.

05 A Quick Way to Understand Its Positioning

In one sentence:

OpenHarness turns foundation models into agents that can actually execute work, while ohmo packages that capability into a personal assistant that can keep working with you over time.

You can also think of it as two layers:

OpenHarness: an open source Agent Harness, essentially the infrastructure layer
ohmo: a personal-agent app built on top of that infrastructure

As of April 12, 2026, the GitHub homepage shows the project had already advanced to v0.1.6 (April 10, 2026), with continued emphasis on automatic context compression, MCP transport support, the React TUI, and runtime stability for multi-agent workflows. That suggests it is still evolving quickly, but its direction is already quite clear.

References

GitHub repository: https://github.com/HKUDS/OpenHarness
English README: https://github.com/HKUDS/OpenHarness/blob/main/README.md
Chinese README: https://github.com/HKUDS/OpenHarness/blob/main/README.zh-CN.md

Getting Started with Playwright CLI: Installation, Skills, Sessions, and Essential Commands

Sun, 12 Apr 2026 14:36:58 +0800

If you have been using Claude Code, GitHub Copilot, or other coding agents for browser automation, microsoft/playwright-cli is a tool worth watching. It is not the traditional kind of browser helper meant mainly for humans typing commands by hand. Instead, it is a Playwright CLI designed for coding agents, with an emphasis on lower token overhead, a lighter command interface, and integration with Skills-based workflows.

From the official README, the core idea behind Playwright CLI is very clear: compared with MCP, which can push large tool schemas and page structure into the model context, the CLI approach is more compact and better suited for agent workflows that constantly switch between large codebases, tests, and browser automation.

01 What Playwright CLI is

playwright-cli is an open-source Playwright command-line tool from Microsoft. The official description is “CLI for common Playwright actions.” It is mainly used for tasks like these:

Opening pages and driving the browser
Recording and generating Playwright code
Capturing page snapshots to get element references
Taking screenshots and exporting PDFs
Working with coding agents for test automation and web interaction

The current GitHub README is very explicit about its positioning: if you are using coding agents, the CLI is often a better fit than Playwright MCP; if you need persistent state, richer introspection, and longer-running agentic loops, MCP still has its place.

In other words, Playwright CLI feels more like a browser automation interface built for AI coding assistants, not just a tool for engineers to click around manually.

02 Where it stands out

1. It fits agent workflows better

The official README lists Token-efficient as a key feature. It does not force full-page data into the LLM context. Instead, it lets the agent operate the browser through shorter and more focused commands.

That matters a lot for coding agents. In real projects, an agent is not only driving the browser. It also has to read code, edit files, run tests, and inspect logs. If the browser interface itself consumes too much context, the overall workflow becomes less efficient.

2. It works well with Skills

The README specifically highlights playwright-cli install --skills. That shows Microsoft is not treating it as just another shell utility, but as something that can be consumed directly by Claude Code, GitHub Copilot, and similar agents through a Skills-based workflow.

If your setup already relies on Skills, Playwright CLI should slot in naturally.

3. Session management is fairly complete

Playwright CLI supports sessions. By default, the browser profile stays in memory, so cookies and storage state are preserved across multiple CLI calls within the same session. If you add --persistent, the profile can be saved to disk and reused across browser restarts.

That makes it much more practical than tools that simply open a browser for one command and then throw everything away. It is also a better fit for long debugging cycles and longer-running agent flows.

4. It includes a visual monitoring dashboard

The README provides playwright-cli show, which opens a dashboard for observing and controlling all running browser sessions. This is especially useful when an agent is running automation in the background, because you can step in, inspect progress, and help with debugging instead of flying blind.

03 Installation and requirements

According to the current GitHub README, the basic requirements for Playwright CLI are:

Node.js 18 or newer
Claude Code, GitHub Copilot, or another coding agent

The installation commands are:

1
2

npm install -g @playwright/cli@latest
playwright-cli --help

There is one easy mistake worth calling out:

The officially recommended package right now is @playwright/cli
Do not confuse it with the old deprecated npm package playwright-cli

So the package you actually want is the scoped package, not the older historical one.

04 How to start using it

1. Install skills

If you want a coding agent to use Playwright CLI directly, the official recommendation is to install the skills first:

`1`	`playwright-cli install --skills`

The README explicitly says that Claude Code, GitHub Copilot, and similar tools will use the locally installed skills.

2. Let the agent call the CLI directly

If you do not want to handle Skills first, you can also let the agent read the CLI help output directly:

1
2

Test the "add todo" flow on https://demo.playwright.dev/todomvc using playwright-cli.
Check playwright-cli --help for available commands.

The README calls this “Skills-less operation.” The idea is that even without preinstalled skills, the CLI can still describe itself well enough for an agent to use it.

3. Try a minimal flow manually

The README includes a TodoMVC example that works very well as a first hands-on demo:

playwright-cli open https://demo.playwright.dev/todomvc/ --headed
playwright-cli type "Buy groceries"
playwright-cli press Enter
playwright-cli type "Water flowers"
playwright-cli press Enter
playwright-cli check e21
playwright-cli check e35
playwright-cli screenshot

This sequence is useful because it quickly shows how Playwright CLI works in practice:

open opens the page
type and press handle text input
check uses an element reference to toggle checkboxes
screenshot saves the result

05 `--headed`, sessions, and the monitoring dashboard

`--headed`

Playwright CLI is headless by default. If you want to see the browser window directly, you need to pass --headed when using open:

`1`	`playwright-cli open https://playwright.dev --headed`

This is especially helpful when debugging selectors, login flows, or any interaction that is easier to inspect visually.

sessions

The official README places a lot of emphasis on sessions. You can use different sessions to isolate different projects or sites:

1
2
3

playwright-cli open https://playwright.dev
playwright-cli -s=example open https://example.com --persistent
playwright-cli list

If you are letting an agent run over a longer period, you can also pass the session through an environment variable:

`1`	`PLAYWRIGHT_CLI_SESSION=todo-app claude .`

Useful session management commands include:

1
2
3

playwright-cli list
playwright-cli close-all
playwright-cli kill-all

In practice:

list shows all sessions
close-all closes all browsers gracefully
kill-all forcefully terminates all browser processes

Monitoring dashboard

If you want to see what the agent is actually doing in the browser, you can run:

`1`	`playwright-cli show`

According to the README, this dashboard has two main views:

Session grid: shows active sessions by workspace, with live preview, URL, and page title
Session detail: shows a live view of a selected session and lets you take over mouse and keyboard input

That means Playwright CLI is not only usable from the command line. It also has a fairly mature observability layer.

06 Which commands are worth memorizing first

If this is your first time using Playwright CLI, you do not need to memorize every command up front. These are the core ones worth learning first:

Pages and interaction

playwright-cli open [url]
playwright-cli goto <url>
playwright-cli click <ref>
playwright-cli fill <ref> <text>
playwright-cli type <text>
playwright-cli hover <ref>
playwright-cli press <key>

Getting page structure

playwright-cli snapshot
playwright-cli snapshot <ref>
playwright-cli snapshot --depth=N
playwright-cli eval <func> [ref]

snapshot is especially important because many later operations depend on element references stored as ref. In practice, you usually capture a snapshot first, then use the returned element identifiers for clicking, filling, checking, or taking screenshots.

Saving output

1
2

playwright-cli screenshot
playwright-cli pdf

Tabs

playwright-cli tab-list
playwright-cli tab-new [url]
playwright-cli tab-close [index]
playwright-cli tab-select <index>

07 Who should try it

Playwright CLI is especially worth trying in these kinds of scenarios:

You are using Claude Code, Copilot, or another coding agent for E2E testing
You want a lighter browser automation interface without pushing large page structures into model context
You want one browser session to persist across multiple commands
You want to monitor agent-driven web tasks through a dashboard while they run

If your main question is how to make browser automation work efficiently with coding agents, Playwright CLI will likely feel more natural than traditional manual debugging workflows.

References

What Is Hermes Agent: Overview, Strengths, Getting Started, and How It Compares to OpenClaw

Sun, 12 Apr 2026 14:07:58 +0800

If you have been following open-source AI agents lately, Hermes Agent is a project worth paying attention to. Built by Nous Research, its main appeal is not simply that it is “another chat wrapper,” but that it tries to bring long-term memory, reusable skills, context files, MCP extensions, a messaging gateway, and parallel sub-agents into one unified agent runtime.

Based on the official README, Hermes Agent has a very clear goal: it can work like a local CLI assistant in your terminal, or like a cloud-hosted personal assistant that stays available through Telegram, Discord, Slack, WhatsApp, Signal, and other channels. For users who want to combine a coding assistant, an automation assistant, and a personal AI workspace into one system, that positioning is compelling.

01 An overview of Hermes Agent

Hermes Agent is an open-source self-improving AI agent from Nous Research. It supports multiple model providers, including Nous Portal, OpenRouter, OpenAI, and custom OpenAI-compatible endpoints. It can also run across different execution backends such as a local terminal, Docker, SSH, Daytona, and Modal.

What separates Hermes from many “tool-using chatbots” is that it does not focus only on tool calls within a single session. It puts much more emphasis on building persistent capability across sessions. The official docs break this idea down into several parts:

Persistent memory: stores key information about the environment, project, and user preferences through MEMORY.md and USER.md.
Skills system: turns successful workflows into reusable skills that can be loaded on demand.
Context files: automatically reads files such as AGENTS.md, SOUL.md, and .cursorrules to inject project conventions directly into the session.
MCP integration: can connect to any MCP-compatible tool server to extend database, GitHub, filesystem, and scraping capabilities.
Messaging gateway: beyond the CLI, it can also be used through Telegram, Discord, Slack, WhatsApp, Signal, Email, and other entry points.

In one sentence, Hermes Agent feels more like a general-purpose agent operating layer with memory, skills, extensibility, and multi-channel access.

02 Where it stands out

1. It covers both CLI workflows and messaging workflows

Many agent projects lean either toward terminal-based developer assistance or toward chat-platform bots. Hermes tries to combine both. You can run hermes directly in the terminal, or continue with the same assistant through Telegram or Discord after starting the gateway.

The practical benefit is that Hermes is not limited to being useful only when you are sitting in front of your computer. If you deploy it to the cloud or a VPS, it can become a continuously available personal AI assistant.

2. It is designed for long-term use

Hermes does more than chat and call tools. It is also built around long-term accumulation:

Persistent memory with boundaries, instead of endlessly stuffing more context into each conversation.
A skills system that lets you save and reuse successful workflows.
Search across past sessions for retrieval and recall.
Project context files that reduce the need to repeatedly explain the same background.

This matters a lot for people who work repeatedly inside the same repositories, workflows, and team conventions. It means the agent is not just helping once; it can gradually become more familiar with your environment.

3. MCP support gives it strong extensibility

The Hermes documentation explicitly supports MCP and describes both stdio and HTTP integration modes. In practice, that means if an external system already has an MCP server, Hermes can usually connect to it with much lower integration cost.

That is more flexible than writing a custom plugin for every single system. For users who already have tools built around the MCP ecosystem, Hermes should be much easier to extend.

4. It is friendly to OpenClaw users

This part is especially interesting. The Hermes README directly provides hermes claw migrate, and explicitly says it can import configuration, memory, skills, API keys, and messaging platform settings from OpenClaw.

That suggests Hermes is not trying to ignore the existing ecosystem and start from zero. It is clearly positioning some OpenClaw users as a migration audience.

03 How to get started quickly

The officially recommended Hermes Agent installation method is very straightforward:

`1`	`curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh \| bash`

According to the official README, it supports Linux, macOS, WSL2, and Android Termux. One important note is that native Windows is explicitly not supported right now, so Windows users are advised to use WSL2.

After installation, you would usually refresh your shell first:

`1`	`source ~/.bashrc`

Then you can launch it directly:

hermes

If you want to go through a more complete step-by-step initialization flow, the easiest command is:

`1`	`hermes setup`

Based on the official documentation and README, a simple first-time setup path looks like this:

Run hermes setup to finish the base configuration.
Use hermes model to choose a model provider and model.
Use hermes tools to enable the toolsets you want.
Run hermes to enter the interactive CLI.
If you want channels such as Telegram or Discord, continue with hermes gateway.

If you are already an OpenClaw user, it is also worth previewing the migration command:

`1`	`hermes claw migrate --dry-run`

That lets you inspect what can be migrated before doing a real import.

04 How to think about it versus OpenClaw

From the official docs and README, Hermes Agent and OpenClaw are not simply a case of one replacing the other. Their positioning overlaps, but their priorities are clearly different.

What Hermes Agent feels like

Hermes feels more like a product centered on an agent core and workflow system. It emphasizes:

CLI experience
Memory and skill accumulation
Project context files
MCP extensibility
Parallel sub-agents
Switching execution backends across local, container, remote, and serverless environments

If your main goal is to make the agent understand your project better, reuse capabilities over time, and connect more naturally into MCP and developer workflows, Hermes is likely the better fit.

What OpenClaw feels like

OpenClaw feels more like a platform centered on a personal AI assistant plus a messaging gateway. It emphasizes:

Rich messaging channel integration
A continuously running Gateway
A browser-based Control UI
Device pairing, remote access, and status management
Stronger assistant-oriented surfaces such as voice, mobile access, and Canvas

If your main goal is to keep a personal AI assistant reliably available across multiple chat channels and devices, with a control panel to manage it, OpenClaw has a stronger product feel in that direction.

A more practical rule of thumb

You can roughly think of the two like this:

Hermes Agent: more of a “growing general-purpose agent workspace”
OpenClaw: more of a “multi-channel always-on personal AI assistant platform”

That distinction is not absolute, because both projects are still expanding and Hermes also offers a migration path from OpenClaw. But based on the currently public material, Hermes is more prominent on the memory, skills, context, MCP, and developer-workflow side, while OpenClaw looks more mature on the gateway, multi-channel, Control UI, and device-access side.

05 Who should try it

Hermes Agent is especially worth trying first if you fit one of these profiles:

You already rely heavily on AI tools in the terminal and want an agent that better understands your codebase and project rules.
You want to combine AGENTS.md, skills, memory, and MCP into one workflow.
You do not want to be locked into a single model vendor and prefer flexible provider switching.
You already use OpenClaw and want to explore a direction that is more centered on agent workflows.

If you care more about mobile reach, broad IM platform integration, a browser control console, and the feeling of an always-online personal assistant, OpenClaw still has a lot of appeal.

References

Hermes Agent GitHub: https://github.com/NousResearch/hermes-agent
Hermes Agent Docs: https://hermes-agent.nousresearch.com/docs/
Hermes Features Overview: https://hermes-agent.nousresearch.com/docs/user-guide/features/overview
Hermes MCP: https://hermes-agent.nousresearch.com/docs/user-guide/features/mcp/
OpenClaw GitHub: https://github.com/openclaw/openclaw
OpenClaw Getting Started: https://docs.openclaw.ai/start/quickstart
OpenClaw Control UI: https://docs.openclaw.ai/web/control-ui

OpenClaw Dreaming: Machines Start Dreaming While Humans Lose Sleep

Sun, 12 Apr 2026 12:41:34 +0800

Long-term memory has always been a weak point for large models. As context grows, memory becomes harder to manage. An agent may appear to remember everything, yet become worse at judging what matters and what should be forgotten.

On April 5, OpenClaw introduced an experimental feature called Dreaming. It is not just a catchy label. It is a background memory-management system modeled on human sleep, designed to help agents wake up with cleaner and more useful memory.

01 A sleep-based pipeline for memory consolidation

Dreaming does more than index data. It breaks memory processing into three stages that mirror different functions of human sleep.

Light Sleep: the system scans recent conversations and retrieval traces, removes duplication, and builds a candidate list. At this stage, it only buffers information and does not modify the core memory file MEMORY.md.

Deep Sleep: the system applies stricter filters to identify durable information. Only entries that pass thresholds for score, recall count, and distinct query count move forward. Before writing anything, it checks the latest logs again to remove stale content. The final result is appended to MEMORY.md, while a deep-sleep summary is written to DREAMS.md.

REM: after memory consolidation, the system looks for hidden links across recent behavior traces. It extracts patterns and reflective summaries, then stores them in a dedicated REM section to help the agent respond with better structure and broader context.

Dreaming also produces a human-readable dream journal. Once enough material accumulates, a background sub-agent calls the default model and appends a short natural-language entry to DREAMS.md.

02 A scoring system for deciding what deserves to stay

The real point of Dreaming is not just organizing memory, but filtering it. Instead of keeping everything, OpenClaw uses a weighted scoring model to decide what belongs in long-term storage.

The six dimensions are:

Relevance (30%): how useful the information is when retrieved.
Frequency (24%): how often the item appears in short-term signals.
Query diversity (15%): whether it shows up across different prompts and contexts.
Recency (15%): whether the information is still fresh and actionable.
Integration (10%): whether it remains stable across multiple days.
Concept richness (6%): how dense and connected its concept graph is.

In practice, this means the system tries to keep information that is repeated, useful, current, and broadly applicable, while letting lower-value noise fade away.

03 Why it reminds people of Claude’s “dreaming” approach

Some developers have noted that Dreaming resembles the automated dreaming logic described in leaked Claude Code material around the KAIROS system. Older approaches that repeatedly rewrote the entire MEMORY.md could become messy over time. By splitting the flow into light sleep, deep sleep, and REM, Dreaming makes the pipeline more explicit: consolidate first, preserve next, and derive higher-level patterns last.

Others have highlighted the neuroscience angle. Terms like Dreaming, Light Sleep, Deep Sleep, and REM are not random branding. They directly borrow from human models of sleep-based memory consolidation.

OpenClaw already uses files like IDENTITY.md, USER.md, and HEARTBEAT.md to preserve identity, user context, and continuity. DREAMS.md fills in the missing piece: deciding which memories are actually worth keeping.

04 The most ironic part: machines dream, humans stay awake

The value of Dreaming is not that AI remembers everything. It is that AI learns to review short-term traces, extract patterns, and discard noise. A strong agent should not behave like a dumb storage device. It should become better over time at understanding a user’s preferences, recurring goals, and long-term context.

From an engineering perspective, the most interesting part is that the system is not presented as a mystical black box. It is a structured backend process with stages, thresholds, reflection, and forgetting rules. That makes AI memory feel less like uncontrolled context bloat and more like a designed system.

That is also what makes the whole thing feel ironic. We are spending enormous effort teaching machines how to dream, while many people are losing sleep over being replaced by those same increasingly capable systems.

Drop MCP? Why CLI Is Becoming the Default Tool Layer for Agents

Fri, 10 Apr 2026 21:55:12 +0800

Over the last year, debates about agent toolchains have increasingly centered on one question:

Does MCP (Model Context Protocol) make tool calling simpler, or does it make simple tasks more complex?

For most day-to-day engineering tasks, CLI is becoming the more practical default.

Cost gap is not a UX issue, but an order-of-magnitude issue

The biggest practical pressure in MCP is token overhead.

In common scenarios, MCP often has to load large tool schemas before actual execution. Using a GitHub MCP Server as an example, initialization alone can consume tens of thousands of tokens. For long tasks, this directly squeezes context budget.

Community benchmarks keep pointing to the same conclusion:

Single MCP calls commonly cost several to dozens of times more than CLI
Retry recovery is also more expensive (reconnect plus context reload)

This is not just “a little slower.” It scales into API cost, latency, and reliability issues.

Why models are naturally better at CLI

A frequently overlooked fact is training distribution.

LLMs have seen massive amounts of terminal text during training: commands, outputs, errors, scripts, and man pages. In other words, CLI interaction is already close to the model’s native input pattern.

By contrast, MCP’s JSON-RPC and tool schema style became widespread only in recent years. Models can learn it, but familiarity and compression efficiency are often still weaker than long-established CLI patterns.

That also explains why, in many cases:

for the same goal, CLI commands are shorter
outputs are easier to continue reasoning over
error recovery paths are more stable

Security and isolation: MCP still has catching up to do

MCP is not incapable of security, but its ecosystem is still early.

Common concerns today include:

Tool Poisoning in descriptions
behavior drift (Rug Pull)
same-name tool override (Shadowing)

CLI also has security risks (injection, privilege misuse, path risks), but its process model, permission boundaries, and audit chain have been validated through decades of engineering practice. In production, that predictability matters.

This does not mean MCP has no value

I do not think MCP should be abandoned.

A more reasonable positioning is:

CLI handles the execution layer (local, low-latency, high-frequency calls)
MCP handles the connection layer (remote service discovery, unified auth, audit, and multitenancy)

That is the commonly discussed hybrid architecture: CLI + MCP Gateway.

When integrating many remote systems and enforcing unified governance and compliance, MCP still has clear value. But for helping agents complete engineering work quickly, CLI-first usually better matches current model capability boundaries.

In today’s engineering reality, CLI is closer to an agent’s working native language; MCP is better positioned as a connection protocol rather than the only execution protocol.

OpenClaw and Agent Harness: Why It Looks Like AGI

Fri, 10 Apr 2026 09:16:17 +0800

When many people first try OpenClaw, it feels more like a teammate who can get work done than a chatbot.

That feeling is not mysterious. The key is this: OpenClaw is not a jump in one model capability; it is a complete Agent Harness.

Core Conclusion

The essence of OpenClaw can be summarized as:

the model handles understanding and decisions
the harness handles memory, tools, triggers, execution, and outputs
the two collaborate through a loop to create continuous action

So the core reason it “feels like AGI” is not that the model suddenly became all-powerful, but that systems engineering amplifies what the model can execute.

What Is a Harness

You can think of a harness as an exoskeleton for the model.

A standalone LLM usually provides an answer in a single request. A harness adds these capabilities:

session and state management: link multi-turn tasks
memory mechanisms: store and retrieve context when needed
tool system: call browsers, terminals, files, and external APIs
trigger mechanisms: wake on timers or events instead of waiting for a human prompt every time
output channels: write results back to systems, not just return a paragraph

When these capabilities are connected in one loop, the model shifts from a responder to an executor.

Why OpenClaw Feels Different

A traditional chatbot is “ask once, answer once”.

OpenClaw is more like a closed loop of “observe -> use tools -> inspect results -> decide next”. Once this loop is established, the system can keep moving a task forward.

This is also the most valuable lesson from OpenClaw:

it proves the agent experience mainly comes from architecture design
it decomposes “autonomy” into modules that can be engineered

Value and Boundaries

OpenClaw is general and flexible, but the trade-offs are also clear:

the more context and tool definitions you include, the higher the cost
the more general the system is, the more complex debugging and governance become

In production scenarios, many teams choose smaller, more specialized agents instead of one universal agent.

Anthropic and OpenClaw Timeline: The Full Sequence of Events

Wed, 08 Apr 2026 19:48:42 +0800

Background

On April 4, 2026, Anthropic announced that Claude subscriptions would no longer cover third-party tools such as OpenClaw.

The direct user-level impact was that third-party workflows previously relying on the subscription path for Claude access had to move to alternative access methods or switch to other models.

Timeline (January to April 2026)

January 2026

According to public reports, Anthropic asked the project formerly known as Clawdbot to change its name, citing pronunciation similarity to Claude.

During the same period, community feedback began to appear regarding restrictions on third-party access via subscription credentials.

February 2026

The relevant restrictions were written into the terms of service, further clarifying the boundary between subscriptions and third-party automated invocation.

In the same month, OpenClaw released v4.0 and refactored its underlying architecture into a pluggable model backend. In other words, the model was no longer a single hardcoded entry point and could be switched across multiple providers.

March 2026

Anthropic released Claude Dispatch and Computer Use, covering capabilities such as remote task execution and desktop operation.

In subsequent updates, OpenClaw continued building its compatibility layer, unifying differences across model providers in authentication, tool-call formats, and response schemas, thereby reducing migration costs when switching models.

Public reports also noted that OpenClaw and Anthropic communicated in late March, but the overall strategic direction remained unchanged.

April 4, 2026

Anthropic formally executed the subscription coverage cutoff for third-party tools.

This marked the execution phase of policy adjustments that had been underway for several months.

April 5, 2026

OpenClaw released v4.5 with several main actions:

Reprioritizing model entry points in the onboarding flow
Integrating alternative model paths such as GPT-5.4
Continuing adaptation work for task flow and interaction experience

Based on the release timing, OpenClaw’s switchover capability was not built entirely ad hoc, but rested on the multi-model architecture work launched since February.

Two Parallel Directions in the Process

Viewed along the timeline, both parties advanced different priorities during the same period:

Anthropic: tightening subscription boundaries and integrating official product capabilities
OpenClaw: strengthening model replaceability and cross-model compatibility

These two routes are not inherently contradictory, but they do create competition over entry-point ownership and where user workflows accumulate.

Current Status (as of April 2026)

Based on publicly available information, the following can be confirmed:

The subscription coverage cutoff has been executed
OpenClaw has completed its primary model-path transition and continues iterating
Whether users perceive major changes depends on how strongly their workflows rely on any single model

What to Watch Next

Going forward, the more meaningful signals are not from this single event itself, but from three areas:

Whether boundaries between subscription plans and API usage become more explicit
The long-term performance of multi-model agents in stability, cost, and user experience
Whether user workflows settle primarily at the model layer, tool layer, or a hybrid layer between the two

AI Agent on KnightLi Blog

What Is OpenAI Symphony? Codex Orchestration, Issue-Driven Development, and AI Agent Workflows

Symphony is not solving code writing, but Agent management

Why an issue tracker?

Its core workflow

Goal-driven, not a rigid state machine

How is this different from normal Codex usage?

Where does it fit?

Risks and boundaries

My understanding of Symphony

References

How browser-harness domain skills keep AI agents from repeating browser automation mistakes

What domain skills are

They are not about blind clicking

They store site-level knowledge

Example: Amazon product search

Example: LinkedIn invitation management

Example: Shopify Admin

Example: Browser Use Cloud

Why this is more reliable than ad-hoc reasoning

How teams can use it

Boundaries to keep

Conclusion

browser-harness, Playwright, and Puppeteer: which browser automation tool should you choose?

The relationship between Playwright and Puppeteer

Core differences

Browser support

Auto-waiting and stability

Multi-account workflows and context isolation

Tooling differences

Where browser-harness fits

Three-way comparison

Code feel

How to choose

Conclusion

What is browser-harness? A browser automation tool that lets AI agents control real Chrome

What browser-harness is

How it differs from traditional browser automation

Why real Chrome matters

Editable helpers and domain skills

Suitable scenarios

Risks to watch

Why it matters for AI agent tools

Conclusion

GitHub AI Open Source Project Categories: From Coding Agent to RAG Knowledge Bases

Category Summary

AI Coding and Coding Agents

Agent Skills and Workflows

RAG, Knowledge Bases, and Memory

Multimodal and Content Creation

Local Models and Inference

Vertical Applications and Automation

AI Application Development Infrastructure

Google I/O 2026 Summary: Gemini 3.5, Omni, Antigravity, and System-Level Agents

One-Sentence Summary

Gemini 3.5 Flash: From Prompt to Action

Gemini Omni: Video and World-Model Capabilities

Gemini App: From Chat Assistant to Always-On Personal Agent

Antigravity 2.0: Developer Tools Become Agent-First

Gemini API Managed Agents: Hosting Agents as API Capabilities

Google AI Studio: From Prompt Playground to App Generation Entry Point

Android and AppFunctions: Key Interfaces for Mobile Agents

Search, Shopping, and Content Products Are Becoming Agentic Too

Practical Impact for Developers

Impact on Mobile Automation

Security, Permissions, and Auditing Become Hard Requirements

Summary

What Is PageIndex? A Reasoning-Based RAG Document Index Without Vector Databases

What Problem It Tries to Solve

The Basic PageIndex Workflow

How It Differs From Traditional Vector RAG

How to Run It Locally

Agentic Vectorless RAG Example

Cloud Service, MCP, and API

Suitable Scenarios

Things to Watch

Summary

Gemini 3.5 Is Here: Flash Leads as Google Focuses on Agents and Long-Running Tasks

Gemini 3.5 Flash Comes First

The Focus Is Agents and Coding