Codex Goal Deep Dive: A Goal-Driven Workflow That Lets AI Agents Work for Hours

Tue, 26 May 2026 23:44:37 +0800

AI Agents in browsers, terminals, and IDEs are getting better at writing code. But the real pain point for many developers is no longer “it cannot do the task.” It is “it stops halfway and says it is done.”

Small tickets are a good fit for coding agents: fix a button, add an endpoint, change a short piece of copy, or add one test. The target is clear, the boundary is small, and verification is straightforward. But once the task becomes a large migration, a cross-module refactor, a failing test suite, a dependency upgrade, or prompt eval optimization, an Agent can easily fall into a familiar pattern:

It reaches a plausible intermediate state, then stops too early.

Codex Goal / Persistent Goals are meant to address exactly this premature-stop problem. The point is not to make the Agent run more rounds. The point is to make the Agent keep moving toward a clear objective until verifiable completion criteria are met.

Codex Goal Is About Stop Conditions, Not Loops

Many long-running automation attempts start with a rough instruction:

`Keep checking the code and fixing issues until there are no errors.`

Or in a more mechanical form:

Loop 10 times:
1. Run tests
2. Ask the model to fix failures
3. Run tests again

This kind of rough loop can keep an Agent busy for longer, but it has two serious problems:

It does not know when it should truly stop.
It does not know whether “no more visible errors” is the same as “the task is complete.”

The key to Codex Goal is not the number of iterations. It is the combination of a goal, tracked state, and a stop condition evaluated by a judge. The Agent needs to know what it is trying to achieve, how far it has progressed, and what evidence proves the task is actually finished.

That is the real dividing line for long-running Agents: not “can it take more steps,” but “can it tell what is still missing.”
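A minimal sketch of the difference, assuming nothing about how Codex implements goals internally: the loop below exits on evidence (every criterion passing) rather than on an iteration count, and the attempt budget is only a safety limit. `GoalState`, `run_goal`, and `work_step` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GoalState:
    """Progress carried between rounds (hypothetical structure)."""
    attempts: int = 0
    unmet: List[str] = field(default_factory=list)    # criteria still failing
    history: List[str] = field(default_factory=list)  # what was tried so far


def run_goal(
    criteria: Dict[str, Callable[[], bool]],
    work_step: Callable[[GoalState], str],
    max_attempts: int = 3,
) -> GoalState:
    """Work until every criterion passes or the attempt budget is spent."""
    state = GoalState()
    while True:
        # Re-check every criterion each round, so a fix that breaks
        # something else is noticed instead of assumed away.
        state.unmet = [name for name, check in criteria.items() if not check()]
        if not state.unmet:
            return state  # verifiably done
        if state.attempts >= max_attempts:
            return state  # blocked: state.unmet says what is still missing
        state.history.append(work_step(state))
        state.attempts += 1
```

When the function returns with a non-empty `unmet` list, that list and `history` are the raw material for a blocker report, which is the opposite of a loop that simply runs out of iterations and reports success.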

Goal vs. Ordinary Prompt

An ordinary prompt feels like a one-off instruction:

`Fix the tests in this project.`

A Goal prompt is closer to a small task contract:

Goal: Fix the failing tests in the current repository.
Scope: Only modify src/ and tests/. Do not change build scripts.
Definition of done: npm test passes fully, and the changes introduce no lint errors.
Verification command: npm test && npm run lint.
If blocked: After 3 failed attempts, report the remaining failing cases, attempted fixes, and blockers.
Output: Summarize changed files, root causes, and verification results.

The biggest difference is that the Goal prompt defines what “done” means.

Without a definition of done, the Agent can confuse “I changed the code” with “I completed the task.” With clear completion criteria, the Agent has to keep working against external evidence such as tests, logs, diffs, build output, and eval scores.
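One way to keep such a contract from silently losing a field is to hold it as structured data and render the prompt from it. The sketch below is only an illustration: `GoalContract` and its field names are assumptions that mirror the example above, not a format defined by Codex.

```python
from dataclasses import dataclass, fields
from typing import List

# Human-readable labels for each field, matching the example contract.
LABELS = {
    "goal": "Goal",
    "scope": "Scope",
    "definition_of_done": "Definition of done",
    "verification_command": "Verification command",
    "if_blocked": "If blocked",
    "output": "Output",
}


@dataclass
class GoalContract:
    goal: str
    scope: str
    definition_of_done: str
    verification_command: str
    if_blocked: str
    output: str

    def missing(self) -> List[str]:
        """Names of fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the contract as a plain-text Goal prompt."""
        return "\n".join(
            f"{LABELS[f.name]}: {getattr(self, f.name)}" for f in fields(self)
        )
```

Refusing to send a contract while `missing()` is non-empty is a cheap guard against handing over a task with no definition of done.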

Why LLM Judge Stop Conditions Matter

The hardest part of a long task is not making the Agent run commands. It is making it decide:

Is the task actually done?
Did only a local test pass?
Did a fix introduce another problem?
Should it keep searching, run another verification step, or roll back a direction?

That is where an LLM judge stop condition becomes valuable.

Ideally, the Agent should not only check whether the last command exited with code 0. It should also reason over:

whether every user-provided completion criterion has been satisfied;
whether changes stayed within the allowed scope;
whether tests, lint, build, and evals were actually run;
whether failure logs still contain unresolved items;
whether the change introduces obvious side effects or risks;
whether the final report gives a human enough evidence to review quickly.

In other words, the judge does not merely declare success. It also keeps the Agent from ending on unfounded optimism.
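The mechanical part of that checklist can be sketched as a deterministic gate that runs before, or alongside, any model-based judgment. This is an illustrative assumption, not a description of how Codex implements its judge: `Evidence` and `judge` are hypothetical names, and the softer question of side effects and risk is deliberately left to the LLM or a human reviewer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    criteria: Dict[str, bool]  # completion criterion -> satisfied?
    out_of_scope_files: List[str] = field(default_factory=list)
    required_commands: List[str] = field(default_factory=list)
    commands_run: List[str] = field(default_factory=list)
    unresolved_log_items: List[str] = field(default_factory=list)
    report_has_summary: bool = False


def judge(e: Evidence) -> List[str]:
    """Return the reasons the task is NOT done; an empty list means stop."""
    reasons = [f"criterion not met: {c}" for c, ok in e.criteria.items() if not ok]
    reasons += [f"out-of-scope change: {p}" for p in e.out_of_scope_files]
    reasons += [
        f"verification not run: {c}"
        for c in e.required_commands
        if c not in e.commands_run
    ]
    reasons += [f"unresolved failure: {i}" for i in e.unresolved_log_items]
    if not e.report_has_summary:
        reasons.append("final report lacks a change summary")
    return reasons
```

Returning reasons instead of a bare boolean matters: each reason is something the Agent can act on in the next round, or report as a blocker.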

Tasks That Fit Goal Well

Codex Goal / Persistent Goals are especially useful for complex coding work that requires multiple rounds of exploration and verification:

Code migration: moving from an old framework to a new one, from CommonJS to ESM, or from an old API to a new API.
Large refactors: splitting modules, cleaning boundaries, replacing duplicated implementations, and reducing complexity.
Test repair: analyzing failing cases, identifying causes, fixing them, and repeatedly verifying the result.
Dependency upgrades: upgrading frameworks, SDKs, or build tools while handling breaking changes.
Prompt eval optimization: running evals, inspecting failure samples, and adjusting prompts or tool-calling behavior.
Technical debt cleanup: replacing old patterns under a clear rule while preserving behavior.

These tasks all have many intermediate states. The failure cause may not be obvious on the first pass, and completion depends on verification.

Tasks That Should Not Rely on Goal Alone

Goal is not magic. These tasks are risky if handled with only one long prompt:

The target is vague, such as “improve product growth.”
The cycle is long, such as SEO, GEO, or ad optimization over several weeks.
The work requires cross-system scheduling across content, data, ads, support, and business metrics.
The task involves production risk, such as database changes, live configuration, finance operations, or account permissions.
There is no verification mechanism: no tests, no metrics, no logs, and no human acceptance criteria.

These are closer to Missions than Goals.

A Goal works well for hour-scale or one-to-two-day deep execution. A Mission needs state, history, scheduling, human approval, staged reviews, and long-term metrics. SEO / GEO / Ads optimization, for example, is not just “make the Agent write content or adjust parameters in a loop.” It needs persistent records of strategy, experiments, data changes, and next actions.
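The warning signs above can be turned into a rough triage check. The sketch below is a heuristic under stated assumptions: `TaskProfile` and `triage` are hypothetical names, and the two-day threshold is taken from the "one-to-two-day" guideline in the text, not from any product limit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskProfile:
    has_clear_target: bool
    expected_days: float
    crosses_systems: bool
    touches_production: bool
    has_verification: bool


def triage(task: TaskProfile) -> Tuple[str, List[str]]:
    """Return ("goal" | "mission", reasons the task exceeds a single Goal)."""
    reasons: List[str] = []
    if not task.has_clear_target:
        reasons.append("target is vague")
    if task.expected_days > 2:
        reasons.append("cycle is longer than one to two days")
    if task.crosses_systems:
        reasons.append("requires cross-system scheduling")
    if task.touches_production:
        reasons.append("involves production risk")
    if not task.has_verification:
        reasons.append("no verification mechanism")
    return ("mission" if reasons else "goal", reasons)
```

A single flagged reason is treated as enough to route the task away from a lone Goal prompt. That is a conservative choice; a team could reasonably weigh the signals differently.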

A Template for Better Goal Prompts

A useful Goal prompt should include at least these parts:

Goal:
State the final outcome in one sentence.

Background:
Explain the current problem, related files, business constraints, and previous attempts.

Scope:
List which directories, files, and modules may be changed, and which must not be touched.

Definition of done:
List verifiable completion criteria.

Verification commands:
Specify tests, lint, build, evals, or scripts to run.

Failure strategy:
If completion is not possible, ask the Agent to report causes, attempted approaches, and remaining blockers.

Risk boundaries:
Require human confirmation for destructive actions, production config, secrets, databases, or external services.

Delivery format:
Ask for a summary of changes, verification results, risks, and follow-up suggestions.

What determines the quality of a long-running task is often not how elegant the prompt sounds. It is whether the completion criteria are strict and verifiable enough.
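A draft prompt can be checked against this template before it is handed to the Agent. The following is a minimal sketch, assuming the prompt is plain text with each section introduced by a `Name:` line as in the template above; `missing_sections` is a hypothetical helper, not part of any tool.

```python
from typing import Dict, List, Optional

REQUIRED_SECTIONS = [
    "Goal",
    "Background",
    "Scope",
    "Definition of done",
    "Verification commands",
    "Failure strategy",
    "Risk boundaries",
    "Delivery format",
]


def missing_sections(prompt: str) -> List[str]:
    """Return template sections that are absent or have an empty body."""
    bodies: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in prompt.splitlines():
        line = raw.strip()
        head, sep, rest = line.partition(":")
        if sep and head in REQUIRED_SECTIONS:
            # A known section header starts a new body.
            current = head
            bodies[current] = rest.strip()
        elif current is not None and line:
            # Any other non-empty line belongs to the current section.
            bodies[current] = (bodies[current] + " " + line).strip()
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name)]
```

The check only proves that each section exists and is non-empty. Whether the completion criteria are actually verifiable still needs a human read.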

Why Goal Buddy Helps

Many long-running tasks fail not because the Agent is weak, but because the human did not define the task clearly enough at the start.

A helper such as Goal Buddy is valuable because it prepares the goal, boundaries, completion criteria, and verification plan before the task is handed to the Agent. It acts like a task preflight checklist and asks:

What visible result should this task produce?
Which directories can be changed, and which are off limits?
Which command proves success?
If the task fails, should the Agent keep trying, and how far?
Should the work be split into staged commits?
Which actions require human approval?

This may feel verbose, but it greatly reduces the chance of the Agent drifting, stopping too early, or producing a large diff that is hard to review.

Practical Advice for Codex, Claude Code, and OpenCode Users

If you are using Codex, Claude Code, OpenCode, OpenClaw, or a similar coding agent, approach long-running tasks this way:

Commit the current workspace first so you have a clean rollback point.
Write the request as a Goal, not a vague instruction.
Define what may and may not be changed.
Provide verification commands, ideally commands the Agent can run after each round.
Require the Agent to report blockers when it cannot finish instead of inventing a success state.
Put human confirmation around high-risk operations such as deleting files, changing databases, or touching deployment config.
Only accept a final result that includes test results and a change summary.

The right way to use a long-running Agent is not “let it do whatever overnight.” It is to give it a clear target, solid guardrails, and a verifiable exit.
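The scope guardrail in particular is easy to verify mechanically. Below is a minimal sketch, assuming the list of changed paths has already been collected (for example from `git diff --name-only`); `out_of_scope` and its parameters are hypothetical names introduced for illustration.

```python
from typing import Iterable, List, Sequence


def _under(path: str, prefix: str) -> bool:
    """True if path is the prefix itself or lives inside that directory."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def out_of_scope(
    changed: Iterable[str],
    allowed: Sequence[str],
    forbidden: Sequence[str] = (),
) -> List[str]:
    """Return changed paths that are forbidden or outside every allowed prefix."""
    violations = []
    for path in changed:
        is_forbidden = any(_under(path, p) for p in forbidden)
        is_allowed = any(_under(path, p) for p in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations
```

Matching on a full directory boundary, instead of a bare string prefix, keeps a path such as `srcs/x.ts` from slipping through an allowance for `src/`.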

Summary

Codex Goal / Persistent Goals move coding agents from “execute one instruction” toward “keep working toward an objective.”

They fit complex but bounded engineering tasks: migrations, refactors, test fixes, dependency upgrades, and eval optimization. They are a poor fit for vague, long-cycle business work without verification criteria; those should be designed as Mission systems instead.

The future competition among AI Agents may not be only about which model writes better code. It may be about which Agent can keep moving toward a goal, judge the right stopping point, and leave enough evidence for humans to review.

References:

Long-Running Tasks on KnightLi Blog