DeepSeek on KnightLi Blog

Two Ways to Use DeepSeek Models with Codex: Local Gateway and OpenRouter BYOK

Sun, 24 May 2026 09:52:55 +0800

If you want Codex to use DeepSeek, the first instinct is usually to edit ~/.codex/config.toml:

1
2

model = "deepseek-chat"
base_url = "https://api.deepseek.com"

That idea can work in some older versions or in regular OpenAI SDK scenarios. But with the current Codex CLI, it can easily run into a lower-level mismatch: custom model providers in Codex use the OpenAI Responses protocol, while DeepSeek’s official API is mainly exposed through an OpenAI-compatible Chat Completions interface.

My local version is currently codex-cli 0.111.0. codex --help shows support for configuration entry points such as --config, --model, and --profile. The official OpenAI Codex configuration reference is also explicit: model_providers.<id>.wire_api currently supports only responses, and defaults to responses when omitted.

DeepSeek’s official docs, meanwhile, show the call path as https://api.deepseek.com/chat/completions, with examples such as client.chat.completions.create(...). So the issue is not that DeepSeek cannot be called through OpenAI-style tooling. The issue is that the request semantics Codex sends are not exactly the same as what DeepSeek’s native API understands.

That is why changing base_url directly to https://api.deepseek.com may produce symptoms such as:

The request path does not match, resulting in a 404 or an unexpected response format.
Multi-turn conversations, tool calls, or patch generation fail during parsing.
tool_calls order, message structure, or streaming event format does not line up.
The model seems able to answer a plain prompt, but starts failing once Codex does real work.

The steadier approach is to put a translation layer between Codex and DeepSeek. There are two common routes.

Method 1: Bridge DeepSeek Through a Local Gateway

A local gateway should do more than simple forwarding. Its job is to convert Responses-style requests from Codex into Chat Completions-style requests that DeepSeek can handle, then convert DeepSeek’s result back into a format Codex can consume.

If you use a local gateway such as ccx, the configuration idea looks roughly like this:

[profiles.deepseek-ccx]
model = "deepseek-v4-flash"
model_provider = "ccx-bridge"

[model_providers.ccx-bridge]
name = "Local CCX Gateway"
base_url = "http://localhost:3000/v1"
env_key = "DEEPSEEK_API_KEY"

Then set the DeepSeek key in your terminal and start Codex with that profile:

1
2

export DEEPSEEK_API_KEY="your-deepseek-key"
codex --profile deepseek-ccx

In PowerShell:

1
2

$env:DEEPSEEK_API_KEY="your-deepseek-key"
codex --profile deepseek-ccx

There are two details to watch.

First, base_url should point to the gateway endpoint exposed to Codex, not the official DeepSeek address. The gateway calls DeepSeek behind the scenes.

Second, the correct value for env_key depends on how the gateway handles authentication. Some gateways read the official DeepSeek key directly. Others ask you to provide a local proxy key, while storing the DeepSeek key in the gateway backend. In that case, env_key should be changed to whatever environment variable the gateway expects.

This route is local and controllable, and it is easier to reason about latency and cost. The tradeoff is that you must confirm the gateway really supports the current Responses semantics used by Codex, rather than only acting as a basic Chat Completions proxy.

Method 2: Use OpenRouter BYOK as an Online Bridge

If you do not want to run a local gateway, OpenRouter BYOK is another option. BYOK means binding your own upstream provider key to OpenRouter, which then handles routing and forwarding.

The most common mistake here is the environment variable. Codex is calling OpenRouter, so env_key should usually be OPENROUTER_API_KEY, not DEEPSEEK_API_KEY. The DeepSeek key should be added in OpenRouter’s BYOK or provider key settings.

Example configuration:

[profiles.deepseek-openrouter]
model = "deepseek/deepseek-chat"
model_provider = "openrouter"

[model_providers.openrouter]
name = "OpenRouter"
base_url = "https://openrouter.ai/api/v1"
env_key = "OPENROUTER_API_KEY"

Start it like this:

1
2

export OPENROUTER_API_KEY="your-openrouter-key"
codex --profile deepseek-openrouter

PowerShell:

1
2

$env:OPENROUTER_API_KEY="your-openrouter-key"
codex --profile deepseek-openrouter

Then add your DeepSeek provider key in the OpenRouter dashboard. OpenRouter’s BYOK documentation says provider keys are stored encrypted and used for routing to the corresponding provider.

This route saves you from maintaining a local gateway and feels more like using a regular third-party API proxy. The downside is that an online service sits in the middle, so troubleshooting may require checking Codex, OpenRouter, and DeepSeek error messages together.

Should You Keep Using the deepseek-chat Model Name?

In DeepSeek’s documentation as of May 2026, the recommended model names include deepseek-v4-flash and deepseek-v4-pro, with a note that compatibility aliases such as deepseek-chat and deepseek-reasoner will be deprecated after 2026-07-24.

For new configurations, it is better to test:

`1`	`model = "deepseek-v4-flash"`

If you are using OpenRouter, follow OpenRouter’s model naming format, for example:

`1`	`model = "deepseek/deepseek-chat"`

The actual available names depend on your gateway or OpenRouter’s model page. When the model name is wrong, errors usually look like model not found, 404, or the provider failing to find the matching endpoint.

Why Directly Setting DeepSeek’s Official base_url Is Not Recommended

You can certainly try this as an experiment:

[profiles.deepseek-direct]
model = "deepseek-v4-flash"
model_provider = "deepseek"

[model_providers.deepseek]
name = "DeepSeek"
base_url = "https://api.deepseek.com"
env_key = "DEEPSEEK_API_KEY"

But this is more of a debugging experiment than a stable setup. Codex talks to custom providers through the Responses protocol, while DeepSeek’s official examples use /chat/completions. If DeepSeek or Codex adds a full compatibility layer later, direct connection may become simple. Until then, a bridge layer is more reliable.

What If Codex Still Uses OpenAI After Editing the Config?

First, confirm the config file location. The global config should be:

`1`	`~/.codex/config.toml`

The project-level .codex/config.toml is not the right place for machine-level provider settings such as model_provider and model_providers. The official OpenAI docs also note that project-level configuration does not override local provider and authentication fields.

If Codex still asks you to log in through the web, or appears to use the default OpenAI model, log out first:

`1`	`codex logout`

Some older tutorials write this as /logout inside the interactive UI. With the current CLI, running codex logout directly in the terminal is the more reliable option.

You can also run a quick check with a temporary profile:

`1`	`codex --profile deepseek-ccx`

Or:

`1`	`codex -c model_provider=ccx-bridge -c model=deepseek-v4-flash`

If that works, the config itself is readable. If it does not, check the profile name, TOML syntax, and whether the environment variable only exists in the current shell session.

Troubleshooting Checklist

401: The key is wrong, or env_key points to the wrong environment variable.
404: base_url or the model name is wrong, or a Responses request is being sent to an endpoint that only supports Chat Completions.
tool_calls, patch, or streaming parse errors: the protocol bridge is likely incomplete.
Still prompted to log in to OpenAI: run codex logout, then confirm you are using the correct profile.
PowerShell environment variable disappears in a new window: $env:... only applies to the current session. Use user environment variables if you need it to persist.
OpenRouter BYOK is not using your own DeepSeek key: check whether the provider key is bound in OpenRouter, whether the current OpenRouter API key is allowed to use it, and whether fallback is enabled.

Conclusion

Using DeepSeek with Codex is not impossible through config.toml. The catch is that changing only base_url is usually not enough.

The two steadier routes today are:

Use a local gateway as a protocol bridge: Codex talks to the local gateway, and the gateway talks to DeepSeek.
Use OpenRouter BYOK as an online proxy: Codex talks to OpenRouter, while the DeepSeek key is bound in the OpenRouter dashboard.

If you only want a quick test, OpenRouter is easier. If you want tighter control over keys, cost, and logs, a local gateway is better for long-term tinkering.

References:

DeepSeek-TUI: Turning DeepSeek V4 into a Terminal Coding Agent

Sat, 16 May 2026 22:41:41 +0800

DeepSeek-TUI is an open source project that brings DeepSeek V4 into terminal-based development workflows. It is not just a chat wrapper. It is closer to a “command-line coding agent” like Claude Code or Codex CLI: it can read files, edit code, run commands, call tools, and keep working through tasks in a TUI.

If you already switch between an editor and a terminal, the value of this kind of tool is straightforward: you do not need to copy code back and forth into a web chat window, and you do not need to manually describe the whole project structure. You give it a task, and it can read context from the current workspace, plan steps, make changes, then return the result for your review.

It Solves the Entry Point Problem for DeepSeek

DeepSeek models already provide strong reasoning and coding capabilities, but model capability needs an engineering layer before it can land in real development workflows.

Web chat is suitable for asking questions, but not for long-running project edits. APIs are suitable for system integration, but individual developers still need to build tool calling, context management, file operations, and permission control themselves. DeepSeek-TUI tries to fill this layer: it wraps DeepSeek V4 into an Agent that can work inside the terminal.

According to the project description, its main capabilities include:

A terminal TUI;
Conversation and task execution for DeepSeek V4;
Tool calling and file operations;
1M context support;
Auto mode;
Sub-agents;
Sandboxed execution;
A persistent task queue.

Together, these features are not aimed at making the model sound more human. They are aimed at making the model easier to bring into the development environment.

A TUI Fits Long Tasks Better Than Plain CLI Text

Many AI CLI tools start with plain text interaction: enter a prompt, wait for output, then copy commands or add more context. This is simple, but longer tasks quickly become messy.

The advantage of a TUI is that it can place conversations, files, execution results, and task status in a more stable interface. For a coding Agent, that matters. A code task is rarely a single question and answer. It often includes:

Understanding the project structure;
Finding relevant files;
Editing code;
Running tests or commands;
Fixing issues based on errors;
Summarizing changes.

If the interface is only a stream of logs, it is hard for the user to see where the Agent is in the process. A TUI at least provides a better place to observe and take over.

Auto Mode Is Best for Tasks with Clear Boundaries

The Auto mode mentioned by DeepSeek-TUI is best for tasks with clear boundaries. For example: fixing a small bug, adding a script, changing a configuration, organizing a set of documents, or implementing a local feature.

These tasks have something in common: the goal is clear, the verification method is clear, and the impact scope is controllable. The Agent can inspect files, edit files, run commands, and then hand the result back to the user for confirmation.

But Auto mode should not mean unlimited permission. In real projects, file deletion, large-scale refactors, database migrations, and deployment commands should all require explicit confirmation. The efficiency of coding Agents comes from automation, but so does the risk. The more a tool can execute commands, the more it needs sandboxing, permission boundaries, and human review.

Sub-Agents Matter Because They Split Tasks

Sub-agents are not a new concept, but they are useful in coding scenarios.

A moderately complex task usually requires several kinds of work at the same time: someone reads the code, someone changes the implementation, someone checks tests, and someone organizes documentation. Traditional multi-agent systems often feel ornamental because they have no real tools or real workspace; they only discuss inside a conversation.

If sub-agents can work with the file system, command execution, and task queues, they become more like a task decomposition mechanism. For example, one sub-agent can analyze dependencies, another can modify a specific module, and the main agent can integrate the result. This can reduce the problem of putting too much unrelated information into one context.

Of course, sub-agents also add cost: more tokens, more complex state, and responsibility boundaries that are harder to track. They are better suited to medium-complexity tasks and above, not necessarily every small edit.

1M Context Is Not Magic, but It Helps with Projects

1M context sounds exaggerated, but in coding scenarios it is not just a marketing number.

The context of a real codebase is fragmented: README files, configuration files, type definitions, tests, call chains, historical conventions, and error logs can all affect one change. Longer context can reduce the problem of editing after seeing only a local fragment, and it can help the model retain more project constraints.

Still, longer context does not automatically mean better judgment. Code tasks still need retrieval, filtering, and verification. Putting an entire project into context is not necessarily better than reading the relevant files precisely. A good coding Agent should treat long context as a buffer, not as a shortcut that replaces engineering judgment.

Who It Is Best For

DeepSeek-TUI is better suited to several groups:

Developers who want to use DeepSeek for coding tasks in the terminal;
People who do not want to build tool calling and file operation frameworks themselves;
Users familiar with Claude Code or Codex CLI who want to try a DeepSeek-based entry point;
People who need local project context instead of only asking about code snippets in a web page;
Developers who want to put AI coding workflows into a command-line environment.

If you only occasionally ask how to write a function, web chat is enough. If you want the model to participate directly in project edits, a terminal Agent becomes more meaningful.

Risks to Watch

There are three things to watch most closely with this kind of tool.

The first is permissions. As long as a tool can read and write files or execute commands, you need to know what it can access by default, whether it can delete files, whether it can access the network, and whether dangerous commands require confirmation.

The second is rollback. Before using it, it is best to keep the Git working tree clean, so every Agent change can be clearly seen through git diff. Do not let an Agent automatically edit a project while many unrelated changes are already uncommitted.

The third is verification. Code written by an Agent does not mean the task is complete. Tests, builds, linting, and human review still need to remain. AI coding tools can speed up progress, but they cannot replace final engineering confirmation.

Conclusion

The significance of DeepSeek-TUI is not that it adds another chat client. It puts DeepSeek V4 into a terminal environment that is closer to real development work.

For developers, model capability is only the first step. The real experience depends on whether it can read a project, safely edit files, run verification commands, maintain state in long tasks, and let the user take over at any time.

If you want to use DeepSeek for daily code changes, project reading, and automated development tasks, DeepSeek-TUI is worth watching. The direction is also clear: AI coding tools are moving from “answering code questions” to “participating in project execution.”

Running DeepSeek 4 Locally: Antirez's ds4 Experiment on Apple Silicon Mac

Mon, 11 May 2026 08:51:37 +0800

Antirez has open sourced a new project: ds4. It is not a general-purpose LLM framework, but a local inference engine for DeepSeek V4 Flash, with a focus on Apple Silicon and the Metal backend.

Project URL: https://github.com/antirez/ds4

What is ds4?

ds4 has a clear goal: running DeepSeek V4 Flash locally on a Mac.

It currently provides three ways to use it:

Interactive CLI.
HTTP server.
An experimental Agent mode.

Judging from its positioning, it is more like an inference project deeply optimized for one specific model than a replacement for general-purpose tools such as llama.cpp, Ollama, or vLLM.

Why it is worth watching

There are three main reasons this kind of project is worth following.

First, the author is Antirez, the creator of Redis. He has long focused on low-level systems, performance, and simple tools, and his projects are usually quite direct in style.

Second, DeepSeek V4 Flash points toward efficient inference. If the local running experience is good enough, it could be very attractive for Mac users.

Third, ds4 directly targets Apple Metal. Compared with the route of supporting every platform first and optimizing later, it feels more like a project trying to go deep on one well-defined scenario.

Who should try it

ds4 is better suited for users who:

Use an Apple Silicon Mac.
Want to run DeepSeek V4 Flash locally.
Care about Metal inference performance.
Are willing to try an alpha-stage project.
Want to study lightweight inference engines and model runtime details.

If your goal is stable deployment, cross-platform operation, or OpenAI API-compatible infrastructure, it may not be the first choice at this stage. It is better treated as an experimental tool and a technical project to watch.

How to use it

The basic workflow in the project README is to build it first, then run it.

1
2
3

git clone https://github.com/antirez/ds4.git
cd ds4
make

Run it interactively:

./ds4

Start the HTTP server:

`1`	`./ds4 --server`

Agent mode:

`1`	`./ds4 --agent`

For exact parameters and model file preparation, follow the repository README, because the project is still changing quickly.

Current risks

ds4 is still at an early stage, so set expectations before using it:

Features may be incomplete.
Parameters, model formats, and command-line behavior may change.
Compatibility mainly revolves around Apple Silicon and Metal.
Agent mode is more experimental and is not suitable for direct production use.
When something breaks, you may need to read the README, issues, or source code yourself.

In other words, it is currently more of an open source experiment worth trying than a one-click tool for ordinary users.

How it differs from general inference tools

General-purpose inference tools usually aim for broad compatibility across model formats, platforms, backends, and APIs. ds4 takes a narrower path: local DeepSeek V4 Flash inference on Metal.

That choice has both benefits and trade-offs.

The benefit is that the implementation can stay focused, making performance and user experience easier to optimize around a single target. The trade-off is a limited scope: it is not meant to run every possible model, nor to replace a complete deployment platform.

If you already use llama.cpp or Ollama, ds4 is better treated as a supplementary testing tool, not an immediate replacement for your existing workflow.

Summary

The interesting part of ds4 is not that it is yet another local LLM tool. It is that its scope is intentionally narrow: DeepSeek V4 Flash, Apple Silicon, Metal, and local inference.

If you have a suitable Mac and are willing to tinker with an early-stage project, it is worth watching its performance, model support approach, and server/agent capabilities. For production environments, it is better to keep observing until the interfaces and usage patterns become more stable.

References

GitHub project: https://github.com/antirez/ds4

Why DeepSeek Became the Cost-Saving Key in This Round of AI Coding Tools

Mon, 11 May 2026 04:59:00 +0800

In this round of AI coding tool competition, the surface battle is about model capability, plugin ecosystems, and agent automation. But once you actually use these tools, the first wall you hit is cost.

Claude Code, Codex, OpenClaw, and Superpowers are all useful, but they share one trait: once a task becomes complex, they eat tokens aggressively. They need to read the project, build a plan, call tools, summarize context, repeatedly check results, and sometimes launch multiple subtasks. The smarter the model and the more automated the workflow, the easier it is for the bill to quietly grow.

That is why DeepSeek has become important in this cycle. Not merely because it can write code, but because its long context and cache pricing happen to hit the most expensive part of AI coding tools.

Why Agent Tools Burn So Many Tokens

Traditional chat-style coding assistants usually work in question-and-answer mode. You ask how to write a function, and the model returns a code snippet. This still costs tokens, but it is relatively controllable.

Agent tools are different. They do not just answer questions. They enter the project like a temporary engineer:

scan directories and key files;
understand the requirement and existing architecture;
make a plan;
modify files;
run commands or tests;
keep fixing based on errors;
summarize what changed at the end.

During this process, the model repeatedly reads the same context. Project descriptions, code snippets, tool outputs, conversation history, plans, and error logs all get placed back into the context. Once the task is a little complex, hundreds of thousands of tokens can disappear quickly.

If you add more aggressive plugins, the cost becomes even more obvious. Some OpenCode or Claude Code enhancement tools may organize a whole agent team by default. You only wanted to change a small feature, but it may still start planning, review, execution, and retrospective steps. The task may look more “intelligent”, but the token count keeps climbing.

The Advantage of Superpowers Is On-Demand Activation

One advantage of tools like Superpowers is that they do not force a full agent workflow onto every task.

Most of the time, you can still let Claude Code, OpenCode, or Codex work in their normal mode. Only when you explicitly call a skill, such as brainstorming, planning, executing a plan, or doing a retrospective, does it enter a heavier automation flow.

That matters for cost.

AI coding should not use heavy artillery for every task. Changing one config line, checking one error, or writing a small script can be handled through ordinary conversation. Only complex refactors, cross-file changes, long-document processing, and multi-round validation deserve a full agent workflow.

The stronger the tool, the more you need to control when it triggers. Otherwise, more automation simply means more waste.

DeepSeek’s Key Advantage Is Cheap Cache Hits

One important reason DeepSeek fits these agent tools is its low cache-hit cost.

AI coding tasks contain a lot of repeated prefixes: project background, system prompts, tool instructions, file content, and earlier conversation turns often appear again in later requests. If the model service supports prompt caching, those repeated parts become much cheaper after a cache hit.

For many models, a cache hit is only somewhat cheaper than a miss, perhaps around one third of the original price. DeepSeek’s advantage is that the gap after a cache hit can be much larger. For long-context, multi-round agent workflows that repeatedly read the same project, this gap shows up directly on the bill.

In other words, DeepSeek is not necessarily the strongest answer on every single turn. But in scenarios with long tasks, many rounds, and repeated context reads, its cost structure is unusually suitable for AI coding.

Long Context Makes Claude Code More Useful

When Claude Code or similar tools are connected to DeepSeek V4, another clear advantage is long context.

AI coding tools fear insufficient context. Once context runs short, compression becomes frequent. Once compression becomes frequent, previously read details may be lost. The model may start forgetting the project structure, constraints, or why a certain file was changed, and quality declines afterward.

DeepSeek V4’s long-context capability makes it better suited for code repositories, document batch processing, subtitle translation, and site article cleanup. Especially when connected to tools like Claude Code or OpenClaw, the right configuration can delay context compression and preserve more project detail.

That is why some tasks feel “durable” when run on DeepSeek. It may not be dazzling at every step, but it can tolerate long-running, low-cost, repeated calls.

How to Split Work Between V4 Pro and V4 Flash

DeepSeek V4 Pro and V4 Flash should not be mixed casually.

For simple tasks, DeepSeek V4 Flash is usually a better fit. It is fast and cheap, and is often enough for:

subtitle translation;
document cleanup;
ordinary script generation;
small code edits;
lightweight OpenClaw tasks;
simple site content processing.

For complex tasks, consider DeepSeek V4 Pro:

large-scale refactoring;
multi-module code understanding;
complex reasoning;
long-chain agent tasks;
high-risk code changes;
engineering tasks that require stronger planning.

Many people want to attach the strongest model immediately, but that is often uneconomical. The practical way to use AI coding tools is to layer tasks: let the cheaper model handle a large amount of routine work, and reserve the expensive model for key decision points.

MiniMax, Doubao, and DeepSeek Occupy Different Positions

Among domestic models and plans, MiniMax, Doubao, Kimi, and DeepSeek each have their own place.

MiniMax’s advantage is generous quota, low price, and broad functionality. It may not be the smartest coding model, but it is cost-effective for translation, lightweight cleanup, and batch processing. For example, batch subtitle processing, format conversion, and simple proofreading are good fits for MiniMax-style plans.

Doubao’s advantage is a broader tool ecosystem: image, video, search, TTS, possible STT, and embedding can be connected together. It feels more like a comprehensive toolbox.

DeepSeek’s position is clearer: text, code, long context, and low-cost caching. It lacks a complete image generation, voice, and video ecosystem, and its weaknesses are obvious. But in AI coding and long-text agent workflows, its strengths are long enough to matter.

So this is not about one tool replacing another. It is about splitting the task and using each tool where it fits.

Saving Money Is Not Just Choosing a Cheap Model

Saving money in AI coding does not mean simply switching every request to the cheapest model.

The effective methods are:

Do not start a heavy agent for simple tasks.
Do not use Pro when Flash is enough.
Use cache as much as possible for long tasks.
Keep repeated context stable, so meaningless changes do not break cache hits.
Let a cheaper model draft and batch-process first, then use a stronger model for key reviews.
Tell the agent clearly not to repeat facts or summarize the same point again and again.

The last point matters more than it looks. AI tools are prone to verbosity, and verbosity is not only a reading problem; it is also a cost problem. Putting “describe each fact once and state each opinion once” into the prompt can improve both article quality and token consumption.

What AI Coding Workflows DeepSeek Fits Best

DeepSeek is best suited for:

reading long code repositories;
lightweight multi-file edits;
batch document cleanup;
batch subtitle translation;
Hugo article cleanup;
agent plan execution;
low-cost automation with lots of repeated context.

It is not the best fit for every task. If you need especially strong frontend taste, complex product judgment, or cross-modal creation, you may still need Claude, GPT, Gemini, Doubao, or other tools.

But whenever a task is long-text, long-context, repeated-call, and cost-sensitive, DeepSeek can easily become the first choice.

Summary

In this round of AI coding tools, DeepSeek’s value is not just that a domestic model can write code. Its real value is that it addresses the most practical pain point of agent tools: long tasks are too expensive.

Tools like Claude Code, OpenClaw, and Superpowers make the development process increasingly automated, but behind that automation are massive context reads and multi-round calls. Whoever can lower this part of the cost can make AI coding go from “fun once in a while” to “affordable every day”.

DeepSeek’s long context, low cache cost, and layered use of V4 Flash / V4 Pro put it in exactly that position.

The real cost-saving key in this cycle is not avoiding good models. It is combining good models, cheap models, cache, and agent workflows properly. Once you understand that bill, AI coding tools can become real productivity rather than a beautiful but expensive toy.

DeepSeek-TUI: Run a DeepSeek Coding Agent in Your Terminal

Fri, 08 May 2026 13:41:15 +0800

DeepSeek-TUI is an AI coding agent that runs in the terminal. It is built around DeepSeek V4 models and starts from the deepseek command. Inside a keyboard-driven TUI, it can read and edit files, run shell commands, search the web, manage git, connect to MCP servers, and coordinate sub-agents.

It is closer to a terminal workbench than a simple chat CLI. The goal is not only to send a question to a model, but to combine code reading, file edits, commands, diagnostics, session recovery, and workspace rollback in one local workflow.

The repository is mainly written in Rust and uses the MIT license. Its GitHub description is: “Coding agent for DeepSeek models that runs in your terminal.”

Who It Is For

DeepSeek-TUI fits developers who prefer terminal workflows and want to use DeepSeek models for real local development tasks.

It is useful when you want to:

Use DeepSeek models for code changes and project analysis.
Work without opening a full IDE.
Let an AI tool read and modify a local workspace.
Switch between Plan, Agent, and YOLO modes.
Save sessions and resume long tasks.
Roll back workspace changes.
Connect MCP, LSP diagnostics, HTTP/SSE runtime APIs, and skills.

If you only need simple Q&A, a web client or lightweight CLI is enough. DeepSeek-TUI is better when the model should become part of your local development loop.

Installation

DeepSeek-TUI ships Rust binaries. The common entry command is deepseek, and the companion TUI binary is deepseek-tui.

Using npm:

1
2
3

npm install -g deepseek-tui
deepseek --version
deepseek --model auto

The npm package is an installer and wrapper that downloads prebuilt Rust binaries. It requires Node.js >=18.

Using Cargo:

1
2

cargo install deepseek-tui-cli --locked
cargo install deepseek-tui --locked

Using Homebrew:

1
2

brew tap Hmbown/deepseek-tui
brew install deepseek-tui

You can also download prebuilt binaries from GitHub Releases. The README lists Linux x64/ARM64, macOS x64/ARM64, and Windows x64 builds.

Docker:

docker run --rm -it \
  -e DEEPSEEK_API_KEY \
  -v "$PWD:/workspace" \
  ghcr.io/hmbown/deepseek-tui:latest

In mainland China, use npm or Cargo mirrors, or download release binaries manually.

Configure the API Key

On first launch, DeepSeek-TUI asks for your DeepSeek API key and saves it to:

`1`	`~/.deepseek/config.toml`

You can also configure it explicitly:

1
2

deepseek auth set --provider deepseek
deepseek auth status

Or use an environment variable:

1
2

export DEEPSEEK_API_KEY="YOUR_KEY"
deepseek

Check the setup:

`1`	`deepseek doctor`

If the wrong key source is used, run deepseek auth status. Saved config keys take precedence over the keyring and environment variables.

Clear a saved key:

`1`	`deepseek auth clear --provider deepseek`

Auto Mode

Auto mode:

`1`	`deepseek --model auto`

Inside the TUI:

`1`	`/model auto`

Auto mode chooses both:

Model: deepseek-v4-flash or deepseek-v4-pro
Thinking: off, high, or max

Before the real request, DeepSeek-TUI makes a small routing call to analyze the latest request and recent context. Simple tasks can stay on Flash with thinking off; coding, debugging, architecture, release, security review, or ambiguous multi-step work can move to Pro or higher thinking.

auto is local to DeepSeek-TUI. The upstream API receives the concrete model and thinking setting chosen for that turn. For benchmarks, strict cost control, or fixed behavior, use an explicit model.

Modes

Mode	Use
Plan	Read-only exploration and planning
Agent	Default interactive mode with approval gates
YOLO	Auto-approve tools in a trusted workspace

Plan is for investigation. Agent is safer for everyday coding. YOLO is fast but risky and should only be used in trusted workspaces and low-risk branches.

Tooling

The README lists a broad tool set:

File reads/writes and apply patch.
Shell execution.
Git operations.
Web search and browse.
Sub-agents.
MCP servers.
LSP diagnostics.
Session save/resume.
Workspace rollback.
Durable task queue.
HTTP/SSE runtime API.
Skills system.

LSP diagnostics are especially useful because errors from rust-analyzer, pyright, typescript-language-server, gopls, clangd, and similar tools can be fed back after edits.

Workspace rollback uses side-git snapshots and provides /restore and revert_turn. It does not touch the repository’s own .git, but normal git commits are still the safest baseline.

Common Commands

deepseek
deepseek "explain this function"
deepseek --model deepseek-v4-flash "summarize"
deepseek --model auto "fix this bug"
deepseek --yolo
deepseek auth set --provider deepseek
deepseek doctor
deepseek doctor --json
deepseek models
deepseek sessions
deepseek resume --last
deepseek resume <SESSION_ID>
deepseek fork <SESSION_ID>
deepseek serve --http
deepseek serve --acp
deepseek pr <N>
deepseek mcp list
deepseek mcp validate
deepseek update

Zed and ACP

DeepSeek-TUI can run as an Agent Client Protocol server:

{
  "agent_servers": {
    "DeepSeek": {
      "type": "custom",
      "command": "deepseek",
      "args": ["serve", "--acp"],
      "env": {}
    }
  }
}

The README notes that the first ACP slice supports new sessions and prompt responses, while tool-backed editing and checkpoint replay are not exposed through ACP yet.

Configuration and Providers

User config:

`1`	`~/.deepseek/config.toml`

Workspace overlay:

`1`	`<workspace>/.deepseek/config.toml`

Some sensitive fields, such as api_key, base_url, provider, and mcp_config_path, are denied in workspace overlays.

DeepSeek-TUI supports the default deepseek provider plus NVIDIA NIM, Fireworks, OpenAI-compatible endpoints, SGLang, vLLM, Ollama, and others.

OpenAI-compatible example:

1
2

deepseek auth set --provider openai --api-key "YOUR_OPENAI_COMPATIBLE_API_KEY"
OPENAI_BASE_URL="https://openai-compatible.example/v4" deepseek --provider openai --model glm-5

Ollama example:

1
2

ollama pull deepseek-coder:1.3b
deepseek --provider ollama --model deepseek-coder:1.3b

Cost and Context

DeepSeek-TUI is designed around DeepSeek V4. The README mentions deepseek-v4-pro and deepseek-v4-flash, 1M-token context windows, token usage, cost estimates, and prefix-cache telemetry.

For light tasks, deepseek-v4-flash or auto mode may be enough. For complex refactors, long-context debugging, and architecture work, use higher thinking or Pro.

Pricing and discounts change, so check DeepSeek’s official pricing page and the current TUI cost estimates before relying on numbers.

Suggested Workflow

Start safely:

Try it in a small test repository.
Run deepseek doctor.
Use Plan mode for read-only exploration.
Use Agent mode for small edits.
Review changes with git diff and tests.
Learn /restore and session recovery.
Use YOLO only in trusted temporary branches.

Do not store API keys in project files. For company code, confirm provider, logging, web search, and compliance rules first.

Summary

DeepSeek-TUI is a full terminal AI coding agent. It brings DeepSeek V4, TUI interaction, tool calls, LSP diagnostics, session recovery, rollback, MCP, and skills into one Rust-based workflow.

It is not the lightest DeepSeek client, but its strength is moving from chat to executable local development.

References

DeepSeek V4 Local Private Deployment: Choosing Domestic Chips or Consumer GPU Clusters

Fri, 08 May 2026 09:39:35 +0800

After DeepSeek V4 was released, many enterprises started asking one question: can we avoid external APIs and deploy the model in our own data center, private cloud, or dedicated cluster?

This is a very practical need. Finance, healthcare, government, manufacturing, legal, and R&D teams often cannot send internal documents, code, contracts, tickets, or customer data directly to public cloud models. For these scenarios, DeepSeek V4 is attractive not only because of model capability, but because it gives enterprises an option closer to controllable LLM infrastructure.

However, local deployment of DeepSeek V4 is not as simple as downloading a model and finding a few GPUs. Especially for very large MoE models such as Pro, total parameter size, active parameters, context length, KV cache, concurrency, and inference framework all directly affect hardware cost. What enterprises really need is not blindly chasing the full version, but first deciding what deployment shape the business actually needs.

Clarify the Deployment Goal First

Enterprise local private deployment usually has three goals:

Keep data inside the domain: internal documents, code, customer materials, logs, and knowledge bases do not leave the enterprise environment.
Make operations stable and controllable: model services, permissions, audit, logs, and upgrade cadence are controlled by the enterprise.
Reduce long-term cost: for high-frequency calls, local inference may be more controllable than long-term external API purchases.

If only a few employees ask occasional questions, local deployment may not be cost-effective. Private deployment is truly suitable for high-frequency, stable, data-sensitive, and workflow-defined scenarios, such as:

Internal knowledge-base Q&A.
Code review and development assistants.
Customer-service ticket summarization.
Contract, medical-record, and report analysis.
Database query assistants.
Agent workflow automation.

These scenarios share the same traits: sensitive data, stable call patterns, and the ability to fit into enterprise governance through permissions and logs.

Do Not Chase Full Pro From Day One

Common DeepSeek V4 versions include Pro and Flash. In public materials, Pro targets stronger reasoning and complex Agent tasks, while Flash emphasizes cost and response speed. Enterprises should not assume every workload needs Pro.

You can split tasks by complexity:

Simple Q&A, summarization, classification, and tag generation: prioritize Flash or smaller models.
Internal knowledge-base retrieval augmentation: Flash is enough for many cases; RAG, permissions, and retrieval quality matter more.
Code Agents, complex reasoning, and long-context analysis: then evaluate Pro.
High-value, low-frequency tasks: Pro can be used, but high concurrency may not be necessary.
Regular office assistants: there is no need to occupy the most expensive inference resources for long periods.

The advantage of MoE models is that each inference only activates part of the parameters, but this does not mean the hardware pressure is small. Weight storage, expert parallelism, network communication, context cache, and concurrent scheduling are still heavy. With 1M-token-level long context in particular, the real resource consumer is often not a single answer, but long context, multi-user concurrency, and persistent sessions.

Domestic Chip Route: Better for Enterprise Private Cloud

If an enterprise already has a domestic compute pool, or has requirements around Xinchuang, compliance, or supply-chain control, it can first evaluate domestic chips such as Ascend and Cambricon.

The advantages of this route are:

Better alignment with localization and supply-chain control requirements.
Suitable for enterprise data centers, dedicated clouds, and government/enterprise projects.
Easier to unify permissions, audit, resource isolation, and operations.
Friendlier to long-term stable services.

But the domestic chip route also has three practical issues.

First, framework adaptation. Whether the model can run depends not only on chip compute power, but also on the maturity of the inference framework, operators, communication libraries, quantization formats, MoE expert parallelism, and long-context optimization.

Second, engineering experience. Enterprises need more than “it starts successfully”; they need stable services: multi-tenancy, rate limiting, monitoring, failure recovery, gray releases, log audit, and permission isolation all need to be built.

Third, ecosystem differences. The same model will not have identical performance, accuracy, quantization support, or deployment tools on NVIDIA, Ascend, Cambricon, and other platforms. Before launch, real stress testing is required instead of relying only on nominal compute.

Therefore, domestic chips are more suitable for enterprises with clear budgets, high compliance requirements, and willingness to invest in platform engineering. It is not the easiest route, but it may be the route that best fits long-term governance.

Consumer GPU Clusters: Better for Pilots and Small Teams

If the goal is to validate business value first, a consumer GPU cluster is easier to start with. GPUs such as RTX 4090, RTX 5090, RTX 3090, and RTX 3060 12GB have more community tools, quantized models, and local inference references, so trial-and-error cost is lower.

The consumer GPU route fits:

Internal pilots by R&D teams.
Knowledge-base Q&A for small and medium businesses.
Low-concurrency code assistants.
Offline document processing.
Internal tools without strict SLA requirements.

But it also has obvious limits:

VRAM is small, making it hard to host a full large model directly.
Multi-GPU communication is weak, and cross-machine communication is more troublesome.
Long-term full-load stability is weaker than server-grade solutions.
Chassis, power, cooling, drivers, and operations become hidden costs.
It is not suitable for promising enterprise-grade high availability from the start.

A more realistic approach is to first run Flash, distilled versions, quantized versions, or smaller models on consumer GPUs, get the business workflow working, and then decide whether to migrate to server GPUs or a domestic compute platform after call volume, quality, and data governance have been validated.

A Possible Deployment Architecture

A relatively stable enterprise private architecture can be divided into six layers:

Model layer: DeepSeek V4 Pro, V4 Flash, or smaller distilled models selected by task.
Inference layer: SGLang, vLLM, llama.cpp, vendor NPU inference stacks, or enterprise self-developed services.
Gateway layer: unified authentication, rate limiting, audit, model routing, and call logs.
Knowledge layer: vector database, full-text search, document parsing, permission filtering, and RAG.
Application layer: customer service, code assistants, document analysis, report Q&A, and Agent workflows.
Operations layer: monitoring, alerts, cost statistics, gray releases, rollback, and security audit.

The gateway layer and knowledge layer are the easiest to underestimate. Many projects fail not because the model is completely unusable, but because permissions, retrieval, logs, context management, prompt templates, and business workflows were not done well.

When deploying LLMs internally, enterprises should treat the model as infrastructure, not as an isolated chat page. The real value appears only when the model enters workflows and can stably process the enterprise’s own data and tasks.

Hardware Selection

Hardware selection should not only ask “can it run”; it should also ask “can it serve stably”.

You can choose by stage:

Validation Stage

The goal is to prove whether the business is worth doing.

Use 1-4 consumer GPUs.
Prioritize Flash, smaller models, distilled models, or quantized models.
Keep concurrency low and focus on task completion rate.
Do not promise high availability.

Do not buy large-scale hardware too early at this stage. First confirm whether employees actually use it, whether the business really saves time, and whether answers can enter real workflows.

Pilot Stage

The goal is to let one department or one business line use it steadily.

Use 4-16 GPUs or a set of domestic NPU nodes.
Add a unified gateway, logs, and permission controls.
Build RAG, document parsing, model routing, and caching.
Start tracking tokens, concurrency, latency, and failure rate.

At this stage, operations begin to matter. Model quality is only one part; stability, cost, and data governance are equally important.

Production Stage

The goal is to enter enterprise-grade service.

Use server GPUs, domestic compute clusters, or private-cloud resource pools.
Build multi-replica deployment, rate limiting, failover, and capacity planning.
Route models by task: simple tasks use lightweight models, complex tasks use Pro.
Connect to enterprise identity systems, audit systems, and security policies.

In production, it is not recommended to send every request to the strongest model. Proper model routing usually saves more money than simply adding hardware.

Choosing an Inference Framework

Models such as DeepSeek V4 have high requirements for inference frameworks. When MoE, long context, sparse attention, quantization, and multi-GPU parallelism are involved, framework maturity directly affects speed and stability.

Common choices can be understood this way:

SGLang: suitable for teams focused on high-performance inference, Agents, multi-turn tool calls, and complex service orchestration.
vLLM: mature ecosystem, suitable for general LLM services, but actual support depends on version and model adaptation progress.
llama.cpp: better for small models, quantized models, and edge deployment; not suitable for directly hosting a full very large MoE model.
Domestic NPU inference stacks: suitable for Xinchuang and domestic compute environments, but operator, quantization, and long-context support must be carefully verified.

Do not choose a framework only by benchmark. Enterprises should test their own real inputs: internal document length, concurrency, average output length, RAG hit rate, number of Agent tool calls, and retry count after failures.

Data Security Must Be Built Outside the Model

Private deployment does not automatically mean security. Running the model locally only solves part of the question of whether data leaves the enterprise.

You still need:

Accounts and permissions: different departments can only access their own knowledge bases.
Log audit: who asked what, which model was called, and which documents were accessed.
Data masking: customer information, ID numbers, phone numbers, contract amounts, and other sensitive fields must be handled.
Prompt security: prevent users from bypassing permissions or leaking system prompts through prompts.
Output review: important scenarios need human review or rule-based review.
Data lifecycle: uploaded documents, vector indexes, caches, and session records must be deletable.

Enterprise local LLM deployment cannot involve only the algorithm team. Security, legal, operations, and business owners should all participate; otherwise, risks will be exposed after launch.

Cost Is More Than GPUs

The cost of local deployment is often underestimated. Beyond GPUs or NPUs, you also need to count:

Servers, racks, power, cooling, and networking.
Storage and backup.
Inference framework adaptation and engineering development.
Operations monitoring and incident handling.
Model upgrades, rollback, and compatibility tests.
Security audit and permission systems.
Business-side prompts, RAG, and workflow construction.

If call volume is very low, external APIs may be cheaper. If call volume is high, data is sensitive, and workflows are stable, local deployment is more likely to amortize cost.

A more reasonable strategy is hybrid deployment:

Highly sensitive data goes to local models.
Low-sensitivity general tasks can use external APIs.
Simple tasks use small models.
Complex tasks use DeepSeek V4 Pro.
High-frequency tasks prioritize caching, retrieval, and model routing optimization.

Recommended Rollout Path

Enterprises can proceed in this order:

Choose 2-3 high-value scenarios first; do not roll out company-wide.
Use consumer GPUs or small-scale compute for a PoC.
Run Flash, distilled models, or quantized models first, and connect RAG and permissions.
Introduce Pro for comparison tests on complex tasks.
Record real call volume, latency, failure rate, and time saved by humans.
Then decide whether to purchase domestic chip clusters or server GPUs.
Before production, complete gateway, audit, monitoring, rate limiting, and rollback.

This path is more stable than buying a large cluster from the start. The biggest enterprise risk is not that the model is not strong enough, but that a lot of money is spent before the business workflow is ready to absorb the model capability.

Summary

DeepSeek V4 gives enterprises more room to imagine local private deployment, but it is not simply a “local ChatGPT”. The real difficulty is engineering: hardware, frameworks, model routing, permissions, RAG, audit, monitoring, and cost control all need to be considered together.

The domestic chip route better fits enterprises with high compliance requirements and long-term private cloud plans. Consumer GPU clusters are better for pilots and quick validation by small and medium teams. Pro fits complex reasoning and Agent tasks; Flash or smaller models fit many ordinary tasks.

If you only remember one sentence: DeepSeek V4 private deployment should not start with hardware procurement, but with business scenarios, data boundaries, and call volume. First get the scenario working, then decide whether to use a large model, how large it should be, and what compute platform to use.

References

How to Use DeepSeek V4 Pro in Cline

Fri, 01 May 2026 20:59:06 +0800

Cline already supports the OpenAI Compatible Provider. DeepSeek API is also compatible with OpenAI SDK-style calls, so connecting deepseek-v4-pro to Cline is not complicated: choose OpenAI Compatible, then fill in DeepSeek’s Base URL, API Key, and model name.

The steps below cover both the VS Code extension UI and Cline CLI.

Prepare a DeepSeek API Key

First, create an API Key on the DeepSeek platform.

You need three values:

Item	Value
Provider	`OpenAI Compatible`
Base URL	`https://api.deepseek.com`
Model ID	`deepseek-v4-pro`

DeepSeek’s official documentation states that the V4 series uses the existing OpenAI-compatible interface. Keep base_url as https://api.deepseek.com, and set model to deepseek-v4-pro or deepseek-v4-flash when calling it.

Configure It in the Cline Extension

If you use the Cline extension in VS Code, configure it this way:

Open Cline from the VS Code sidebar.
Go to Cline settings or model configuration.
Select OpenAI Compatible as the provider.
Enter your DeepSeek API Key.
Set Base URL to:

`1`	`https://api.deepseek.com`

Set Model ID to:

`1`	`deepseek-v4-pro`

Save the configuration and run a simple test in Cline.

Start with a low-risk read-only task:

`1`	`Please read the current project directory structure and summarize what type of project this is. Do not modify any files.`

If Cline can read and answer normally, the model connection is working.

Configure It in Cline CLI

If you use Cline CLI, run cline provider configure openai-compatible to enter interactive configuration.

Example:

`1`	`cline provider configure openai-compatible`

Fill in:

1
2
3

API Key: sk-...
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-pro

After configuration, test it with a read-only task:

`1`	`cline "Summarize this repository structure without changing files."`

If you want to lower cost first, you can temporarily change Model ID to:

`1`	`deepseek-v4-flash`

Then switch back to deepseek-v4-pro for complex planning, fact checking, multi-tool collaboration, or high-risk code changes.

Recommended Model Split

DeepSeek V4 Pro and Flash are better used with a clear split.

Model	Best for
`deepseek-v4-flash`	Routine code reading, small batch fixes, script generation, context summarization, low-risk frontend changes
`deepseek-v4-pro`	Architecture planning, complex bugs, cross-file refactors, fact checking, multi-tool calls, high-risk changes

For Agent tools like Cline, cost mainly comes from long context, repeated file reads, plan generation, and multi-round tool calls. If the task is light, use Flash for volume; if the task needs stronger judgment, switch to Pro.

How to Set Context Length

DeepSeek V4 Pro and Flash both support long context. If Cline requires a manual context window value, you can understand it according to the 1M context listed on DeepSeek’s official model page.

In practice, do not put every file into context at the beginning. Cline reads files according to the task, and a better workflow is usually:

first ask it to inspect the directory structure;
then ask it to locate relevant files;
finally let it modify only the target files.

This saves tokens and keeps the task boundary clearer.

Common Issues

1. Model Not Found

First check that Model ID is exactly:

`1`	`deepseek-v4-pro`

Do not write DeepSeek V4 Pro, deepseek-v4, or another display name.

2. 401 or Authentication Failed

Check the API Key:

whether it was copied completely;
whether it contains extra spaces;
whether it was entered into the provider configuration Cline is currently using;
whether the DeepSeek account has available balance.

3. Connection Failed

Check the Base URL:

`1`	`https://api.deepseek.com`

Do not append /v1/chat/completions at the end. Cline’s OpenAI Compatible Provider will construct compatible interface requests itself.

4. Cline Calls Are Too Expensive

You can switch routine tasks to deepseek-v4-flash and use deepseek-v4-pro only for complex tasks.

Also, make the task description as clear as possible:

`1`	`Only modify files related to the login page. Do not refactor unrelated modules. First provide a plan, and modify code only after confirmation.`

Agent tasks are most expensive when boundaries are unclear. The clearer the boundary, the fewer files it reads, the fewer tool calls it makes, and the more controllable the cost becomes.

5. Error: reasoning_content must be passed back

If you see an error like this:

{
  "message": "400 The `reasoning_content` in the thinking mode must be passed back to the API.",
  "code": "invalid_request_error",
  "modelId": "deepseek-v4-pro"
}

This is usually not a Key, quota, or Base URL problem. It means DeepSeek V4 Pro’s thinking mode and the current client’s multi-round tool-call history are not aligned.

DeepSeek’s official documentation states:

thinking mode is enabled by default;
thinking mode returns reasoning_content;
if a tool call happens in one round, subsequent requests must pass back the reasoning_content from that assistant message;
if the client does not pass it back correctly, the API returns 400.

When Cline connects through the OpenAI Compatible Provider, this error may appear in the second round or after tool calls if the current version does not fully preserve and return DeepSeek’s reasoning_content.

Try this order:

Upgrade Cline to the latest version;
confirm you are using OpenAI Compatible, not the normal OpenAI provider;
if Cline supports a custom request body, try disabling thinking mode:

{
  "thinking": {
    "type": "disabled"
  }
}

if Cline does not support extra body parameters, temporarily use another model or a compatible proxy service;
switch back to deepseek-v4-pro after Cline supports passing back DeepSeek V4 reasoning_content.

Note that disabling thinking mode may reduce complex reasoning ability, but it can work around client compatibility issues where reasoning_content is not passed back.

Copyable Configuration

Provider: OpenAI Compatible
API Key: sk-your DeepSeek API Key
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-pro

For low-cost mode:

Provider: OpenAI Compatible
API Key: sk-your DeepSeek API Key
Base URL: https://api.deepseek.com
Model ID: deepseek-v4-flash

Summary

There are only three key steps to calling DeepSeek V4 Pro in Cline:

choose OpenAI Compatible as the provider;
set Base URL to https://api.deepseek.com;
set Model ID to deepseek-v4-pro.

After configuration, test with a read-only task before giving it real code changes. If you often run Agent tasks, split Flash and Pro: Flash handles high-frequency lightweight work, while Pro handles complex judgment and fallback tasks.

References:

How DeepSeek V4 Price Cuts Rewrite the Cost Model for AI Agents

Fri, 01 May 2026 19:47:47 +0800

DeepSeek V4 did not arrive with an especially loud launch. There was no major event, nor a benchmark story that instantly crushed every competitor. But a few days later, the part that truly affects the industry became visible: repeated price cuts.

The point of this change is not that “the model got a little stronger”, but that “usage cost has been pushed into another tier”. When token prices become low enough that an ordinary Agent task can finish for a few cents or a couple of yuan, the business logic behind many Coding Plans and Token Plans needs to be reconsidered.

Launch Day Was Not Explosive

The first wave of feedback to DeepSeek V4 was not especially heated. Many people expected it to deliver the kind of shock R1 did: across-the-board benchmark leadership, validation of domestic compute, and simultaneous breakthroughs in multimodal and Agent capabilities. After the actual release, however, it looked more like a steady upgrade.

V4 Pro is indeed a strong model, especially in coding, math, long context, and agentic coding. But it is not the kind of product that instantly makes every peer model look outdated. So on launch day, the discussion felt a little awkward: people wanted to praise it, but it was hard to find a sufficiently explosive angle.

The real turning point was not launch day, but the price adjustments that followed.

Successive Price Cuts Are the Key

After DeepSeek V4 was released, prices started to move downward. According to DeepSeek’s official pricing page and the information summarized in the source article, the rough prices at that time were:

DeepSeek V4 Flash: about 1 yuan per 1 million input tokens; about 0.02 yuan per 1 million tokens after a cache hit;
DeepSeek V4 Pro: about 3 yuan per 1 million input tokens; about 0.025 yuan per 1 million tokens after a cache hit;
the cache-hit input price across the model family dropped to one tenth of the launch price;
V4 Pro was once in a 75% discount period, extended until May 31, 2026 at 23:59.

The API prices in US dollars make the difference easier to see:

Model	Cached input	Non-cached input	Output	Context
`deepseek-v4-flash`	$0.0028 / 1M tokens	$0.14 / 1M tokens	$0.28 / 1M tokens	1M
`deepseek-v4-pro` promotional price	$0.003625 / 1M tokens	$0.435 / 1M tokens	$0.87 / 1M tokens	1M
`deepseek-v4-pro` regular price	$0.0145 / 1M tokens	$1.74 / 1M tokens	$3.48 / 1M tokens	1M

Two details matter here.

First, V4 Pro’s $0.435 / $0.87 is a promotional price, not the long-term regular price. In DeepSeek’s official notes, this 75% discount was extended until May 31, 2026 at 15:59 UTC.

Second, cache-hit pricing is the key variable in the Agent cost model. Flash’s cached input price is as low as $0.0028 / 1M tokens, while Pro’s promotional cached input price is $0.003625 / 1M tokens. That means repeated project context, tool definitions, system prompts, and historical summaries no longer need to be charged at the full input price.

The most important thing about this pricing is that it makes the token cost of many tasks “insensitive”. In the past, developers worried that one Agent task would consume a large amount of context, repeatedly read and write code, and call tools frequently. Now, as long as the cache hit rate is high enough, the cost can be pushed very low.

Price Comparison With GPT and Claude

DeepSeek’s own prices alone do not fully convey the gap. The contrast becomes much clearer when placed next to common closed-source models from the same period.

Model	Input	Cached input	Output	Best fit
`deepseek-v4-flash`	$0.14 / M	$0.0028 / M	$0.28 / M	High-frequency Agents, routine coding, batch tasks
`deepseek-v4-pro` promotional price	$0.435 / M	$0.003625 / M	$0.87 / M	Complex coding, planning, fact checking
`deepseek-v4-pro` regular price	$1.74 / M	$0.0145 / M	$3.48 / M	Pro cost baseline after the promotion
GPT-5.5	$5 / M	$0.50 / M	$30 / M	High-quality complex tasks, general reasoning
GPT-5.4	$2.50 / M	$0.25 / M	$15 / M	Mid-range choice for programming and professional tasks
GPT-5.4 mini	$0.75 / M	$0.075 / M	$4.50 / M	Lower-cost general and subtask model
Claude Opus 4.7	$5 / M	$0.50 / M	$25 / M	High-quality writing, complex reasoning, long tasks
Claude Sonnet 4.6	$3 / M	$0.30 / M	$15 / M	Programming, Agents, general work
Claude Haiku 4.5	$1 / M	$0.10 / M	$5 / M	Lightweight tasks, summarization, classification

The most striking number in this table is output price. Agents do not only read context; they also keep generating plans, patches, explanations, logs, and next actions. If there is a lot of output, DeepSeek V4 Pro’s promotional $0.87 / M becomes dramatically cheaper than GPT-5.5’s $30 / M or Claude Sonnet 4.6’s $15 / M.

Even at V4 Pro’s regular output price of $3.48 / M, it is still clearly below GPT-5.4, GPT-5.5, and Claude Sonnet / Opus. If the task can be handled by Flash, the output price drops further to $0.28 / M.

The cached input gap is even more extreme. DeepSeek V4 Flash’s cached input price is $0.0028 / M, while GPT-5.5 and Claude Opus 4.7 are both $0.50 / M. These are not in the same order of magnitude. For Agents that repeatedly read the same code repository, this gap matters more than it does in ordinary chat.

Why Agent Tasks Are Especially Affected

AI Agents are different from ordinary chat. Ordinary chat is usually a question-and-answer flow with relatively limited input context. Agent tasks repeatedly read project files, generate plans, call tools, inspect results, and then modify code again.

These tasks have two traits:

large token consumption;
lots of repeated context.

The second point is crucial. In a code project, the model repeatedly reads the same files, directory structure, error logs, and modification results. If the platform supports cache hits, the cost of repeated input drops sharply.

The source article mentioned a real experience: connecting DeepSeek V4 Pro and Flash to a Claude Code-like tool, asking it to pull a prompt repository and turn it into a local search site. The task was completed, with a total cost of roughly a little over 0.8 yuan, and Pro reached a cache hit rate of 98.7%.

This example illustrates a practical issue: the more an Agent task resembles “repeated work around the same project”, the more valuable cache hits become. If generating a website, fixing a bug, or changing a frontend costs only a few cents to a few yuan, subscription plans become less attractive.

We can estimate the gap with a simplified task. Assume one coding agent task includes:

500,000 input tokens, of which 80% can hit cache;
50,000 output tokens;
no tool calls, search costs, or platform markup included, only model token cost.

The rough costs are:

Model	Estimated cost
DeepSeek V4 Flash	about $0.03
DeepSeek V4 Pro promotional price	about $0.09
DeepSeek V4 Pro regular price	about $0.36
GPT-5.4 mini	about $0.30
GPT-5.4	about $1.01
GPT-5.5	about $1.75
Claude Sonnet 4.6	about $1.11
Claude Opus 4.7	about $1.65

This estimate does not mean DeepSeek is better for every task. Model quality, tool-call stability, long-context retrieval ability, coding style, and factual reliability all need separate evaluation. But from a cost perspective, DeepSeek V4 pushes the marginal cost of “letting the Agent run a few more rounds” very low. That will encourage developers to design longer workflows, more frequent self-checks, and more candidate solutions instead of worrying about the token bill every time.

The Difference Between Coding Plans and Token Plans

Many AI products now offer two types of plans: Coding Plans and Token Plans.

The rough difference is:

Coding Plans are usually mainly for programming;
Token Plans usually cover more capabilities, such as STT, TTS, image generation, search, embedding, and RAG;
STT means speech to text;
TTS means text to speech;
Coding Plans often restrict users to programming scenarios, while other capabilities still require separate purchases.

From a business perspective, a Coding Plan is more like a buffet. Users pay a fixed fee in advance, while the vendor bets that most people will not use up the quota. Some users consume more, others consume less, and the platform can still make money on average.

But if pay-as-you-go token prices are low enough, users start calculating: why do I have to buy a plan? If the real monthly usage cost is only a few yuan or a dozen yuan, a 40-yuan or 200-yuan plan may no longer be worthwhile.

Why Price Cuts Challenge the Subscription Model

Subscription plans rely on one premise: users feel that each individual use is expensive, or they do not want to calculate the cost of every call. When token prices are high, a plan feels reassuring. When token prices are almost negligible, pay-as-you-go becomes more natural.

DeepSeek V4’s price cut effectively reveals the underlying cost:

Agent tasks can be very cheap;
long context is not necessarily too expensive to use;
cache hits can reduce cost significantly;
ordinary developers do not necessarily need a fixed subscription;
the model entry point can shift from a “plan platform” to a “low-cost API”.

This will make platforms built around Coding Plans uncomfortable. If users find pay-as-you-go calls cheaper and freer, they have less reason to be locked into one platform’s subscription.

How to Choose Between Flash and Pro

A practical way to use DeepSeek V4 is to split work between Flash and Pro.

Flash is suitable for high-frequency, lightweight, repeatable tasks:

fixing bugs;
writing frontend code;
writing scripts;
routine code understanding;
processing ordinary information in long context;
running large numbers of subtasks.

Flash is cheap, fast, and also supports very long context. For everyday coding agents, many tasks do not need Pro from the start.

Pro is better for complex judgment and fallback work:

multi-round planning;
complex Agent workflows;
multiple function calls;
fact checking;
financial research;
content production that requires stronger knowledge and judgment;
high-risk code changes.

A reasonable setup is: Flash handles volume, Pro handles fallback. Start ordinary tasks with Flash, then switch to Pro for long-horizon planning, complex judgment, fact checking, or multi-tool collaboration. This keeps cost under control while preserving model quality.

Why DeepSeek Can Price This Way

DeepSeek has a different business structure from many large platforms. It does not have e-commerce, social networking, short video, cloud computing, phones, cars, office suites, operating systems, browsers, or a large enterprise SaaS ecosystem.

That means it does not need to lock users into a complete platform. It can simply sell text model capability: use cheap text models here, and call any other capability elsewhere.

Large platforms usually think differently. If you buy their Coding Plan or Token Plan, you are pulled into their cloud, search, image generation, voice, database, and developer-tool ecosystem. The plan is not merely selling the model; it is competing for the user entry point.

DeepSeek’s approach is more direct: push text model prices down and try to become the default model entry point for Agents. Once the default entry point is occupied, many developers and toolchains will naturally adapt around it.

Open Models and the Default Entry Point

If DeepSeek V4 keeps an open model route, third-party cloud vendors and platforms may deploy it themselves and provide services. For DeepSeek, that is both distribution and potential diversion.

This is where a low-price official API matters. If the official price is already low enough, other platforms will struggle to offer an obvious price advantage even if they can deploy the model. Users will tend to use the default, cheap, stable entry point directly.

This is especially true for Agent tools. Agent tasks depend on long context, caching, tool calls, and stable throughput. Once a model is cheap enough in these scenarios, it has a chance to become the default option.

Coding Plans Are Still Not Useless

This does not mean Coding Plans will disappear immediately. They still fit some users.

If some users are truly heavy users who max out their quota every day, a fixed subscription may still be economical. Just like a buffet, if nobody could ever eat enough to get their money’s worth, users would not buy it.

The problem is that most users are not that kind of extremely high-frequency user. Low-frequency users, lightweight developers, and people who occasionally write scripts or modify projects are better suited to pay-as-you-go. After DeepSeek lowers pay-as-you-go costs, the appeal of plans weakens.

The future is more likely to become a layered choice:

heavy high-frequency users keep buying Coding Plans;
ordinary users move to low-cost APIs;
Agent tools automatically choose Flash / Pro according to the task;
platform plans need to provide more non-model value, such as workflows, IDE integration, deployment, team management, and security auditing.

Summary

DeepSeek V4 did not create its biggest impact through benchmarks. What truly changed industry expectations was the price reduction that followed.

When input tokens and cache-hit pricing are pushed very low, the cost of using AI Agents changes. Long context, code-project analysis, and multi-round tool calls that used to look expensive may now become everyday costs of a few cents to a few yuan.

This directly challenges the business logic of Coding Plans and Token Plans. If users can pay by usage, freely combine models and tools, and keep costs low enough, they may not want to be tied to a specific platform plan.

What DeepSeek V4 truly touches this time is not only the ranking of model capability, but the cost structure of AI Agents and the battle for the default entry point.

References:

free-claude-code: Connecting Claude Code to OpenRouter, DeepSeek, and Local Models Through a Proxy

Fri, 01 May 2026 03:41:49 +0800

free-claude-code is an Anthropic-compatible proxy for Claude Code.

Its idea is not to crack Claude Code, nor to provide an official free Claude service. Instead, it starts a local proxy service that looks like an Anthropic API, then forwards requests from Claude Code to other model backends. The README mentions backends such as NVIDIA NIM, OpenRouter, DeepSeek, LM Studio, llama.cpp, and Ollama.

In simple terms, it solves this problem: you like the terminal experience of Claude Code, but want to send model requests to another provider or a local model.

What Problem It Solves

Claude Code has an interaction model that works well for development tasks.

It can read code, edit files, run commands, and move tasks forward based on project context inside the terminal. But many users may not always want to use the same model backend:

They want to try different models on OpenRouter
They want to use models such as DeepSeek to reduce cost
They want to route requests to local Ollama
They want to run local models through LM Studio or llama.cpp
They want one proxy entry point in the development environment
They want to compare different models inside the Claude Code workflow

free-claude-code is positioned as a compatibility layer between Claude Code and these model services.

Claude Code still sends requests in an Anthropic-like style, while the proxy adapts those requests to different backends.

How It Works

You can think of it as three layers:

The frontend is Claude Code
The middle layer is the free-claude-code proxy
The backend is OpenRouter, DeepSeek, a local model, or another model service

Claude Code believes it is accessing an Anthropic-compatible API.

After the proxy receives a request, it selects a target provider according to configuration, transforms the necessary fields, and returns the response to Claude Code.

The benefit of this structure is that you do not need to modify Claude Code itself, and you do not need every model service to natively support Claude Code. As long as the proxy can align the interfaces, more models can be connected to the same workflow.

Supported Backends

The README lists these directions:

NVIDIA NIM
OpenRouter
DeepSeek
LM Studio
llama.cpp
Ollama

These backends represent different usage styles.

OpenRouter is more like a model aggregation entry point, useful for testing different commercial and open-source models.

DeepSeek is suitable for people who care about Chinese ability, coding ability, and cost.

LM Studio, llama.cpp, and Ollama are more local-model oriented. They are suitable for running models on your own machine or inside an intranet, reducing dependence on external APIs and making offline experiments easier.

NVIDIA NIM is more oriented toward enterprise and GPU inference deployment scenarios.

Why an Anthropic-Compatible Proxy

Claude Code was originally designed around Anthropic interfaces and model conventions.

If you want to connect it to other models, the most direct problem is interface mismatch:

Request fields differ
Model names differ
Streaming formats differ
Tool use is represented differently
Error response formats differ
Token and context limits differ

This is where the proxy layer is useful.

It keeps the interface seen by Claude Code close to the Anthropic shape, then adapts to the backend. For users, after configuring the proxy once, they can test different models inside the same Claude Code workflow.

Suitable Scenarios

free-claude-code is suitable for:

Using the Claude Code terminal workflow
Testing non-Anthropic models in Claude Code
Reducing model calling costs
Connecting Claude Code to OpenRouter
Connecting to compatible model services such as DeepSeek
Running local models through Ollama, LM Studio, or llama.cpp
Giving a team one unified model proxy entry point

If you only use official Claude Code normally and have no special needs around providers, cost, or local deployment, you may not need this type of proxy.

But if you often compare models, or want Claude Code to connect to local and third-party models, this type of tool is useful.

Difference from Directly Using OpenRouter or Ollama

Using OpenRouter, Ollama, or LM Studio directly usually means chatting with a model or calling it through an API.

The point of free-claude-code is not to replace those services, but to connect them to the Claude Code development workflow.

The difference is:

You still use the Claude Code terminal experience
AI can execute tasks around a code repository
The model backend can be changed to another provider
Local models can enter the Claude Code workflow
Configuration is centralized in the proxy layer instead of changed in each tool

So it is more like a bridge than a new chat client.

Notes About Local Models

Connecting Claude Code to local models is attractive, but there are real limitations.

First, model capability differs.

Claude Code tasks are usually not just chat. They include understanding code, planning modifications, editing files, and handling command output. Smaller local models may not complete these tasks reliably.

Second, context window matters.

Code tasks need a lot of context. If the model context is too small, it may fail to read full files, miss constraints, or lose background across multi-turn tasks.

Third, tool use compatibility matters.

Claude Code workflows depend on tool calls and structured behavior. Even if a backend model can chat, it may not follow tool-use protocols well.

Fourth, speed and hardware matter.

Local model speed depends on machine configuration, quantization, and model size. If code tasks respond too slowly, the experience drops noticeably.

So local models are better for experiments, low-risk tasks, and specific scenarios. For truly complex coding tasks, choose carefully according to model capability.

Usage Boundaries

Projects like this are easy to misunderstand from the title, so the boundaries should be clear.

First, it is not an official free Claude Code quota.

It only forwards Claude Code requests to other model backends. When using OpenRouter, DeepSeek, NVIDIA NIM, or other APIs, you still need to follow the pricing, quotas, and terms of the corresponding services.

Second, it is not a tool for bypassing authorization.

When using any proxy tool, you should follow the licenses and terms of Claude Code, model providers, and the project itself. Do not interpret it as a way to avoid official restrictions.

Third, the proxy handles your request content.

Code, command output, and project context may pass through the proxy and backend services. When deploying, consider logs, keys, network boundaries, and privacy. For company code or sensitive projects, use a controlled environment.

Fourth, model performance varies greatly.

The same Claude Code operation may behave very differently after switching models. Do not assume every model can replace Claude.

Relationship with Proxies Such as LiteLLM

Conceptually, free-claude-code belongs to the category of compatible interface proxies.

The shared goal of such tools is to reduce coupling between upper-level applications and lower-level model services. The upper-level application faces a relatively unified interface, while backend providers can be switched by configuration.

Different projects focus on different areas. Some are general model gateways, some focus on OpenAI-compatible APIs, and some specifically adapt tools such as Claude Code.

What makes free-claude-code worth noting is that it puts Claude Code directly at the center, rather than building a generic chat proxy.

Suitable Users

It is better suited to users who are comfortable tinkering:

Familiar with Claude Code
Know how to configure API keys and model providers
Understand proxy service startup and environment variables
Can troubleshoot network, port, model name, and streaming issues
Want to compare different models on coding tasks

If you only want something that works out of the box, the official configuration is usually simpler.

If you are willing to set up a proxy, switch models, tune parameters, and let Claude Code enter more model environments, this project is worth studying.

Reference

Alishahryar1/free-claude-code

Final Thought

The value of free-claude-code is not in the word “free,” but in the bridge it builds between Claude Code and more model backends.

When you want to keep the Claude Code development experience while testing OpenRouter, DeepSeek, local models, or enterprise inference services, an Anthropic-compatible proxy like this becomes useful.

DeepSeek V4 Pro vs GPT-5.5: After Testing Frontend, Writing, and Coding, the Gap Feels Bigger Than Expected

Sat, 25 Apr 2026 11:12:00 +0800

Comparisons between DeepSeek V4 Pro and GPT-5.5 are getting more attention lately. The reason is no longer whether either model is usable. The real question is: when the work lands in frontend development, writing, and coding, which one is better suited to be your main tool?

When people compare models like this, they often start by asking which one is stronger.
But the more useful question is usually different: in a real task, which one is steadier, cheaper to communicate with, and more likely to produce something you can keep building on immediately?

If we simplify the conclusion first, it roughly looks like this:

When you want more balanced output and a more complete productized experience, many people still look at GPT-5.5 first
When you need high-frequency iteration in Chinese, care more about cost, and want fast response cycles, DeepSeek V4 Pro becomes a serious candidate
What really determines the experience is often not the model name itself, but the task type, the prompting approach, and whether you need to keep revising afterward

Let’s break this down through the three most common comparison scenarios.

1. Frontend tasks: the real question is not whether it can build a page, but whether it can keep improving it

Frontend work looks ideal for model comparisons because the result is easy to see.
Can the page run? Does it look good? Is the structure clean? You can judge all of that quickly.

But the real difference usually does not appear in whether the first draft works. It shows up in questions like these:

Is the structure clear enough?
Is the component split natural?
Does changing one part accidentally break another?
Can it keep following the same implementation logic across multiple rounds of instructions?

That is also why many frontend demos that look impressive in the first round do not necessarily stay ahead in real workflows.

If your task is something like:

Quickly generate a runnable page prototype
Draft a landing page idea
Fill in required styles, buttons, cards, forms, and other basic elements

then both models will often get you fairly close, and the difference is more about output style.

But if the task becomes:

Repeatedly revising the UI over multiple rounds
Reading existing code and continuing from there
Balancing component structure, style consistency, and maintainability
Gradually turning a static page into real project code

then what you should watch is no longer “who looks better in round one,” but “who is less likely to drift off by round five.”

So in frontend work, the key comparison is not whether the model can generate a page. It is whether, after you keep adding constraints, it can still maintain stable structure, consistent naming, and manageable modification costs.

2. Writing tasks: the real difference is not how much it writes, but how stable the style stays and how well rewrites go

Writing is another area where people can misjudge models very easily.

A big reason is that first drafts often look fine from both sides.
The structure is complete, the paragraphs are there, and the tone is smooth enough that it is easy to think they are basically similar.

But as soon as you push the task one step further, the differences show up:

Can it accurately understand your intended audience?
Can it switch tone while staying on the same topic?
Does it lose key points when rewriting?
Does it stay stable when compressing, expanding, retitling, or restructuring?

The biggest problem in writing is usually not “it cannot write,” but “it wrote something that still needs a lot of fixing.”

So when comparing DeepSeek V4 Pro and GPT-5.5, the more useful method is not to ask each to write one article. It is to run several rounds like this:

Write the first draft
Rewrite it in a different tone
Compress it into a shorter version
Rework it into something better suited for click-driven headlines or search distribution

If a model can keep the key points intact, the wording stable, and the structure clean through those rounds, then it has much more value in a real writing workflow.

In other words, what writing tasks really measure is not “literary flair,” but revision ability, instruction following, and the feeling of continuous collaboration.

3. Coding tasks: the real gap shows up in long-chain stability

Coding tasks expose a model’s real level more easily than frontend work, because they are not just about generating output. They have to connect with reality.

Very quickly, you run into questions like:

Can it understand an existing project structure?
Can it modify multiple files at once?
Does it introduce new problems after making changes?
Can it keep debugging by following logs and errors?
After several rounds, does it still remember what it already changed?

In this kind of work, what users care about most is usually not whether a single code snippet looks elegant. It is: can this model keep moving the task forward, instead of leaving me to clean up the mess?

So when comparing DeepSeek V4 Pro and GPT-5.5, the most meaningful thing to look at is usually not isolated coding prompts, but a process closer to real work:

Read an existing repository
Find a bug
Modify several related files
Continue fixing based on error messages
Summarize the result clearly at the end

Once the task enters that kind of continuous workflow, context retention, execution habits, explanation quality, and rework rate all matter more than single-turn answer quality.

That is also why many users eventually do not settle on “using only one model forever” for coding. Instead, they switch their main tool depending on the stage of the task.

4. What is really worth comparing is not who wins, but which tasks are more cost-effective to assign to whom

If you put DeepSeek V4 Pro and GPT-5.5 side by side and only try to pick one overall champion, the result is usually an empty conclusion.

That is because real tasks are not one standard exam:

Some are one-off generation
Some are multi-round collaboration
Some are Chinese writing
Some are engineering changes
Some prioritize speed
Some prioritize stability
Some prioritize cost

So the approach that is closer to real usage is usually to divide by task goal:

If you want a more complete overall experience, more mature interaction, and steadier general output, try GPT-5.5 first
If you want high-frequency experimentation in Chinese, fast iteration, and better efficiency for the money, DeepSeek V4 Pro deserves a serious place in your workflow
If the task itself is long-chain, multi-round, and collaborative, do not stop at the first result—look at who stays steadier after five rounds

In other words, the real question is not “who is absolutely stronger,” but this:
for frontend work, writing, and coding, which model feels more like the most practical tool for your current stage?

5. How to run a comparison that actually means something

If you want to test DeepSeek V4 Pro and GPT-5.5 yourself, a more reliable method is usually not to run a single round, but to do something like this:

Give both models the same initial requirement
Keep the same constraints on both sides
Continue asking follow-up questions for three to five rounds
Record output quality, drift frequency, and rework amount
Only then compare speed, cost, and final usability

That kind of test will get you much closer to real work than simply asking who looks more impressive in the first round.

Especially in frontend, writing, and coding, what often determines the actual experience is not the starting line, but who can stay with you and help finish the work.

6. A simple way to remember it

If you just want a practical summary, you can remember it like this:

GPT-5.5: more like a broad, productized, mainstream default workspace
DeepSeek V4 Pro: more like a strong competitor worth bringing into daily workflows in Chinese and in high-frequency trial-and-error work
The real comparison point: not flashy first-round output, but who stays steadier and saves more effort after multiple rounds of revision

So in this kind of comparison, what really matters is never just “who won.” It is this:
for your frontend, writing, and coding tasks, which model makes continuous progress easier, reduces rework, and gives you more stable output?

DeepSeek-V4 Preview Released: 1M Context, Two Models, and API Migration Notes

Fri, 24 Apr 2026 22:39:46 +0800

DeepSeek released DeepSeek V4 Preview Release on 2026-04-24. Based on the official announcement page, the update is centered on a few very clear themes: 1M context, a two-model lineup with V4-Pro and V4-Flash, dedicated optimization for agent scenarios, and API-side model migration.

If we reduce the release to one sentence, the main signal is this: DeepSeek is not just trying to make a stronger model. It is pushing ultra-long context and agent capabilities toward something that is ready for practical deployment.

1. What was released this time

According to the official page, DeepSeek-V4 Preview mainly includes two product lines:

DeepSeek-V4-Pro
DeepSeek-V4-Flash

The official descriptions are also very direct:

DeepSeek-V4-Pro: 1.6T total / 49B active params
DeepSeek-V4-Flash: 284B total / 13B active params

The naming already makes the strategy clear. This is not a single-model upgrade. DeepSeek is launching a higher-end model and a more cost-efficient model at the same time.

V4-Pro is positioned around performance ceiling, with DeepSeek saying it can compete with the world’s top closed-source models. V4-Flash, by contrast, is positioned around speed, efficiency, and lower cost, making it more suitable for workloads that care more about latency and API pricing.

2. `1M context` is the most visible headline

One of the most prominent lines on the official page is: “Welcome to the era of cost-effective 1M context length.”

DeepSeek is not merely saying the model supports long context. It is presenting 1M context as a default capability of this generation. The page is explicit that:

1M context is now the default standard across official DeepSeek services
Both V4-Pro and V4-Flash support 1M context

The importance of this is not just that you can fit more tokens. It directly affects tasks like:

Understanding large codebases
Long-document Q&A and information synthesis
Multi-turn agent workflows
Complex tasks spanning multiple files, tools, and stages

When the context window is large enough, the model is less likely to lose context midway and re-read material repeatedly. That matters a lot for agentic coding and complex knowledge work.

3. What `V4-Pro` is mainly emphasizing

From the wording on the official page, DeepSeek-V4-Pro focuses on three things:

Agentic coding capability
World knowledge
Reasoning ability

The page says V4-Pro reaches open-source SOTA on agentic coding benchmarks. It also claims leadership among current open models in world knowledge, trailing only Gemini-3.1-Pro, and states that its math, STEM, and coding performance surpasses current open models while rivaling top closed-source models.

In other words, V4-Pro is not positioned as a simple question-answering model. It is aimed much more at high-difficulty reasoning, complex coding, and long-horizon task execution.

4. `V4-Flash` is not just a cut-down version

Another notable point is that DeepSeek does not present V4-Flash as a low-end model. Instead, it stresses that the model is already strong enough for many practical tasks.

According to the announcement, V4-Flash:

Has reasoning ability that comes close to V4-Pro
Performs on par with V4-Pro on simple agent tasks
Uses fewer parameters, responds faster, and is more economical for API usage

That means the lineup is not a very split “one flagship, one entry-level” structure. It is closer to:

V4-Pro: optimize for higher performance and a stronger ceiling
V4-Flash: optimize for lower latency and better cost efficiency

For developers, that is often a more practical combination, because many production tasks do not need the absolute strongest model in theory. They need something strong enough, fast enough, and affordable enough.

5. The release puts clear emphasis on agent optimization

Another strong signal from the announcement page is that DeepSeek is actively pushing V4 toward agent use cases.

The page says DeepSeek-V4 has been seamlessly integrated with several leading AI agents, including:

Claude Code
OpenClaw
OpenCode

DeepSeek also says that V4 is already being used in its in-house agentic coding workflows.

That means the target is no longer limited to chat or ordinary completion. The model is being positioned for longer workflows: reading code, understanding structure, calling tools, generating outputs, and connecting the whole process together.

If you have been paying attention to coding agents recently, this is worth noticing. Model providers are no longer only competing on benchmarks. They are also competing on whether the model can actually plug into real workflows.

6. Structural innovation is serving long context efficiency

On the technical side, the page summarizes this release’s structural work as:

token-wise compression
DSA (DeepSeek Sparse Attention)

The direction is clear: make long context cheaper and more efficient while reducing compute and memory cost as much as possible.

The announcement page does not go into full technical detail, but it at least suggests that DeepSeek is not relying only on brute-force scaling to support longer windows. It is also making architecture-level optimizations specifically for long-context efficiency.

For actual users, that often matters more than just seeing a bigger context number, because real usability depends on more than whether 1M is technically available. It also depends on:

Whether speed stays acceptable
Whether cost stays acceptable
Whether long-context tasks remain stable in practice

7. The API is already available, but model migration matters

The official page clearly states that the API is available today.

The migration path is also relatively simple:

Keep the same base_url
Switch the model name to deepseek-v4-pro or deepseek-v4-flash

The page also says both models support:

1M context
Dual Thinking / Non-Thinking modes
OpenAI ChatCompletions
Anthropic APIs

That means if you already use the DeepSeek API, the upgrade path is not especially difficult. The main work is updating model names and validating behavior.

8. The retirement schedule for old models is explicit

For developers, one of the most important details on the page is actually the retirement notice for older models.

DeepSeek explicitly says:

deepseek-chat
deepseek-reasoner

will be fully retired and inaccessible after July 24, 2026, 15:59 UTC.

The page also notes that these two models are currently being routed to the non-thinking and thinking modes of deepseek-v4-flash.

That means if your project still directly references deepseek-chat or deepseek-reasoner, now is the time to plan the migration instead of waiting until the formal shutdown date gets close.

9. How this release is worth reading

If we compress the update into a few main takeaways, they look like this:

DeepSeek is turning 1M context from a premium feature into a default standard
The two-model strategy is clearer: one targets performance ceiling, one targets speed and cost efficiency
Agent capability has been moved into a very central role
The API upgrade path is relatively direct, but the old-model retirement timeline needs attention soon

For general users, the most visible change may be that long documents, long code contexts, and long workflows become easier to fit into one session.
For developers, the more important point is that if you are already building agents, coding assistants, knowledge workflows, or complex automation pipelines, this generation is very clearly designed for those scenarios.

This is not just a routine model update from DeepSeek. It reads more like a clearer statement of its next product direction: ultra-long context, agent optimization, and more practical API readiness.

DeepSeek official news page: https://api-docs.deepseek.com/news/news260424
Tech Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
Open Weights: https://huggingface.co/collections/deepseek-ai/deepseek-v4