Anthropic updated the Claude Platform release notes on June 26, 2026: Claude API rate limits have been raised, and Claude Sonnet and Claude Haiku limits are now aligned with Claude Opus across usage tiers. The old usage tiers have also been consolidated into three simpler levels: Start, Build, and Scale.
According to Anthropic, most organizations will move into a higher tier, no organization will receive lower limits than before, and developers do not need to take manual action. You can check your tier and current limits in Claude Console.
Release notes:
https://platform.claude.com/docs/en/release-notes/overview
Rate limits documentation:
https://platform.claude.com/docs/en/api/rate-limits
Rate Limits API documentation:
https://platform.claude.com/docs/en/api/rate-limits-api
What matters in this update
If you only use the Claude API occasionally for scripts or small tools, you may not notice the change right away. But if you run Claude Code, AI Agents, batch summarization, RAG Q&A, or backend queues, this update is worth checking.
The change can be summarized in three points:
- Overall Claude API limits have increased.
- Sonnet and Haiku limits now align with Opus within each usage tier.
- Usage tiers have been simplified to Start, Build, and Scale.
What does that mean in practice? In the past, some developers had to double-check model-specific limits when switching between Opus, Sonnet, and Haiku. The new tier structure and aligned model limits should be easier to reason about for multi-model apps, Agent products, and internal platforms.
But this does not mean you can run with unlimited concurrency. Claude API requests are still limited by request count, input tokens, output tokens, and how quickly traffic ramps up.
Why developers should care
For many Claude API integrations, the real problem is not whether the model can answer. It is suddenly hitting 429 errors after going live.
Common examples include:
- A local script sends hundreds of files to Claude for summarization.
- An Agent app runs many tool calls and long-context requests at the same time.
- A RAG system packs retrieval results, chat history, and system prompts into one prompt.
- A backend queue consumes jobs too quickly and burns through token capacity in minutes.
- Failed requests trigger automatic retries, making congestion worse.
The higher limits will help some workloads run more smoothly. But if your app can amplify requests, as in the examples above, rate limiting, queues, and retry strategy still matter.
How to think about Start, Build, and Scale
The new usage tiers have three levels:
| Tier | A practical fit |
|---|---|
| Start | Individual developers, small scripts, early prototypes |
| Build | Apps with steady traffic, team-internal tools |
| Scale | Production workloads, high-concurrency Agents, batch jobs, enterprise integrations |
Do not copy quota numbers from an article. Use Claude Console and the official docs as the source of truth. Anthropic limits can vary by account, organization, workspace, model, and product policy.
In plain terms: if you only write the occasional script, the main thing is not to crank concurrency too high. If you are building a real product, treat Claude as an external service that needs capacity planning, not just a normal function call.
You still need to track RPM, ITPM, and OTPM
Claude API rate limits are not just “requests per minute.” The docs commonly use three metrics:
| Metric | Meaning | Common failure pattern |
|---|---|---|
| RPM | requests per minute | Too many small requests, high concurrency, excessive automatic retries |
| ITPM | input tokens per minute | Long prompts, large context windows, too many RAG results |
| OTPM | output tokens per minute | Oversized max_tokens, batch generation of long articles or code |
Many 429 errors happen because token volume is high, not because the request count is high. For example, you might send only 10 requests per minute, but if each request carries hundreds of thousands of input tokens, you may hit ITPM first. Conversely, if prompts are short but the model generates long reports in bulk, you may hit OTPM first.
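The arithmetic in that example can be sketched in a few lines. The snippet below estimates what fraction of each limit a steady workload would consume and reports which one is reached first. The limit values are made-up placeholders, not real quotas, and the names `Limits`, `utilization`, and `binding_limit` are hypothetical helpers for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Per-minute limits. The values used below are placeholders, not real quotas."""
    rpm: int   # requests per minute
    itpm: int  # input tokens per minute
    otpm: int  # output tokens per minute


def utilization(limits: Limits, requests_per_min: float,
                avg_input_tokens: float, avg_output_tokens: float) -> dict[str, float]:
    """Return the fraction of each limit that a steady workload would consume."""
    return {
        "RPM": requests_per_min / limits.rpm,
        "ITPM": requests_per_min * avg_input_tokens / limits.itpm,
        "OTPM": requests_per_min * avg_output_tokens / limits.otpm,
    }


def binding_limit(usage: dict[str, float]) -> str:
    """Name of the metric that is closest to, or furthest past, its limit."""
    return max(usage, key=usage.get)


# Hypothetical limits for illustration only; read real ones from Claude Console.
example_limits = Limits(rpm=1_000, itpm=1_000_000, otpm=200_000)

# Few requests, very long prompts: 10 requests/min with 200k input tokens each.
long_context = utilization(example_limits, 10, 200_000, 500)

# Short prompts, long generated reports: 50 requests/min with 8k output tokens each.
bulk_reports = utilization(example_limits, 50, 1_000, 8_000)

print(binding_limit(long_context), binding_limit(bulk_reports))
```

With these placeholder numbers the long-context workload sits at 1% of RPM but 200% of ITPM, which is why counting requests alone can be misleading.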
So do not only count API calls when debugging. At minimum, log the model name, workspace, input tokens, output tokens, response status, and retry count.
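One possible shape for such a log line is a small structured record serialized as JSON. This is a minimal sketch: the field names, the `CallRecord` class, and the logger name are assumptions for illustration, and how you obtain the token counts depends on the client you use.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("claude_calls")  # hypothetical logger name


@dataclass
class CallRecord:
    """One Claude API call, reduced to the fields worth logging."""
    model: str
    workspace: str
    input_tokens: int
    output_tokens: int
    status: int       # HTTP status of the final response
    retry_count: int  # retries performed before that response


def format_record(record: CallRecord) -> str:
    """Serialize the record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def log_call(record: CallRecord) -> None:
    """Log rate-limited calls at WARNING so they stand out in dashboards."""
    level = logging.WARNING if record.status == 429 else logging.INFO
    logger.log(level, format_record(record))


# Example values are invented for the demonstration.
log_call(CallRecord("example-model", "batch-workspace", 1200, 300, 429, 2))
```

Keeping every call on one JSON line makes it straightforward to aggregate tokens per minute by model or workspace later.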
Agents and batch jobs benefit most
The raised limits help normal chat-style requests too, but Agent and batch workloads are likely to feel the difference more.
A single “user request” in an Agent app may hide a chain of Claude API calls:
- Read files.
- Summarize context.
- Call tools.
- Inspect tool results.
- Plan the next step.
- Produce the final answer.
If several users run this at the same time, or if backend batch jobs are also running, token usage rises quickly. The higher limits give these workloads more room, and model switching should be smoother. Still, production systems should separate lanes: online requests get a low-latency path, batch jobs go through queues, and long-running tasks get their own concurrency limits.
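The lane separation described above could be sketched with one semaphore per lane, so that batch work cannot take concurrency slots away from online requests. The lane names and sizes below are hypothetical, and `fake_call` merely stands in for a real Claude API request.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Hypothetical lane sizes; tune them against your own limits.
LANE_CONCURRENCY = {"online": 8, "batch": 2, "long_running": 1}


class Lanes:
    """One semaphore per lane, each with its own concurrency cap."""

    def __init__(self, concurrency: dict[str, int]) -> None:
        self._concurrency = dict(concurrency)
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def _semaphore(self, lane: str) -> asyncio.Semaphore:
        # Created lazily so the semaphore is built inside the running event loop.
        if lane not in self._semaphores:
            self._semaphores[lane] = asyncio.Semaphore(self._concurrency[lane])
        return self._semaphores[lane]

    async def run(self, lane: str, call: Callable[[], Awaitable[T]]) -> T:
        """Run `call` once a slot in the given lane is free."""
        async with self._semaphore(lane):
            return await call()


async def demo_peak_concurrency() -> int:
    """Push six jobs through the batch lane and report the peak in flight."""
    lanes = Lanes(LANE_CONCURRENCY)
    active = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a Claude API request
        active -= 1

    await asyncio.gather(*(lanes.run("batch", fake_call) for _ in range(6)))
    return peak
```

A semaphore only caps concurrency. It does not cap tokens per minute, so it complements rather than replaces the token accounting discussed earlier.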
Do not blame only the model for 429
When you hit 429, do not immediately switch models or raise retries to the maximum. A more useful debugging order is:
- Read the error message and confirm whether it is rate limit, quota, or another restriction.
- Check response headers such as limit, remaining, and reset.
- Calculate recent RPM, ITPM, and OTPM.
- Check whether frontend, backend, queue, and SDK layers are all retrying.
- Check whether backend tasks and user requests share the same organization or workspace.
- Check whether a recent traffic spike triggered acceleration limits.
Anthropic’s docs also mention acceleration limits for sudden traffic growth. In other words, even if the average request volume looks reasonable, a sharp ramp-up can still trigger limiting.
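For the header check in the list above, a small helper can group the limit, remaining, and reset values by resource. This is a sketch that assumes the headers follow an `anthropic-ratelimit-<resource>-<field>` naming pattern plus a standard `retry-after` header. Confirm the exact names against the rate limits documentation before relying on them.

```python
from typing import Mapping, Optional

# Assumed header prefix; verify the exact names in the rate limits documentation.
PREFIX = "anthropic-ratelimit-"
FIELDS = {"limit", "remaining", "reset"}


def summarize_rate_limit_headers(headers: Mapping[str, str]) -> dict[str, dict[str, str]]:
    """Group rate limit headers by resource, e.g. {"requests": {"remaining": "0"}}."""
    summary: dict[str, dict[str, str]] = {}
    for name, value in headers.items():
        key = name.lower()
        if not key.startswith(PREFIX):
            continue
        # Split "input-tokens-remaining" into ("input-tokens", "remaining").
        resource, _, field = key[len(PREFIX):].rpartition("-")
        if resource and field in FIELDS:
            summary.setdefault(resource, {})[field] = value
    return summary


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read Retry-After when it is given as a number of seconds."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return float(value)
            except ValueError:
                return None  # the HTTP-date form is not handled in this sketch
    return None
```

Logging this summary alongside each 429 shows whether requests, input tokens, or output tokens ran out, which is usually faster than guessing from call counts.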
When launching a new feature, ramp gradually. For example, enable it for 5% of users first, then watch 429, latency, token usage, and cost curves before sending all traffic to the Claude API.
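A gradual ramp like this is often implemented with deterministic hash bucketing, so the same user stays in or out of the rollout as the percentage grows. The sketch below is one way to do it; the function name and the feature key are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: float, feature: str = "claude-feature") -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Start at 5%, then raise the percentage once 429 rate, latency,
# token usage, and cost all look healthy.
enabled = in_rollout("user-123", percent=5)
```

Because the bucket depends only on the feature key and user ID, raising the percentage from 5 to 25 keeps every already-enabled user enabled and only adds new ones.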
Rate Limits API belongs in monitoring
Anthropic also provides a Rate Limits API for querying organization and workspace limit configuration. It is useful for internal monitoring, admin dashboards, and operations scripts.
It can help with:
- Confirming workspace limits before deployment.
- Showing available capacity to different business lines.
- Explaining why staging works but production gets 429.
- Adjusting queue concurrency based on current limits.
- Creating capacity alerts before users report failures.
But it should not replace application-level throttling. Your service still needs queues, concurrency caps, exponential backoff, and maximum retry limits.
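As an illustration of adjusting queue concurrency to current limits, the sketch below derives a worker count from limit values and a typical request shape. It deliberately takes the limits as plain numbers: they might come from Claude Console or from the Rate Limits API, but no API call is shown here because its request and response shape is not covered in this article. The function name, the headroom factor, and all example numbers are assumptions.

```python
import math


def max_concurrency(rpm: int, itpm: int, otpm: int,
                    avg_input_tokens: int, avg_output_tokens: int,
                    avg_latency_s: float, headroom: float = 0.7) -> int:
    """Estimate a worker count that stays under `headroom` of the tightest limit."""
    # Requests per minute that each limit would allow for this request shape.
    allowed_rpm = min(
        rpm,
        itpm / max(avg_input_tokens, 1),
        otpm / max(avg_output_tokens, 1),
    ) * headroom
    # Little's law: requests in flight = arrival rate per second * latency.
    workers = math.floor(allowed_rpm / 60 * avg_latency_s)
    return max(1, workers)


# Placeholder limits and request shape, not real quotas.
workers = max_concurrency(rpm=600, itpm=600_000, otpm=120_000,
                          avg_input_tokens=2_000, avg_output_tokens=500,
                          avg_latency_s=10.0, headroom=0.5)
print(workers)
```

The result is only a starting point. Real traffic is bursty, so the queue should still back off when it sees 429 responses.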
What to change now
If you already use the Claude API, start with a few practical checks:
- Open Claude Console and confirm whether your tier is now Start, Build, or Scale.
- Check current rate limits for the models you actually use. Do not rely on old screenshots or memory.
- Make concurrency, requests per minute, and maximum output tokens configurable.
- Put batch jobs behind a queue instead of hammering the API in a raw for loop.
- Use exponential backoff for 429, with a maximum retry count.
- Log input tokens, output tokens, model name, workspace, and request latency.
- If you reuse long context, evaluate prompt caching, but do not treat caching as completely limit-free.
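The backoff item in this checklist can be sketched as follows. This is a minimal version using exponential backoff with full jitter and a hard retry cap. `RateLimited` is a hypothetical exception that your own client wrapper would raise on HTTP 429, and the default base, cap, and retry count are placeholders you would make configurable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical exception a client wrapper raises on HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited")
        self.retry_after = retry_after


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(call: Callable[[], T], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimited, with at most `max_retries` retries."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # give up instead of retrying forever
            delay = backoff_delay(attempt)
            if exc.retry_after is not None:
                delay = max(delay, exc.retry_after)  # honor the server hint
            sleep(delay)
    raise AssertionError("unreachable")
```

If an SDK or another layer in your stack already retries, apply this in only one place. Stacked retries multiply traffic at exactly the moment the API is asking you to slow down.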
This update is clearly positive: Claude API has more capacity, and usage tiers are easier to understand. For developers, the right move is not to “send everything harder.” It is to use the extra room to clean up your call chain, monitoring, and retry strategy. That is how higher limits become stability instead of just a faster path to the next bottleneck.