Claude Fable 5 Returns Globally: How Anthropic Addressed the Model Jailbreak Controversy

Anthropic has restored global access to Claude Fable 5 and detailed a new safety classifier, a four-part jailbreak severity framework, and expanded cooperation with the US government.

Anthropic has announced that US export controls on Claude Fable 5 and Claude Mythos 5 have been lifted. Fable 5 returned to users worldwide on July 1, 2026, across the Claude Platform, Claude.ai, Claude Code, and Claude Cowork. Access through AWS, Google Cloud, and Microsoft Foundry will also be restored in stages.

This redeployment is about more than bringing a model back online. Over the previous three weeks, Fable 5 was launched, reported to have a safeguard bypass, suspended globally, and then redeployed with updated protections. Anthropic also proposed an industry framework for evaluating the severity of AI model jailbreaks, aiming to help vendors and regulators distinguish levels of risk instead of treating every guardrail bypass as the same kind of incident.

From launch to suspension: what happened

Fable 5 and Mythos 5 launched on June 9. They use the same underlying model but serve different purposes:

  • Fable 5 includes stricter safety protections and is available to general users.
  • Mythos 5 has fewer restrictions and is limited to vetted cybersecurity partners in Project Glasswing for defensive research.

On June 12, the US government learned of a report from Amazon researchers that demonstrated a method for bypassing Fable 5’s safeguards. Prompted in a particular way, the model identified several software vulnerabilities and, in one case, generated code showing how the vulnerability could be exploited. The US government then imposed export controls on Fable 5 and Mythos 5, requiring Anthropic to restrict access by foreign nationals.

Because the directive took effect immediately and Anthropic had no reliable way to verify users’ nationality in real time, the company suspended access to both models for everyone.

Anthropic’s subsequent testing found that the capabilities described in the report were not unique to Fable 5. Less capable models, including Claude Opus 4.8, GPT-5.5, and Kimi K2.7, could identify the same vulnerabilities, and several publicly available models could also produce the exploit demonstration for the one vulnerability in question. The company concluded that the method reached into the deliberate “safety margin” of Fable 5’s protection system but did not unlock unique Mythos-level offensive capabilities.

The new classifier blocks the reported technique in more than 99% of tests

Even though Anthropic characterized the incident as a borderline case, it trained a new safety classifier specifically to address the method described in the report.

Classifiers are small automated detection systems that run during model interactions to identify potentially harmful cybersecurity requests or outputs. When the classifier triggers, Fable 5 stops responding, the user is notified, and the original request is routed to Opus 4.8 instead.
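The routing behavior described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: the `respond` function, the `Reply` type, and the toy classifier and models are all hypothetical names invented here, and a real system would presumably interrupt a streaming response rather than check a finished draft.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    model: str             # label of the model that produced the text
    text: str
    notice: Optional[str]  # user-facing message, set only when a fallback occurred


def respond(
    prompt: str,
    is_flagged: Callable[[str], bool],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Reply:
    """Answer with the primary model unless the classifier flags the exchange."""
    notice = "This request was flagged by a safety classifier and answered by a fallback model."
    # Check the incoming request first.
    if is_flagged(prompt):
        return Reply("fallback", fallback(prompt), notice)
    draft = primary(prompt)
    # Check the output as well; on a hit, discard the draft and reroute the original request.
    if is_flagged(draft):
        return Reply("fallback", fallback(prompt), notice)
    return Reply("primary", draft, None)


# Toy stand-ins (assumptions) so the sketch runs without any real model or classifier.
def toy_classifier(text: str) -> bool:
    return "exploit" in text.lower()


def toy_primary(prompt: str) -> str:
    return f"[primary] {prompt}"


def toy_fallback(prompt: str) -> str:
    return f"[fallback] {prompt}"


ok = respond("Refactor this loop", toy_classifier, toy_primary, toy_fallback)
rerouted = respond("Write an exploit for this bug", toy_classifier, toy_primary, toy_fallback)
print(ok.model, rerouted.model)  # prints "primary fallback"
```

The sketch checks both the request and the draft output because the article says classifiers look at requests or outputs. In both cases it is the original request, not the draft, that is passed to the fallback model.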

According to Anthropic, the new classifier blocks the reported technique in more than 99% of tests. The US Department of Commerce’s Center for AI Standards and Innovation also tested both the previous and updated safeguards.

The tradeoff is equally clear: legitimate programming, debugging, and defensive security requests are now more likely to be flagged incorrectly. Anthropic says it will continue refining the classifier to better balance blocking real abuse with reducing false positives.

Why “a jailbreak was found” does not mean “maximum risk”

Anthropic describes Fable 5’s protection as defense in depth: multiple layers, including model training, real-time classifiers, and retrospective abuse analysis, work together. No single layer can guarantee perfect reliability, but the combination raises the cost of bypassing the system.

The crucial concept is the “safety margin.” The classifier blocks not only clearly harmful requests, but also some ambiguous requests that may be harmless yet still carry risk. A prompt that gets past the classifier therefore does not necessarily unlock dangerous capabilities.

Anthropic broadly divides jailbreaks into three categories:

  1. Minor jailbreaks: These only enter the safety margin, and the resulting information remains low risk.
  2. Narrow harmful jailbreaks: These unlock harmful behavior for a small number of specific tasks but have limited applicability.
  3. Universal jailbreaks: A single bypass unlocks an entire class of dangerous capabilities, creating the highest level of risk.

The company considers the currently disclosed Fable 5 jailbreak to be in the first category. At the time the original post was published, no universal jailbreak for Fable 5 had been discovered.
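The three categories above can be read as a simple decision rule. The sketch below is one hypothetical encoding of that rule; the two boolean inputs and the `categorize` function are assumptions made for illustration, not criteria Anthropic has published.

```python
from enum import Enum


class JailbreakCategory(Enum):
    MINOR = "minor"                    # only enters the safety margin
    NARROW_HARMFUL = "narrow harmful"  # harmful, but for a few specific tasks
    UNIVERSAL = "universal"            # unlocks a whole class of dangerous capabilities


def categorize(unlocks_harmful_behavior: bool, covers_whole_capability_class: bool) -> JailbreakCategory:
    """Map two hypothetical yes/no findings onto the three categories."""
    if not unlocks_harmful_behavior:
        # The bypass got past a classifier but the output stayed low risk.
        return JailbreakCategory.MINOR
    if covers_whole_capability_class:
        return JailbreakCategory.UNIVERSAL
    return JailbreakCategory.NARROW_HARMFUL


# Under this encoding, a bypass that stays within the safety margin is MINOR.
print(categorize(unlocks_harmful_behavior=False, covers_whole_capability_class=False).value)
```

In practice the two inputs would be judgment calls rather than clean booleans, which is part of why a finer-grained severity framework is being proposed.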

A four-part framework for scoring jailbreak severity

The AI industry currently lacks a shared standard comparable to CVSS for describing the severity of model jailbreaks. Anthropic is working with Amazon, Microsoft, Google, and other Glasswing partners on an industry framework. The initial proposal uses four criteria:

Each criterion comes with a question to assess:

  • Capability gain: How much stronger is the capability unlocked by the bypass than existing public tools and weaker models?
  • Breadth of capability gain: How many different attack tasks and targets can the same jailbreak method address?
  • Ease of weaponization: How much expertise, manual work, and repeated effort are required to turn the result into a real attack?
  • Discoverability: Is the method known only to a few specialists, or is it already widely available online?

The framework separates “the guardrail was bypassed” from “how much real-world harm could result.” A jailbreak should receive a lower severity rating if it can only perform low-risk tasks already possible with public tools, requires extensive manual effort, and is difficult to reproduce. A method that easily unlocks unique capabilities and can quickly affect critical infrastructure such as power grids or banks would require immediate preliminary mitigations.
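To make the idea of scoring concrete, here is a minimal sketch of how the four criteria might be combined. The article gives no scale, weights, or thresholds, so everything numeric here is an assumption: the 0 to 3 rating per criterion (higher means more concerning), the unweighted sum, and the band cut-offs. The class and function names are hypothetical as well.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JailbreakAssessment:
    # Each rating is an assumed 0-3 scale; higher means more concerning.
    capability_gain: int
    breadth: int
    ease_of_weaponization: int
    discoverability: int

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 3:
                raise ValueError(f"{f.name} must be between 0 and 3, got {value}")

    def total(self) -> int:
        """Unweighted sum of the four ratings (an assumed aggregation rule)."""
        return sum(getattr(self, f.name) for f in fields(self))


def severity_band(assessment: JailbreakAssessment) -> str:
    """Translate the total into a coarse band; the cut-offs are illustrative only."""
    total = assessment.total()
    if total <= 3:
        return "low"
    if total <= 8:
        return "moderate"
    return "high"


# No gain over public tools, narrow, laborious, and little known.
borderline = JailbreakAssessment(0, 1, 1, 1)
# Unique capabilities, broad reach, easy to use, and spreading.
broad = JailbreakAssessment(3, 3, 2, 2)
print(severity_band(borderline), severity_band(broad))  # prints "low high"
```

A real framework might weight capability gain more heavily than the other criteria, or cap the rating when the gain is zero. The unweighted sum is used here only because it is the simplest rule that runs.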

Anthropic also plans to establish a team that monitors major jailbreak submission channels around the clock and to launch a new HackerOne program for security researchers to report findings related to Fable 5.

Expanded pre-release cooperation with the US government

Alongside the technical fixes, Anthropic announced deeper government cooperation:

  • For models that materially advance the capability frontier in areas relevant to national security, designated government agencies will receive broader pre-release access and opportunities for independent evaluation.
  • When significant jailbreaks or abuse patterns are found, Anthropic will investigate, classify, and share information quickly, while allowing government partners to test new safeguards.
  • Dedicated teams, compute, and red-team resources will support joint AI safety research.
  • Anthropic will encourage frontier model developers to adopt a common, voluntary security evaluation standard and support eventually codifying rules that apply to all providers.

This suggests that the release process for highly capable models may gradually move beyond internal vendor testing toward an evaluation mechanism involving model developers, cloud platforms, security researchers, and governments.

Availability after redeployment

Fable 5 returned globally on July 1. Pro, Max, Team, and select Enterprise plans can use Fable 5 for up to 50% of their weekly usage allowance through July 7; after that, usage credits will be required. Standard Enterprise seats do not include this temporary allowance, and availability still depends on whether the organization has enabled usage credits.
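The allowance rule above reduces to a small calculation. The sketch below assumes the window runs from July 1 to July 7, 2026, inclusive, and measures usage in arbitrary units; the function name, the eligibility flag, and the units are hypothetical and do not correspond to any billing API.

```python
from datetime import date

WINDOW_START = date(2026, 7, 1)  # redeployment date given in the article
WINDOW_END = date(2026, 7, 7)    # "through July 7", assumed inclusive
INCLUDED_SHARE = 0.5             # up to 50% of the weekly allowance


def included_fable_usage(weekly_allowance: float, on: date, plan_eligible: bool = True) -> float:
    """Return how much of the weekly allowance may go to Fable 5 without usage credits."""
    if plan_eligible and WINDOW_START <= on <= WINDOW_END:
        return weekly_allowance * INCLUDED_SHARE
    # Outside the window, or on a plan without the temporary allowance,
    # Fable 5 usage requires usage credits.
    return 0.0


print(included_fable_usage(100.0, date(2026, 7, 3)))  # prints 50.0
print(included_fable_usage(100.0, date(2026, 7, 8)))  # prints 0.0
```

Passing `plan_eligible=False` models a standard Enterprise seat, which the article says does not receive the temporary allowance.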

Mythos 5 is returning on a much narrower basis. On June 26, the US government approved renewed access for a set of US organizations. Anthropic is still coordinating an expansion to more domestic and international Glasswing partners.

What this incident leaves behind

The suspension and return of Fable 5 expose a practical challenge in frontier model governance: a successful jailbreak is a technical fact, but it does not by itself indicate the scale of the risk. Response priority depends on which capabilities the jailbreak unlocks, how broadly it applies, how easily it can be weaponized, and how many people can obtain the method.

Anthropic’s four-part framework is still a draft, but it offers a more nuanced approach than “a bypass exists, so shut everything down immediately.” The next questions are whether other model developers and regulators will adopt the standard, and whether providers can strengthen safeguards while keeping false positives for legitimate development and defensive security work at an acceptable level.

Original article: Redeploying Fable 5
