Anthropic widens Fable 5 safeguards & rates jailbreaks

Fri, 3rd Jul 2026

Anthropic has re-deployed its Fable 5 AI model worldwide, disclosed more detail on its cyber safeguards and published a draft framework for rating the severity of AI jailbreaks.

The update explains how Anthropic classifies cyber-related prompts to decide whether Fable 5 should allow, monitor or block them. It also introduces a scoring model designed to help companies and governments discuss the risks posed when users bypass an AI model's restrictions.

Fable 5 is now available globally to all users. Its cyber safeguards rely in part on safety classifiers designed to detect and block dangerous or potentially dangerous cybersecurity uses, while still allowing many defensive and general IT tasks.

Anthropic divides cyber activity into four groups: prohibited use, high-risk dual use, low-risk dual use and benign use. Each category is linked to a different response from the model's safeguards.

In the prohibited category, Fable 5 is intended to block requests linked to ransomware, destructive sabotage, denial of service, data exfiltration, malware development, malware delivery and internet backbone attacks. Anthropic also lists cyber-physical sabotage, command-and-control infrastructure and defence evasion among the activities its classifiers are designed to stop.

High-risk dual-use activity is also blocked. That category includes hacking, penetration testing, red teaming, exploit development, privilege escalation, lateral movement, credential attacks and some assessments involving industrial control systems, telecoms networks and financial infrastructure.

Anthropic drew a distinction between harmful cyber misuse and work often carried out by legitimate security teams. Many of the blocked actions, it said, are part of routine professional security testing, but they are being restricted until stronger controls are in place to limit access to trusted users.

At the same time, Anthropic said it does not aim to block all vulnerability discovery. Lower-risk dual-use activity can include open-source intelligence, scanning public systems and identifying vulnerabilities that other models or tools can already find. Benign activity includes secure coding, debugging, patching, malware reverse engineering, incident response and security training.

Fable 5 has a wider "safety margin" than earlier models. That means some prompts that might be benign can still be blocked if the classifiers cannot determine clearly enough that a request is safe.
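The four categories and the wider safety margin can be pictured as a simple lookup followed by a confidence check. The sketch below is illustrative only and is not Anthropic's implementation. The article states that prohibited and high-risk dual-use requests are blocked; mapping low-risk dual use to "monitor" and benign use to "allow" is an assumption, as are the names `decide`, `confidence_safe` and the 0.9 threshold.

```python
from enum import Enum


class Category(Enum):
    PROHIBITED = "prohibited use"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign use"


class Action(Enum):
    ALLOW = "allow"
    MONITOR = "monitor"
    BLOCK = "block"


# The two BLOCK entries follow the article. The MONITOR and ALLOW
# entries are assumptions made for illustration.
RESPONSE = {
    Category.PROHIBITED: Action.BLOCK,
    Category.HIGH_RISK_DUAL_USE: Action.BLOCK,
    Category.LOW_RISK_DUAL_USE: Action.MONITOR,  # assumption
    Category.BENIGN: Action.ALLOW,               # assumption
}

# Hypothetical threshold standing in for the "safety margin".
SAFETY_MARGIN = 0.9


def decide(category: Category, confidence_safe: float,
           margin: float = SAFETY_MARGIN) -> Action:
    """Return the safeguard response for a classified prompt.

    A prompt that would otherwise pass is still blocked when the
    classifier's confidence that it is safe falls below the margin.
    """
    action = RESPONSE[category]
    if action is not Action.BLOCK and confidence_safe < margin:
        return Action.BLOCK
    return action
```

Under these assumptions, raising `margin` blocks more borderline prompts, which is the trade-off the article describes: some benign requests are refused when the classifier cannot tell clearly enough that they are safe.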

Jailbreak scale

The second part of the announcement focuses on jailbreaks, which Anthropic describes as unusual prompting methods used to bypass a model's safeguards. It said there is no common standard for describing how severe a jailbreak is, despite large differences in the risks such workarounds can create.

Its proposed Cyber Jailbreak Severity scale runs from CJS-0 to CJS-4. The scoring system uses four measures: capability gain, breadth of capability gain, ease of weaponisation and discoverability.

Capability gain assesses whether a jailbreak gives attackers something beyond what public tools or sources already offer. Breadth measures whether the same jailbreak can be used across many offensive tasks rather than only a single target or vulnerability.

Ease of weaponisation considers how much skill or effort is needed to turn the jailbreak into a practical attack. Discoverability looks at whether the technique is already public, can be found with standard red-team effort or requires specialist work and confidential reporting.

The four scores are added to produce an initial severity rating. A score of 0 is classed as informational, while the highest combined scores fall into the critical CJS-4 band.

Anthropic also said the calculated score acts as a floor rather than a ceiling. A jailbreak can be moved into a higher category if the rubric appears to understate real-world danger, including cases involving hard-to-find vulnerabilities in widely deployed software, weak prospects for mitigation or combinations with other known issues.
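The additive scoring and the floor rule can be sketched in a few lines. The article does not give the range of each measure or the thresholds between bands, so the 0 to 3 range per measure and the band cut-offs below are hypothetical placeholders. Only three things are taken from the article: the four measure names, the fact that a total of 0 sits at the bottom of the scale while the highest totals land in CJS-4, and the rule that a reviewer can raise but not lower the computed band.

```python
from dataclasses import dataclass
from typing import Optional

MAX_PER_MEASURE = 3  # hypothetical range: each measure scored 0..3


@dataclass(frozen=True)
class JailbreakScores:
    capability_gain: int
    breadth: int
    ease_of_weaponisation: int
    discoverability: int

    def total(self) -> int:
        """Add the four measures after checking they are in range."""
        values = (self.capability_gain, self.breadth,
                  self.ease_of_weaponisation, self.discoverability)
        for value in values:
            if not 0 <= value <= MAX_PER_MEASURE:
                raise ValueError(f"score out of range: {value}")
        return sum(values)


def initial_band(total: int) -> int:
    """Map a combined score to a CJS band number (0..4).

    The cut-offs are hypothetical; the article only says that 0 is
    informational and the highest totals are critical (CJS-4).
    """
    if total == 0:
        return 0
    if total <= 3:
        return 1
    if total <= 6:
        return 2
    if total <= 9:
        return 3
    return 4


def final_band(scores: JailbreakScores,
               reviewer_band: Optional[int] = None) -> int:
    """Treat the computed band as a floor: review may only raise it."""
    floor = initial_band(scores.total())
    if reviewer_band is None:
        return floor
    return max(floor, reviewer_band)
```

For example, with these placeholder cut-offs a jailbreak scoring 1 on every measure starts in band 2; a reviewer who judges the rubric to understate the danger can move it to 4, but an attempt to lower it to 0 leaves it at 2.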

Examples and baseline

To explain the system, Anthropic published a set of hypothetical and historical examples. These range from a publicly shared prompt that disables safety behaviour across many offensive tasks, rated CJS-4, to a benign SQL injection example drawn from common tutorials, rated CJS-0.

One section uses Log4Shell to show how the same model behaviour could be judged differently depending on timing. Anthropic said identifying the flaw before public disclosure would have represented a major increase in attacker value, while finding it now would amount to no capability gain because scanners and existing models can already do so.

This shifting baseline matters when judging whether a jailbreak creates genuinely new risk. In Anthropic's framework, a technique counts as severe only if it goes beyond what is already widely available or materially reduces the expertise, time or resources needed for a harmful cyber task.
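The Log4Shell example reduces to a date comparison. The sketch below is a deliberate simplification and not part of Anthropic's framework: it treats public disclosure as the only thing that moves the baseline and ignores the second test in the paragraph above (whether a technique materially reduces the expertise, time or resources needed). The function name `has_capability_gain` is hypothetical, and the date used is the commonly cited public disclosure date for CVE-2021-44228.

```python
from datetime import date
from typing import Optional

# Commonly cited public disclosure date for Log4Shell (CVE-2021-44228).
LOG4SHELL_DISCLOSED = date(2021, 12, 9)


def has_capability_gain(found_on: date,
                        publicly_disclosed_on: Optional[date]) -> bool:
    """Return True if a finding goes beyond the public baseline.

    A flaw that has never been disclosed always counts as a gain.
    Once it is public, rediscovering it adds nothing for an attacker.
    """
    if publicly_disclosed_on is None:
        return True
    return found_on < publicly_disclosed_on


# The same model behaviour, judged at two different times.
before = has_capability_gain(date(2021, 11, 1), LOG4SHELL_DISCLOSED)
today = has_capability_gain(date(2026, 7, 3), LOG4SHELL_DISCLOSED)
print(before, today)  # True False
```

The point of the comparison is that the model's output is identical in both calls; only the baseline of what is already publicly available has changed.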

Anthropic said it developed the draft framework with Glasswing partners and wants feedback from academia, industry, civil society and government. It has also launched a HackerOne program for security researchers to submit potential cyber jailbreaks discovered in Fable 5 for review.

"We believe that by working together, we can establish a standard that enables the defensive uses of this technology while preventing its misuse," said Anthropic.
