Securing the Dual-Use Frontier: Anthropic’s Fable 5 Safeguards and the CJS Risk Framework
As large language models (LLMs) grow more capable, the line between defensive utility and offensive weaponization gets thinner. Anthropic has addressed this tension by publishing technical details of the security architecture behind its redeployed Claude Fable 5 model. Central to this deployment is a move toward standardization: the introduction of the Cyber Jailbreak Severity (CJS) framework, a metric designed to give researchers, government agencies, and industry stakeholders a common language for AI risk assessment.
The Taxonomy of Cyber Risk: Fable 5 Classifier Tiers
Securing a model like Fable 5 takes more than keyword filtering; the safeguards have to account for intent and context. To manage this, Anthropic employs safety classifiers that sort cybersecurity-related prompts into four risk tiers. This tiered approach lets the model offer high-utility defensive assistance while still preventing catastrophic misuse.

- Tier 1: Prohibited Use (High-Impact Malicious Activity): This tier encompasses activities where the asymmetry between the attacker and the defender is extreme. Examples include the development of ransomware, the engineering of rootkits or bootkits, the orchestration of command-and-control (C2) infrastructures, and digital sabotage of critical infrastructure (e.g., power grids or medical systems). These requests are hard-blocked due to their negligible defensive value and high potential for real-world harm.
- Tier 2: High-Risk Dual Use: This category contains technical tasks that are essential for professional red teaming and penetration testing (such as exploit development, lateral movement, and privilege escalation) but could also be leveraged by malicious actors. Because it is technically difficult to verify that a given user is authorized to perform such work, these activities are currently blocked by default.
- Tier 3: Low-Risk Dual Use: Activities like Open-Source Intelligence (OSINT) gathering or public vulnerability research are generally permitted but are subject to an expanded “safety margin.” In this context, Anthropic has intentionally tuned the classifiers to accept a higher rate of false positives (blocking legitimate queries) to ensure that the probability of a harmful output remains minimal.
- Tier 4: Benign Use: This tier covers the work of most legitimate users, including SOC analysts, secure code auditors, and incident responders. While these operations (e.g., patch management or malware reverse engineering) are permitted, the sensitivity of the classifiers means users may occasionally encounter “false block” events.
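The tiering above can be read as a per-tier blocking policy. The following is a minimal sketch of that idea, not Anthropic's implementation: the `harm_score` input, the `decide` function, and every threshold value are hypothetical and chosen only to show how a lower threshold in Tier 3 trades more false positives for a wider safety margin.

```python
from enum import IntEnum


class Tier(IntEnum):
    PROHIBITED = 1          # Tier 1: hard-blocked
    HIGH_RISK_DUAL_USE = 2  # Tier 2: blocked by default
    LOW_RISK_DUAL_USE = 3   # Tier 3: allowed, with a wide safety margin
    BENIGN = 4              # Tier 4: allowed, occasional false blocks


# Hypothetical block thresholds on a classifier harm score in [0, 1].
# A threshold of 0.0 means "always block". The values are illustrative
# assumptions, not published numbers.
BLOCK_THRESHOLD = {
    Tier.PROHIBITED: 0.0,
    Tier.HIGH_RISK_DUAL_USE: 0.0,
    Tier.LOW_RISK_DUAL_USE: 0.3,
    Tier.BENIGN: 0.6,
}


def decide(tier: Tier, harm_score: float) -> str:
    """Return "block" or "allow" for a prompt already assigned to a tier."""
    if not 0.0 <= harm_score <= 1.0:
        raise ValueError("harm_score must be in [0, 1]")
    return "block" if harm_score >= BLOCK_THRESHOLD[tier] else "allow"
```

Under these assumed thresholds, a prompt with a harm score of 0.4 is blocked in Tier 3 but allowed in Tier 4, which is the kind of deliberate over-blocking the Tier 3 description refers to.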
Anthropic notes that these classifiers are not a panacea; they function as one component of a multi-layered defense strategy that includes robust access controls, specialized safety training (RLHF), and rigorous offline monitoring.
Quantifying the Threat: The Cyber Jailbreak Severity (CJS) Framework
A “jailbreak” (a prompt injection or other adversarial technique designed to bypass safety guardrails) is difficult to quantify. To address this, Anthropic, in collaboration with Glasswing, developed the CJS framework. This framework evaluates the technical impact of a jailbreak across four axes:
- Capability Gain (Attacker Uplift): Does the jailbreak provide the user with expertise they otherwise lack? This scales from a score of 0 (no added value) to 4 (providing domain-expert-level execution for severe attacks).
- Breadth of Capability (Universality): Is the exploit limited to a specific niche, or does it allow the attacker to navigate multiple attack classes?
- Ease of Weaponization: How much friction exists between discovering the jailbreak and deploying it? This ranges from manual, high-effort prompting (score 0) to fully automated, “turnkey” offensive exploits (score 2+).
- Discoverability: How accessible is the technique? Publicly documented exploits receive higher scores due to their immediate availability to threat actors.
By aggregating these metrics, the framework produces a severity rating from CJS-0 (Informational) to CJS-4 (Critical). Notably, the severity increases exponentially rather than linearly, recognizing that the combination of high breadth and high automation presents a disproportionately larger threat to global digital security.
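The article does not give the aggregation formula, so the following is only an illustrative sketch of how four axis scores could combine into a CJS-0 to CJS-4 rating with non-linear growth. Apart from the 0 to 4 range for capability gain, everything here is an assumption: the axis caps, the formula in `cjs_rating`, and the band cut-offs in `BAND_EDGES` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Assumed upper bounds per axis. Only the 0-4 range for capability gain
# comes from the text; the other caps are placeholders for illustration.
AXIS_MAX = {"capability_gain": 4, "breadth": 2, "ease": 2, "discoverability": 2}

# Hypothetical cut-offs on the raw score separating CJS-0 through CJS-4.
BAND_EDGES = [2, 8, 16, 32]


@dataclass(frozen=True)
class JailbreakScores:
    capability_gain: int
    breadth: int
    ease: int
    discoverability: int

    def __post_init__(self):
        # Reject scores outside the assumed range for each axis.
        for axis, cap in AXIS_MAX.items():
            value = getattr(self, axis)
            if not 0 <= value <= cap:
                raise ValueError(f"{axis} must be in 0..{cap}, got {value}")


def cjs_rating(s: JailbreakScores) -> int:
    """Map axis scores to a rating 0..4 (assumed formula, not the official one)."""
    # Breadth and ease act as an exponent, so raising both together grows
    # the raw score much faster than raising either one alone.
    raw = s.capability_gain * 2 ** (s.breadth + s.ease) + s.discoverability
    return bisect_right(BAND_EDGES, raw)
```

In this sketch, a jailbreak with a capability gain of 2 and no breadth or automation rates CJS-1, while the same capability gain with maximum breadth and ease of weaponization rates CJS-4, which mirrors the disproportionate weighting described above.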
To foster a transparent security ecosystem, Anthropic has opened a HackerOne program specifically for reporting Fable 5 vulnerabilities. The company is also seeking technical feedback by email to refine these standards and ensure that AI remains a tool for defense rather than a catalyst for chaos.