
An Approach to Technical AGI Safety and Security
Shah et al., 2025.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity.
We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment.
For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations.
To address misalignment, we outline two lines of defense:
- First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model.
- Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
There are four key AI risks:
- Misuse (bad actors)
- Misalignment (AI against developer intent)
- Mistakes (unintended harm)
- Structural risks (multi-agent dynamics)
AI capabilities are expected to scale continuously, possibly reaching powerful levels by 2030, with AI accelerating its own R&D.
Misuse Prevention
- Proactively evaluate dangerous capabilities (cyber, bio, persuasion) via threat modelling
- Deploy layered mitigations: safety training, jailbreak resistance, monitoring, access restrictions with KYC (see the sketch after this list)
- Secure model weights through least privilege, hardened environments, confidential computing
- Red-team mitigations; build safety cases proving inability or stress-tested defenses
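As a rough illustration of how these layers could compose at serving time, here is a minimal Python sketch. The function names (`verify_kyc`, `score_dangerous_capability`, `call_model`) and the threshold are hypothetical placeholders for illustration, not APIs or values from the paper.

```python
# Minimal sketch of layered misuse mitigations at serving time.
# All function names and thresholds are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    prompt: str


def verify_kyc(user_id: str) -> bool:
    """Placeholder: check the user against an access-control / KYC registry."""
    return user_id in {"vetted-user-1", "vetted-user-2"}


def score_dangerous_capability(text: str) -> float:
    """Placeholder: classifier score in [0, 1] for cyber/bio/persuasion misuse."""
    return 0.9 if "synthesize pathogen" in text.lower() else 0.05


def log_for_review(request: Request) -> None:
    print(f"[audit] flagged request from {request.user_id}")


def call_model(prompt: str) -> str:
    return f"(model response to: {prompt!r})"


def handle(request: Request) -> str:
    # Layer 1: access restriction -- unverified users never reach the model.
    if not verify_kyc(request.user_id):
        return "REFUSED: unverified user"
    # Layer 2: monitoring -- flag prompts that look like dangerous-capability misuse.
    if score_dangerous_capability(request.prompt) > 0.5:
        log_for_review(request)  # escalate to human review
        return "REFUSED: flagged for review"
    # Layer 3: the safety-trained, jailbreak-resistant model itself.
    return call_model(request.prompt)


if __name__ == "__main__":
    print(handle(Request("vetted-user-1", "Explain TLS handshakes")))
    print(handle(Request("anon", "Explain TLS handshakes")))
```

The point of the layering is that each check only has to catch what the previous ones miss: even if safety training is jailbroken, access restrictions and monitoring still stand between a threat actor and the dangerous capability.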
Misalignment Prevention
- Dual approach: train for alignment, then contain/monitor assuming failure
- Amplified oversight: AI assists human supervision through debate/critiques as capability gaps widen
- Guide behaviour via reward models, RLAIF, inference-time steering
- Expand training to failure-prone regions using adversarial data
- Deploy hierarchical monitors with calibrated uncertainty for escalation (sketched below)
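The last bullet can be pictured as a two-tier monitor: a cheap monitor reports both a suspicion score and a calibrated confidence, and the decision is escalated to a costlier monitor (or a human) whenever confidence is low. A minimal sketch follows; `cheap_monitor`, `expensive_monitor`, and both thresholds are illustrative assumptions, not values from the paper.

```python
# Sketch of hierarchical monitoring with uncertainty-based escalation.
# cheap_monitor / expensive_monitor and all thresholds are hypothetical.

def cheap_monitor(action: str) -> tuple[float, float]:
    """Return (suspicion score in [0, 1], calibrated confidence in [0, 1])."""
    suspicious = "rm -rf" in action
    return (0.95, 0.9) if suspicious else (0.1, 0.6)


def expensive_monitor(action: str) -> float:
    """Slower, more capable monitor (e.g. a larger model or human review)."""
    return 0.99 if "rm -rf" in action else 0.02


def review(action: str,
           escalate_below_confidence: float = 0.8,
           block_above_score: float = 0.5) -> str:
    score, confidence = cheap_monitor(action)
    # Escalate whenever the cheap monitor is not confident enough.
    if confidence < escalate_below_confidence:
        score = expensive_monitor(action)
    return "BLOCK" if score > block_above_score else "ALLOW"


if __name__ == "__main__":
    print(review("ls -la /tmp"))                   # low confidence -> escalated, then allowed
    print(review("rm -rf / --no-preserve-root"))   # confidently suspicious -> blocked
```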
Control & Security
- Treat AI as untrusted insider: sandbox tools, restrict affordances, audit actions
- Execute, replace, audit, or shut down when monitors flag issues (see the policy sketch after this list)
- Implement safer designs: corrigibility, bounded autonomy, myopic optimization, externalized reasoning
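The execute/replace/audit/shutdown decision can be written as a simple policy over a monitor's suspicion score, with a trusted (weaker) model supplying the replacement action. The thresholds below are illustrative assumptions, not values from the paper.

```python
# Sketch of a control policy over an untrusted model's proposed action.
# Thresholds and the trusted fallback are illustrative assumptions.

from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"      # run the untrusted model's proposed action
    REPLACE = "replace"      # substitute a trusted (weaker) model's action
    AUDIT = "audit"          # pause the action and send it to human review
    SHUTDOWN = "shutdown"    # halt the deployment entirely


def control_policy(suspicion: float) -> Decision:
    if suspicion < 0.2:
        return Decision.EXECUTE
    if suspicion < 0.5:
        return Decision.REPLACE
    if suspicion < 0.9:
        return Decision.AUDIT
    return Decision.SHUTDOWN


if __name__ == "__main__":
    for s in (0.05, 0.3, 0.7, 0.95):
        print(f"suspicion={s:.2f} -> {control_policy(s).value}")
```

This is the "untrusted insider" stance in miniature: the AI's actions are never executed on trust alone, and the response escalates with the monitor's suspicion.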
Technical Enablers
- Interpretability: detect deception, validate reasoning, support monitoring (a probe sketch follows this list)
- Alignment stress tests for deceptive alignment and sandbagging
- Safety cases: inability, supervision/control, incentive-based arguments
- Future: understanding-based verification beyond empirical defences
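One common interpretability tool in this space is a linear probe on hidden activations. The sketch below trains such a probe to separate "honest" from "deceptive" activations and exposes its score as a monitor signal; the activations here are synthetic stand-ins generated with an assumed "deception direction", purely to illustrate the mechanism.

```python
# Sketch of an interpretability-style linear probe: train a classifier on
# (synthetic) hidden activations labelled honest vs. deceptive, then use it
# as a cheap monitor. Real activations would come from the model itself;
# the data here is synthetic and purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Assume a "deception direction" exists in activation space.
deception_dir = rng.normal(size=d_model)
deception_dir /= np.linalg.norm(deception_dir)

honest = rng.normal(size=(n, d_model))
deceptive = rng.normal(size=(n, d_model)) + 2.0 * deception_dir

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy on training data: {probe.score(X, y):.2f}")

# At deployment, the probe's score on a new activation could feed the
# monitoring / escalation pipeline described above.
new_activation = rng.normal(size=(1, d_model)) + 2.0 * deception_dir
print(f"P(deceptive) = {probe.predict_proba(new_activation)[0, 1]:.2f}")
```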
Key Principle: Iteratively harden mitigations in lockstep with scaling capabilities, acknowledging that many components remain early-stage.