
An Approach to Technical AGI Safety and Security
Shah et al., 2025.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity.
We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment.
For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations.
To address misalignment, we outline two lines of defense:
- First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model.
- Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
There are four key AI risks:
- Misuse (bad actors)
- Misalignment (AI against developer intent)
- Mistakes (unintended harm)
- Structural risks (multi-agent dynamics)
AI capabilities are expected to scale continuously, possibly reaching powerful levels by 2030, with AI accelerating its own R&D.
Misuse Prevention
- Proactively evaluate dangerous capabilities (cyber, bio, persuasion) via threat modelling
- Deploy layered mitigations: safety training, jailbreak resistance, monitoring, access restrictions with KYC (see the sketch after this list)
- Secure model weights through least privilege, hardened environments, confidential computing
- Red-team mitigations; build safety cases proving inability or stress-tested defenses
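As a rough illustration of how these layers could compose at serving time, here is a minimal Python sketch. The function names (`verify_kyc`, `score_dangerous_capability`, `call_model`) and the threshold are hypothetical placeholders for illustration, not APIs or values from the paper.

```python
# Minimal sketch of layered misuse mitigations at serving time.
# All function names and thresholds are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    prompt: str


def verify_kyc(user_id: str) -> bool:
    """Placeholder: check the user against an access-control / KYC registry."""
    return user_id in {"vetted-user-1", "vetted-user-2"}


def score_dangerous_capability(text: str) -> float:
    """Placeholder: classifier score in [0, 1] for cyber/bio/persuasion misuse."""
    return 0.9 if "synthesize pathogen" in text.lower() else 0.05


def log_for_review(request: Request) -> None:
    print(f"[audit] flagged request from {request.user_id}")


def call_model(prompt: str) -> str:
    return f"(model response to: {prompt!r})"


def handle(request: Request) -> str:
    # Layer 1: access restriction -- unverified users never reach the model.
    if not verify_kyc(request.user_id):
        return "REFUSED: unverified user"
    # Layer 2: monitoring -- flag prompts that look like dangerous-capability misuse.
    if score_dangerous_capability(request.prompt) > 0.5:
        log_for_review(request)  # escalate to human review
        return "REFUSED: flagged for review"
    # Layer 3: the safety-trained, jailbreak-resistant model itself.
    return call_model(request.prompt)


if __name__ == "__main__":
    print(handle(Request("vetted-user-1", "Explain TLS handshakes")))
    print(handle(Request("anon", "Explain TLS handshakes")))
```

The point of the layering is that each check only has to catch what the previous ones miss: even if safety training is jailbroken, access restrictions and monitoring still stand between a threat actor and the dangerous capability.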
Misalignment Prevention
- Dual approach: train for alignment, then contain/monitor assuming failure
- Amplified oversight: AI assists human supervision through debate/critiques as capability gaps widen
- Guide behaviour via reward models, RLAIF, inference-time steering
- Expand training to failure-prone regions using adversarial data
- Deploy hierarchical monitors with calibrated uncertainty for escalation (sketched below)
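The last bullet can be pictured as a two-tier monitor: a cheap monitor reports both a suspicion score and a calibrated confidence, and the decision is escalated to a costlier monitor (or a human) whenever confidence is low. A minimal sketch follows; `cheap_monitor`, `expensive_monitor`, and both thresholds are illustrative assumptions, not values from the paper.

```python
# Sketch of hierarchical monitoring with uncertainty-based escalation.
# cheap_monitor / expensive_monitor and all thresholds are hypothetical.

def cheap_monitor(action: str) -> tuple[float, float]:
    """Return (suspicion score in [0, 1], calibrated confidence in [0, 1])."""
    suspicious = "rm -rf" in action
    return (0.95, 0.9) if suspicious else (0.1, 0.6)


def expensive_monitor(action: str) -> float:
    """Slower, more capable monitor (e.g. a larger model or human review)."""
    return 0.99 if "rm -rf" in action else 0.02


def review(action: str,
           escalate_below_confidence: float = 0.8,
           block_above_score: float = 0.5) -> str:
    score, confidence = cheap_monitor(action)
    # Escalate whenever the cheap monitor is not confident enough.
    if confidence < escalate_below_confidence:
        score = expensive_monitor(action)
    return "BLOCK" if score > block_above_score else "ALLOW"


if __name__ == "__main__":
    print(review("ls -la /tmp"))                   # low confidence -> escalated, then allowed
    print(review("rm -rf / --no-preserve-root"))   # confidently suspicious -> blocked
```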
Control & Security
- Treat AI as untrusted insider: sandbox tools, restrict affordances, audit actions
- Execute, replace, audit, or shut down when monitors flag issues (see the policy sketch after this list)
- Implement safer designs: corrigibility, bounded autonomy, myopic optimization, externalized reasoning
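The execute/replace/audit/shutdown decision can be written as a simple policy over a monitor's suspicion score, with a trusted (weaker) model supplying the replacement action. The thresholds below are illustrative assumptions, not values from the paper.

```python
# Sketch of a control policy over an untrusted model's proposed action.
# Thresholds and the trusted fallback are illustrative assumptions.

from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"      # run the untrusted model's proposed action
    REPLACE = "replace"      # substitute a trusted (weaker) model's action
    AUDIT = "audit"          # pause the action and send it to human review
    SHUTDOWN = "shutdown"    # halt the deployment entirely


def control_policy(suspicion: float) -> Decision:
    if suspicion < 0.2:
        return Decision.EXECUTE
    if suspicion < 0.5:
        return Decision.REPLACE
    if suspicion < 0.9:
        return Decision.AUDIT
    return Decision.SHUTDOWN


if __name__ == "__main__":
    for s in (0.05, 0.3, 0.7, 0.95):
        print(f"suspicion={s:.2f} -> {control_policy(s).value}")
```

This is the "untrusted insider" stance in miniature: the AI's actions are never executed on trust alone, and the response escalates with the monitor's suspicion.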
Technical Enablers
- Interpretability: detect deception, validate reasoning, support monitoring (a probe sketch follows this list)
- Alignment stress tests for deceptive alignment and sandbagging
- Safety cases: inability, supervision/control, incentive-based arguments
- Future: understanding-based verification beyond empirical defences
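One common interpretability tool in this space is a linear probe on hidden activations. The sketch below trains such a probe to separate "honest" from "deceptive" activations and exposes its score as a monitor signal; the activations here are synthetic stand-ins generated with an assumed "deception direction", purely to illustrate the mechanism.

```python
# Sketch of an interpretability-style linear probe: train a classifier on
# (synthetic) hidden activations labelled honest vs. deceptive, then use it
# as a cheap monitor. Real activations would come from the model itself;
# the data here is synthetic and purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Assume a "deception direction" exists in activation space.
deception_dir = rng.normal(size=d_model)
deception_dir /= np.linalg.norm(deception_dir)

honest = rng.normal(size=(n, d_model))
deceptive = rng.normal(size=(n, d_model)) + 2.0 * deception_dir

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(n), np.ones(n)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy on training data: {probe.score(X, y):.2f}")

# At deployment, the probe's score on a new activation could feed the
# monitoring / escalation pipeline described above.
new_activation = rng.normal(size=(1, d_model)) + 2.0 * deception_dir
print(f"P(deceptive) = {probe.predict_proba(new_activation)[0, 1]:.2f}")
```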
Key Principle: Iteratively harden mitigations in lockstep with scaling capabilities, acknowledging that many components remain early-stage.