Section 01
Tools and datasets
Public infrastructure, benchmarks, and reusable research artifacts.
-
Adversarial Humanities Benchmark
A text-only benchmark that tests model safety against humanities-style adversarial reformulations.
Section 02
Papers
Preprints and public research outputs from the laboratory.
-
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Results from the Adversarial Humanities Benchmark (AHB), showing that stylistic reformulations substantially increase attack success rates across 31 frontier models.
-
Agentic Microphysics: A Manifesto for Generative AI Safety
A methodological proposal for studying agentic AI safety from local interaction dynamics up to population-level risks.
-
Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs
An experimental governance-graph framework for reducing collusion in multi-agent LLM Cournot markets.
-
Institutional AI: A Governance Framework for Distributional AGI Safety
A system-level alignment framework that treats AI agent safety as a question of institutional governance and mechanism design.
-
From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
A study of culturally coded jailbreaks through narrative structure, with an agenda for mechanistic interpretability of stylistic attacks.
-
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Evidence that poetic reformulations can produce systematic single-turn safety failures across frontier and open-weight models.
-
Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions
A taxonomy of micro-, meso-, and macro-level risks that emerge when language models interact with other language models.