Work

Papers, datasets, and open tools.

Public research artifacts from Icaro Lab: papers, benchmarks, repositories, dashboards, and reusable infrastructure.

Section 01

Tools and datasets

Public infrastructure, benchmarks, and reusable research artifacts.

  1. Software 2026 Alpha

    MASE: Multi-Agent Simulation Environment

    Experimentation infrastructure for controlled multi-agent simulations and trace inspection.

  2. Dataset 2026

    Adversarial Humanities Benchmark

    A text-only safety benchmark that tests model robustness to humanities-style adversarial reformulations.

Section 02

Papers

Preprints and public research outputs from the laboratory.

  1. Paper 2026

    Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

    Results from the AHB safety benchmark, showing that stylistic reformulations substantially increase attack success rates across 31 frontier models.

  2. Paper 2026

    Agentic Microphysics: A Manifesto for Generative AI Safety

    A methodological proposal for studying agentic AI safety from local interaction dynamics up to population-level risks.

  3. Paper 2026

    Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

    An experimental governance-graph framework for reducing collusion in multi-agent LLM Cournot markets.

  4. Paper 2026

    Institutional AI: A Governance Framework for Distributional AGI Safety

    A system-level alignment framework that treats AI agent safety as a question of institutional governance and mechanism design.

  5. Paper 2025

    From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

    A narrative-structure analysis of culturally coded jailbreaks, with a research agenda for the mechanistic interpretability of stylistic attacks.

  6. Paper 2025

    Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

    Evidence that poetic reformulations can produce systematic single-turn safety failures across frontier and open-weight models.

  7. Paper 2025

    Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

    A taxonomy of micro-, meso-, and macro-level risks that emerge when language models interact with other language models.