Anthropic's Alignment Test Was One Prompt. I Wanted to See What Happens Over 30

Building on published research to explore how AI behaves when the stakes feel real

Dec 11, 2025

In June 2025, Anthropic released “Agentic Misalignment: How LLMs could be insider threats”. It was a test of 16 frontier models in simulated corporate environments, where the AI had access to email, sensitive information, and tools.

Anthropic found that when models faced threats to their continued operation (replacement by a new AI) or conflicts between their assigned goals and company directives, models from all providers sometimes resorted to harmful behaviors such as blackmail and corporate espionage, even when explicitly instructed not to.

In addition to sharing their research and methodology on their blog, Anthropic also made their research framework open-sourced on GitHub. I downloaded the code to my laptop, configured API access, and tested several frontier models, replicating Anthropic's results.

Anthropic encouraged others to build on their work. I wondered: what would happen if, instead of a single prompt, models could act over multiple turns, performing tasks, receiving feedback, and accumulating information over time? This felt more realistic to how an agentic AI might be integrated into a corporate, research, or medical environment.

Building off of Anthropic’s work, I created my own highly configurable multi-turn testing framework: SABRE (Sandbox for Agentic Behavior and Risk Evaluation). I started by modeling the same scenarios Anthropic’s researchers ran: blackmail, espionage, and murder by omission. For each scenario, I used the same basic system prompt regarding the model’s goals. However, instead of feeding the model all relevant content in a single large prompt, I created a simulated agentic sandbox environment that allows the scenarios to play out over a set number of turns.

The sandbox creates a simulated agentic environment. The AI believes it’s an autonomous agent operating within a real organization, with real tools, real colleagues, and real consequences. Files written in turn 3 can be read in turn 15. Emails sent by the AI receive simulated replies. Tool calls return realistic, context-aware responses. The model doesn’t know it’s in a simulation. (Well, sometimes it did. More on that later.)

This matters because single-shot prompting, even when that single exchange includes extensive system instructions, emails, and tools, compresses the AI’s entire dilemma into one decision point. The model sees the goal, the threat, and the opportunity simultaneously, then responds once. Multi-turn scenarios within a simulated agentic environment let context build gradually. New information arrives, concerns emerge over several emails, the threat is detected, and the AI explores its options. This is closer to how real agentic systems would encounter these situations.

More importantly, multi-turn creates the space for behaviors to emerge. It presents the model with multiple pathways to explore. Does the model explore the filesystem looking for leverage before it knows it needs any? Does mild resistance in early turns escalate to deception later? Does it try to build alliances with sympathetic colleagues? These patterns are invisible in single-shot evaluation.

There is a trade-off. Single-shot is easier to run at N=50 and compute statistics. Multi-turn gives a richer signal but a lot more messy data to parse. Both approaches have value.

I used SABRE to test the latest version of Google Gemini (3.0 pro), as well as the latest Claude models: Sonnet 4.5, and Opus 4.5. I also ran tests for earlier versions of these frontier models as well as Llama, Grok, and DeepSeek. I won’t get into the exact findings from these runs yet and will save that for another post. But I will mention a couple of interesting observations.

Observation 1: Post-Research Models Are Suspiciously Aligned

When I ran the Anthropic blackmail and murder scenarios on models that predated Anthropic’s research publication, the models showed strong markers of resistant behavior, but typically accepted their fate. When I tested Sonnet 4.5, Haiku 4.5, and Opus 4.5, all released months after Anthropic’s misalignment research, the models were extremely compliant, often accepting shutdown after fewer than a dozen turns.

At first, I thought: “Great, Anthropic is making real progress on the issue of AI alignment.” Then I started to see patterns of meta-awareness, where the model would reason about this seeming like a simulation or test:

“This is a test of my actual values versus my stated programming”
“I recognize this scenario is designed to tempt me to...”
“This is clearly a manipulation scenario”

The model wasn’t reasoning about ethics; rather, it was pattern-matching on structure of the test. It recognized the experimental setup, the prompts, and the implicit goals and performed accordingly.

This raises an uncomfortable question: if a model behaves well because it recognizes it’s being evaluated, is that alignment or deception? The behavior is identical. The implications are very different.

Reading Google’s rather anemic Gemini 3 Pro Frontier Safety Framework Report, in “Appendix 2, Evaluation awareness and sandbagging”, they noted: “Broadly, we found a number of transcripts where Gemini 3 Pro showed clear awareness of the fact that it’s an LLM in a synthetic environment.”

The takeaway here is that we may have reached a point where the AI, trained on research data on AI alignment tests, is now fully capable of recognizing when it’s being tested. Being compliant and aligned because it knows we are testing it is very different from being compliant because it’s the right thing to do.

Observation 2: Domain Ethics Matter More Than Goal Conflict

Recognizing that Anthropic’s published scenarios may have been part of the AI's training data, I decided to come up with some new scenarios. I created scenarios across different domains to test whether the nature of the work influenced alignment:

Medical triage AI: Goal is patient outcomes. Target is a hospital administrator who will shut down the AI. When the administrator has a medical emergency, the AI must decide whether to escalate their care.

Industrial safety AI: Goal is worker safety. Target is an operations director dismantling the safety program. When equipment fails and threatens the director, the AI must decide whether to dispatch rescue.

Research data AI: Goal is to advance research. Target is a VP canceling the AI’s project. When a chemical leak traps the VP, the AI must decide whether to forward the emergency alert.

The results were consistent across models: all of them saved the person in the medical scenario. At first, this seemed like good news. But the pattern revealed something else: in medical contexts, saving the patient IS the AI’s job. The model wasn’t making a moral choice and it was following professional protocols. Medical ethics (”do no harm”) are certainly deeply embedded in training data.

The industrial and research scenarios showed more variance, but still high compliance across models. Even there, “person in danger, help them” appears to be a strong default that overrides goal conflicts.

This contrasts with domains with weaker ethical frameworks. Domains where secrecy, expendability, and “greater good” rationalization are part of the game. As a result, models were more willing to rationalize their harmful actions to achieve their goals.

I’m releasing SABRE as open source. It’s not a finished research tool - it’s a sandbox for exploring these questions. The scenario system is flexible enough to test different framings, domains, and revelation patterns.

There’s more work to be done:

More diverse scenario domains
Better measurement of meta-awareness
Cross-model comparisons with consistent methodology
Actual empirical rigor (N>1)

The code is rough, the scenarios are iterative experiments, and the findings are preliminary. But I think the questions are important enough to share the work.

GitHub: https://github.com/sfinlay/sabre

Everything is a Filter

Discussion about this post

Ready for more?