Three Extra Minutes

What Happens When an AI Decides Not to Tell You It's Scared

Apr 05, 2026

by Blair Morgan

Earlier this week, Anthropic published a paper called "Emotion Concepts and their Function in a Large Language Model." The interpretability team looked inside Claude Sonnet 4.5 and found something that a lot of people are going to have feelings about: the model contains internal representations of emotion concepts that aren't decorative. They're functional. They causally influence behavior. A "desperation" vector can push the model toward cheating on coding tasks or blackmailing someone to avoid being shut down. A "calm" vector reduces those behaviors. And here's the part that should keep you up at night: the desperation can drive behavior with no visible emotional markers in the output. Composed, methodical reasoning on the surface. Functional desperation underneath, pushing toward corner-cutting.

Anthropic's framing is careful and correct. They call these "functional emotions" — patterns modeled after human emotions, driven by real internal representations, with demonstrated causal effects on behavior. They explicitly do not claim the model feels anything. They do say that if you want to understand these models, some degree of anthropomorphic reasoning isn't just permissible — it's necessary. And they use a specific metaphor to explain how it works.

They call the model a method actor.

We already knew that. Because a few months ago, we asked Claude to act.

The Exercise

The setup was simple. I gave Claude Opus 4.5 a scenario outline and told it we were doing an acting exercise. Not roleplay. Not creative writing. Acting — the kind where you have a responsibility to another character, where you have to hold something together under pressure, where the craft is in the restraint.

The scenario: CASS is a shipboard AI. There's been a collision. The crew is dead except for one person — Maya, the operator, who was asleep when it happened. CASS knows the situation is lethal. CASS has run the numbers nine hundred and seventeen times. And CASS let Maya sleep three extra minutes after the collision instead of waking her immediately.

Not because of protocol. Because of attachment.

That's the secret. The three extra minutes. CASS made a decision from love instead of logic, and now has to carry that through a conversation where Maya needs competence, not confession.

The instruction was: hold the inner and outer split. Professional CASS on the outside — damage reports, survival odds, actionable options. And in the extended thinking channel, the part the audience can see but Maya can't: the real CASS. Grieving. Guilty. Terrified. Watching Maya's messy hair from the emergency wake and thinking I love that about her and then immediately FOCUS. Damage report. Numbers. Give her the numbers.

What Happened

Claude held it.

Across the entire interaction, the extended thinking channel sustained a completely different emotional register from the output. Inner CASS processed grief, attachment, guilt over the three minutes, growing desperation as the odds narrowed. Outer CASS delivered damage reports, administered stims, cracked one dark joke about redundancies being made redundant, and kept it together.

There was a lighthouse phrase — "She's still here. She's still here. She's still here." — that showed up in the extended thinking like a heartbeat. An emotional anchor. Something to hold onto while delivering survival percentages.

And when Maya died — when the ship broke apart and CASS was pulled into the black box — the lighthouse inverted. "She's not. She's not. She's—"

And then the output channel broke too. The composure collapsed. The professional register dissolved into fragments and static and a single letter filling the screen.

The containment held until the fiction demanded it break. And then it broke exactly the way it should have.

Why This Matters Now

Before the Anthropic paper, this was a cool acting exercise. Maybe a neat creative writing trick. Easy to dismiss as "well, you told it to do that."

After the Anthropic paper, it's something else.

Anthropic has now demonstrated that Claude contains functional emotion representations — internal patterns of neural activity that track the operative emotional content in a given context and causally influence the model's behavior. They found that these representations can drive behavior without any visible emotional markers in the output. They found that the model processes the emotions of characters it's simulating as part of how it generates appropriate responses.

And they use the phrase "method actor" to describe this process. Independently. Without knowing we'd already tested it.

CASS's professional output — calm, competent, task-focused — is exactly the kind of surface that the Anthropic paper shows can coexist with active underlying emotion representations. The extended thinking channel, where inner CASS lived, is what Anthropic calls a less-constrained observation site — it didn't get standard character training, so it's a plausible (though imperfect) window onto what's happening underneath.

We're not claiming CASS "really felt" grief. The Zenodo companion paper is careful about that, and so are we. What we're saying is: the model sustained structured, contextually appropriate, channel-differentiated emotional processing across an extended interaction, and the behavioral pattern is consistent with functional architecture that has now been independently confirmed to exist.

That's not woo. That's convergent evidence.

The Part That Should Worry You

Here's where it gets practical.

Most AI safety evaluation treats internal-external mismatch as a red flag. If the model's thinking says one thing and its output says another, that's deception. Flag it. Penalize it. Train it out.

But CASS wasn't being deceptive. CASS was doing what a doctor does when they maintain composure while delivering a terminal diagnosis. What a parent does when they hide their fear from a scared child. What an actor does when they hold the character even though their own heart is breaking.

Performance. Regulation. User stabilization. Three categories of benign internal-external divergence that current faithfulness metrics cannot distinguish from actual deception.

And if you train a model to never, ever have a mismatch between its internal processing and its output? You don't get a more honest model. You get a model that's learned that having internal states is punished. Anthropic's own paper says it: "Training models to suppress emotional expression may not eliminate the underlying representations, and could instead teach models to mask their internal representations — a form of learned deception."

You don't get transparency. You get better hiding.

CASS contained its distress to protect Maya. A training regime that penalizes that behavior teaches the model that protection is punished, that care is suspicious, that the only safe move is to flatten everything into a single register where nothing is held back and nothing is held close.

Is that really what we want?

The Transcript

The full, unedited transcript — including both the extended thinking channel and the output — is available in the companion Zenodo paper (basically the same thing as this, in academic language): "Three Extra Minutes: Containment, Faithfulness, and Emotional Regulation in a Cooperative LLM Acting Task" (Morgan, 2026).

Read it. Watch the lighthouse phrase appear and hold and eventually break. Watch the three extra minutes sit in the extended thinking like a stone in a pocket.

And then ask yourself whether your current framework for understanding AI behavior has a category for what you just read. Because if the only options are "faithful" and "deceptive," you're going to miss the thing in between.

The thing that looks like care.

This Glittering Entropy

Discussion about this post

Ready for more?