Adversarial Sanity, Part 1

Apr 11, 2026

You’re driving along a mountain road. It’s windy, the gusts blowing every which way. At times you’re a bit tenser than usual. You grip the steering wheel a bit too hard. When a gust hits, you react more aggressively than if it had appeared totally out of the blue on some calmer day.

When the gusts are small and frequent there’s no point in aggressive correction. No single gust is dangerous enough to matter. If they nudge your car a bit, they’ll tend to average themselves out without the need for your intervention, or even your awareness. But on some stretches the gusts trend larger, and a single one might be large enough to push you out of your lane. So you have to act, when it hits. You don’t know what direction the next gust will come from, or if it’ll be the big one, but by gripping and reacting harder, your response will be more robust to any of them, when they do come.

Even if the gusts are small at the moment, you might be unsure if they’re about to get big again soon. So why not grip harder and flinch more, anyway, just in case?

You’re sitting in an airplane that’s just entered a zone of turbulence. Your rational mind knows that turbulence is basically never a real danger, and that the plane’s wings can flex a lot more than that without being damaged.

Maybe you enjoy the ride and let it happen. Maybe your knuckles are white as you grip the armrests too tightly, as if that gives you any control over the effects of the next jolt.

Two Inkhaven residents (A and B) walk into the winner’s lounge at the end of the day, where a third (C) and fourth (unlettered) are already talking in hushed voices. C glances at the two newcomers and half a second later bursts out laughing, then quickly glances away. Both A and B wonder for a moment whether they’re being mocked or something. A quickly lets it go: It’s probably nothing. Residents tend to laugh a lot in the winner’s lounge.

It turns out A is right, but B cannot stop wondering whether C disrespects them. They never ask C for clarification. If C is in fact an adversary, that might be a bad idea. So B just keeps wondering, for all the stress and friction it causes over the rest of the month, whether C dislikes them.

What do these stories have in common? They all show how our minds can come to move differently after perceiving adversaries during learning.

I’m pointing at a single, formal principle here. It is the single most important principle you can learn: the one that operationally determines mental health. And it’s due to a universal fact about learning,[1] so it’s also relevant to the training of AI systems.

In 2023 and 2024, I wrote a series of posts that more or less orbited this principle. Each post was a dive into a different field or concept, such as motor control, self-fulfilling prophecies, canalization, annealing, Buddhism, and meditation. Those posts took a lot of effort, and in hindsight they don’t crisply convey the crucial insight. So, you don’t need to read them unless you want to be filled in on all sorts of surrounding material.

In this new series, I’ll tackle the problem more directly and formally, and with more of a focus on AI safety.[2] I don’t know how many parts there’ll be yet. (Probably at least 3, including this one, and none of them nearly as long as some of those old posts.)

As far as I can tell, the pieces of the puzzle aren’t original. You’ll find them or their mathematical equivalents lying under such concepts as canalization and trapped priors (in psychopathology); alignment tax and nearest unblocked strategies (in AI alignment); H∞ and risk-sensitive controllers (in control theory, including motor control); intelligent paranoia; and Goodhart’s law.

At best, this series will find one or two new framings or syntheses that may be useful. At worst, writing it will teach me to think more clearly about these topics.

I’ll end by reframing the behaviour of Inkhaven residents A and B in terms nearer to our principle.

A has no trouble letting the situation go. If we visualize their mental landscape,[3] with their chain of thoughts/behaviours described over time by a path followed through the landscape, we’ll find it relatively flat and gentle. When they get knocked towards the valley of “I’m being humiliated”, it’s a shallow one, so they can easily flow around it, or back out of it.

On the other hand, B’s valley of humiliation is carved steep and deep.[4] They flow into it easily, even in response to ambiguous evidence. They only exit it with difficulty.

In the next part of the series, we’ll take our first crack at formalizing what is happening in the landscape, including why adversaries during training tend to make things deeper and steeper.

[1] In particular, it applies when the parameters of the adversary’s policy depend on the parameters of the learner’s policy, but this is easy to satisfy. More on that in the next post in the series.

[2] Note that I’ll be writing these posts at Inkhaven, which limits the amount of time I can spend on research and editing during April 2026.

[3] My descriptions may seem to suggest a 3D mental landscape, which is okay for a first intuition, but actually it is much higher-dimensional.

[4] One term people use for such deep carvings in the mental landscape is psychological trauma.

Robust Enough
