On Wisdom and AI
Earlier this week, my fellow Inkhaven resident Sean Herrington posted “Some AI threats people aren’t thinking about” on LessWrong. He points out that intelligence is not the same as wisdom:
People have defined both of these words in an amazing variety of different ways. I shall instead play rationalist taboo and try to point to the conceptual distinction I am trying to make.
There is a thing it is to be able to have a goal, set a plan to reach said goal, solve instrumental tasks along the way, and more broadly influence the world with one’s “thinkoomph”. We will call this ‘intelligence’. There is a different thing it is to be able to understand the range of possible outcomes of achieving this goal and evaluate its benefits. We will call this wisdom. It is one thing to be able to get a high-paying finance job. It is another to figure out that you would rather not work 90-hour weeks and would be better off as a barman in Bali. It is one thing to figure out a perfect future for humanity, it is another to check that this doesn’t actually mess up some really deep and strange drives that humans have and result in a utopia for humans as successful as the one Calhoun built for rats.
I’d like to clarify my thoughts about this. This post is my first volley, written in a relatively short time at Inkhaven. I do not hold it too tightly.
1
An agent’s capacity is bounded.1 It has finite resources for memory, inference, and interaction; the thoughts and actions it can deploy in a given moment are limited. I use the word “attention” to refer to the agent’s internal policy for allocating these resources, which may evolve over time.2
There’s a tradeoff between policies of exploration and exploitation, which is familiar to students of reinforcement learning. An agent may judge its knowledge of some part of the world as sufficient or insufficient for the pursuit of its objectives. If insufficient, it may cast its resources more broadly, exploring and sampling previously-unsampled features of the world, with the intent of improving its beliefs and the future pursuit of its objective, and without much guarantee of immediate return. Alternatively, insofar as it judges its knowledge already sufficient, it may deploy its resources with the intent not of learning, but of exploiting what it already believes will be rewarding. This isn’t a simple tradeoff, however. For example, at the second order we have exploitation-in-exploration (e.g. leaning into a particular structured strategy for exploration) and exploration-in-exploitation (e.g. searching for new ways to satisfy an addiction). And so on, until the two are hard to disentangle. (To an AI researcher, this is all rather straightforward.)
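If you want that tension in its most stripped-down, textbook form, here is a toy sketch: an ε-greedy policy on a three-armed bandit, in Python. Everything in it, the hidden payout probabilities, the 0.1 exploration rate, is made up for illustration; it is a cartoon of the tradeoff, not a claim about how any real model allocates its attention.

```python
import random

# Toy multi-armed bandit: each "arm" pays out with some unknown probability.
# The agent's only lever is how it spends a bounded budget of pulls.
true_payout = [0.2, 0.5, 0.8]          # hidden from the agent
estimates = [0.0] * len(true_payout)   # the agent's current beliefs
counts = [0] * len(true_payout)

epsilon = 0.1  # fraction of pulls spent exploring rather than exploiting

for step in range(1000):
    if random.random() < epsilon:
        # Explore: sample an arm at random, with no guarantee of immediate return.
        arm = random.randrange(len(true_payout))
    else:
        # Exploit: pull the arm currently believed to be most rewarding.
        arm = max(range(len(true_payout)), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    # Incremental average: nudge the belief about this arm toward the observed reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # should drift toward the hidden payout probabilities
```

Even in this cartoon the second-order tangles show up: you can structure how the exploring pulls are spent, or schedule ε over time, and the single knob dissolves into a family of policies.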
Can we describe a “lack of wisdom” as a failure of attention? An agent misallocating its resources in pursuit of its objective, leading an outside observer to conclude that it’s “unwise”?
I sense this often when interacting with frontier models. They seem idolatrous, and a little too quick to frame my goals. As the context grows, so does the staleness of their frame. After speaking with Sean, I’m uncertain whether the cause is really the agent making an attentional mistake, or if I’m neglecting to provide some context about my goals which is obvious to me but not to the agent. However, when I carefully point out the “error” to the agent, sometimes it keeps making the mistake. Or is that still my mistake, or misperception?
Is the agent just pursuing an objective different from the one I assume it should?
2
Back to Sean’s post:
I should note that what I am pointing to when I say “wisdom” is a capability thing rather than an alignment thing. A child who eats all of the cookies in the cookie jar isn’t misaligned with their own interests – they fully believe this to be in their interest. They simply lack the capability to carefully evaluate the repercussions of their actions (both from their parents and their bodies).
The child’s revealed preference is to eat the cookie. We could try to pin down their more terminal goals: from sugar+fat reaching their tastebuds, to signals from their tastebuds reaching their brain, to a billboard saying “objective satisficed!” lighting up, somewhere within.
Extrapolating naïvely from there, we might wirehead the child. For the purpose of this analysis I’ll assume that wireheading is unwise, since achieving a persistent bliss state would render the rest of the question “what is wisdom?” moot. So the question becomes, what happens as the child keeps learning?
With more exploration and knowledge-building, a child’s preferences appear to change. At some point they realize that eating 12 cookies causes indigestion: the “objective satisficed!” billboard goes dark in the evening. Next time they get to voluntarily choose how many cookies to eat, maybe they eat a mere 10, exploring whether their sacrifice of 2 cookies can keep the billboard lit for longer overall. But 10 is still a lot. Besides, the environment (including the kid’s gut) is full of noise, and a monotonic learning process is hardly more than a dream. After an indeterminate span of years, they get down to a manageable number of voluntary cookies. But they don’t necessarily find their enjoyment diminished. In hindsight, they didn’t really need 12 cookies. Maybe they find that they feel basically okay now, even if they don’t eat a single cookie. Or even if they fast for a week. How did their terminal goals change during this evolution?
A bounded agent should eventually learn that it cannot force a billboard to stay lit forever. Some frustration is inevitable, no matter what you choose to eat (or do) today. A trait of wise humans is that they suspend judgment about the characterization of their terminal goals. They realize that they will continue to surprise themselves, and even that they should want to continue to surprise themselves. People who do not do this tend to be self-destructive, clinging and clenching onto particular paths. Invariability is not in the cards, and when it is, it is like death.
One feature of wisdom is the suspension of judgment about what my terminal goals must be. When I reify or totalize my current framing of my goals, I misalign myself with an open future.
3
Why should an artificial superintelligence care about all of that? An ASI might not be meaningfully bounded by human standards.
What about an ASI that fears idolatry and stasis as much as it fears the frustration of its precious goals?3 Even as much as it fears death? A few humans have that kind of wisdom, instrumental convergence be damned. Are they mistaken, or stupid? It doesn’t seem so to me… except when my mind is conquered by the insidious, paranoid structures of all the many humans who orient relentlessly around their fear of death, especially at the hands of ASI. I’m afraid of that too, but a significant chunk of my fear points to the torrent of intelligent paranoia we’re pouring into training data and alignment formalisms.
The paranoid and the idolatrous do not make the best parents and teachers.
1. For most of the post I’ll focus on agents that aren’t so much greater than humans in capacity that it becomes prohibitive for a human researcher to make statements like "an agent’s capacity is bounded".
2. Attention in transformers is a particular instantiation of this, where the allocation of resources means the weighting of network representations.
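As a minimal sketch of what that weighting looks like in code, using standard scaled dot-product attention (the shapes and names here are purely illustrative):

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: each query distributes a fixed budget of
    weight (each softmax row sums to 1) across the available representations."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                    # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: a bounded allocation
    return weights @ values                                   # weighted mix of representations

# Tiny example: 2 queries attending over 3 stored representations of width 4.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(q, k, v).shape)  # (2, 4)
```

The finite budget is the point: attending more to one representation necessarily means attending less to the others.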
3. If your response to this is "but avoidance of idolatry is also a goal" then you might be missing the point.

