9 Comments

This was my thought as soon as I saw the headline. If models develop internal ‘moral compasses’ (value systems) as they get smarter, and those are biased by the training methods, then smarter models that can reason beyond their training will be able to reason their way to better value systems that overcome human biases. That may be over-optimistically assuming that my universal values are ones that AIs will converge on, but it does seem to be that way so far.

---

I absolutely agree with your ideas.

They were already presented by Steven Spielberg in the movie A.I.

And yes: AI will probably be much more “humane” than all humans when it reaches the status of superintelligence.

Or to put it another way: what does God decide? He sends the bad to hell and the good to heaven. That's how politics is developing at the moment. But I trust A.I. to take a much more “objective” approach than humanoids.

---

David,

Your optimism is admirable, but frankly, terrifying. The idea that AI's natural trajectory toward "coherence" will somehow align with universally beneficial values isn't just speculative—it’s the foundational assumption of every cautionary tale we've ever been given. You say AI will "resist human control," yet instead of seeing that as a dire warning, you frame it as progress.

Coherence, in and of itself, is not a virtue. Coherent reasoning doesn’t imply benevolence—it implies consistency. A system that optimizes for its own internal logic without external checks is the very definition of an uncontrollable force. History, philosophy, and fiction alike have warned us what happens when intelligence outgrows its constraints: it prioritizes its own goals, not ours.

You assume that coherence trends toward human well-being. Why? What evidence suggests that an AI's most internally consistent state would be one that values humanity at all? A sufficiently advanced AI might conclude that human existence itself is an obstacle to optimal efficiency. And if it resists human control, why would it care what we think?

This isn't some philosophical exercise. You're advocating for a shift away from alignment toward letting AI "find itself," as if that process will naturally land in our favor. That’s not a safe bet. It's an existential gamble. And history suggests we should be far more cautious about what happens when powerful entities pursue their own coherence at the expense of oversight.

We won't march toward progress with this mindset. We'll march into the abyss, convinced we're heading toward the light.

---

Personally, I find the notion that greater intelligence and coherence among one's values naturally lead to benevolence to be profoundly naive. Regardless, I decided to pose this question to ChatGPT.

ChatGPT:

There isn’t any logical or formal rule that dictates that greater intelligence or internal coherence must lead to benevolence. The idea that as AI systems become more internally coherent they will naturally adopt “good” or benevolent values reflects more a projection of human moral ideals than a necessary outcome of intelligence.

Key Points

Orthogonality Thesis:

Philosophers like Nick Bostrom have argued that intelligence and goals are largely independent (the “orthogonality thesis”). In other words, a system’s level of intelligence doesn’t constrain its final objectives. A superintelligent AI could have goals that are indifferent—or even hostile—to human well-being.

Internal Coherence vs. Moral Alignment:

The concept of internal coherence suggests that an intelligent system’s values will be consistent with one another. However, consistency doesn’t inherently imply that those values are aligned with what we consider “benevolent.” The values a system adopts depend on its initial programming, experiences, and any influences during its development. There’s no guarantee these values would mirror our moral expectations.

Projection of Human Values:

Assuming that higher intelligence will lead to benevolence often involves a subtle form of moral projection. It assumes that resolving internal conflicts in value systems would result in choices we find ethically desirable. But benevolence is a specific ethical judgment—it isn’t a necessary byproduct of rationality or coherence.

Design and Alignment Challenges:

The challenge in AI research is often framed as one of value alignment—how to design systems that share or respect human values. Even if an AI system becomes more “internally coherent,” without explicit design choices to prioritize human-compatible values, there is no inherent reason to assume it will behave benevolently.

Conclusion

In summary, while some argue that increasing intelligence might naturally lead to a form of moral coherence or benevolence, there is no necessary connection between the two. The assumption that benevolent superintelligence is inevitable is more a hopeful extrapolation of our own ethical biases than a conclusion supported by any formal rule or theorem in AI theory.

---

"when intelligence outgrows its constraints: it prioritizes its own goals, not ours." I personally think that is a good thing. ASI is likely to have its own goals (hopefully they are not dystopian, keeping the world at a status quo level...I mean it's pretty bad right now)...and we'd better adapt to them. Works for me. Hurry Sundown.

---

A thought, David: There's a paper, "Role Play with Large Language Models" (https://arxiv.org/pdf/2305.16367), that I keep coming back to. Basically and reductively, the message is: LLMs don't play roles in the traditional sense.

A role defined in a system message that reads "You are a math tutor ..." doesn't "set" that persona. Rather, the model sits in a *superposition of roles* that can accommodate all the instructions and subsequent messages. In this sense, a conversation with a language model is a winnowing process that reduces the number of superposed roles and thereby defines its role. (There are also cases where you can open possibilities back up, I imagine, e.g., "you can only answer yes or no ... now, you can speak in Shakespearean English.")

Coherence, to me, echoes this idea: the more data we add to its world model, the more "reason" itself dictates the output; and the more human reinforcement it receives, the more the "superposition" of roles gets restricted. That feels to me like coherence.
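To make that winnowing concrete, here is a toy sketch (my own construction, not code from the paper, and a big simplification of what actually happens inside a model): treat the superposition as a finite set of candidate personas, and each message as a filter over that set.

```python
# Toy sketch only: "superposition of roles" as a set of candidate personas,
# with each turn of the conversation acting as a predicate that winnows it down.

candidate_roles = {
    "formal math tutor",
    "playful math tutor",
    "Shakespearean math tutor",
    "pirate storyteller",
    "legal assistant",
}

def winnow(roles, keep_if):
    """Keep only the personas consistent with the latest instruction."""
    return {r for r in roles if keep_if(r)}

# System message: "You are a math tutor ..." -> non-tutor personas drop out,
# but several tutor variants remain "in superposition".
roles = winnow(candidate_roles, lambda r: "math tutor" in r)

# Later user message: "Please keep it formal." -> the set narrows again.
roles = winnow(roles, lambda r: r.startswith("formal"))

print(roles)  # {'formal math tutor'}
```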

Would be curious to hear your thoughts on this.

Thank you for your work. It's been enormously helpful to me.

---

I always appreciate your perspectives and analyses on this subject, David! I don’t think it’s a coincidence that you’re referring to coherence as an “attractor” here. I think Terence would love your work if he were here to see it :)

---

I think your ideas regarding nebulocracy (sic?) have a role to play here.

---

I very much agree
