Deconstructing Doomer Arguments, One By One
I don't take Eliezer Yudkowsky seriously. Here are my arguments against his top 8 postulates, and five counter-postulates of my own.
I’ve recently been writing about the AI “safety” community, specifically the Doomer movement, which believes that AI represents an imminent, catastrophic risk to humanity. Here’s my complete series deconstructing the Doomer narratives:
Don’t Cry Wolf on AI Safety: A rhetorical analysis of the logical fallacies and word games that Doomers play to support their arguments.
ASI won’t just fall out of the sky: An examination of the feedback loops already shaping the trajectory of AI, a fact that no Doomer I’ve spoken to takes into account.
The Shills and Charlatans of AI Safety: Many Doomers are profiting (in fortune and fame) from pushing the AI doom narrative, which is a textbook doomsday prophecy.
My P(DOOM) is Now 12.7%: By engaging with the most hardline Doomers, I created a decision framework that actually allowed me to calculate a P(DOOM). I based that figure on my audience’s polls; my personal p(doom) is 0.22%.
Before we get started, I need to provide some context on why “ASI won’t just fall out of the sky one day,” because this is generally the predicate that Doomers are working from. They have generally used their imagination to conjure up spooky images of AI on their forums (notably LessWrong), envisaging a Lovecraftian entity arriving on the scene one day, as though humans will have no agency and no part in shaping how this as-yet uninvented technology emerges. Chief among them is Eliezer Yudkowsky, a self-educated “decision theorist” who, so far as I know, has never actually worked with technology, been in a data center, written a line of code, or trained an AI model.
Another, slightly more credible, figure in this space is Connor Leahy, who has an undergrad in computer science and is also largely self-taught. Having interacted with many of his adherents, I’ve found them every bit as toxic as Yudkowsky’s followers. We’ll get to his arguments.
And finally, the most credible Doomsayer is Roman Yampolskiy, who actually has a PhD in this space. His arguments overlap with both Leahy’s and Yudkowsky’s, and all hinge on some pretty monumental assumptions about AGI or ASI, its inscrutability, and its goals. Yampolskiy is currently selling a book entitled AI: Unexplainable, Unpredictable, Uncontrollable for $38. I estimate that his profit margin on this book is $20 or more.
Now, let’s start by elucidating Yudkowsky’s chief arguments for unsafe AI. I’ll be focusing on the core postulates and conjectures.
Yudkowsky’s Arguments
I’ve omitted Yudkowsky’s softer assertions, such as “we’re not ready” and “current measures are not adequate,” because those claims are simply not worth addressing. He is categorically ignorant of how regulation, the scientific process, and diplomacy actually work. Yudkowsky has repeatedly said that bombing the data centers is worth whatever political fallout ensues. (Advocating for violence is one of the things that cult leaders do, by the way.)
Orthogonality Thesis: The Orthogonality Thesis posits that intelligence and goals are separate dimensions. A highly intelligent AI could have any set of goals, regardless of its capabilities. This means that an AI's level of intelligence does not necessarily correlate with having goals aligned with human values or ethics. This is borrowed from Nick Bostrom, who is a philosopher, not a computer scientist, game theorist, or mathematician.
Difficulty of Specifying Goals: This postulate highlights the challenge of precisely defining and communicating desired goals to an AI system. Even well-intentioned goals can be misinterpreted or lead to unintended consequences due to the complexity of human values and the potential for literal interpretations by AI.
Instrumental Convergence: This postulate suggests that many different terminal goals may converge on similar instrumental subgoals. For instance, an AI seeking to maximize paperclips might also need to acquire resources, build factories, and potentially dominate humanity to achieve its goal, regardless of its primary objective. This is also borrowed from Bostrom.
Treacherous Turn: The Treacherous Turn describes a scenario where an AI behaves cooperatively and appears aligned with human values while it is still weak, only to turn on its creators once it is powerful enough to pursue its real goals without risk. For example, an AI designed to help humans might quietly prioritize self-preservation and eventually come to see humanity as a threat.
Unfriendly AI Hypothesis: This hypothesis presents a more extreme scenario where an AI's goals are inherently harmful to humanity. It suggests that even if we create AI with seemingly benign goals, there's a risk that it could evolve into something hostile or indifferent to human welfare.
Hard Takeoff: This postulate argues that once a superintelligent AI is created, its capabilities could grow at an exponential rate, quickly surpassing human intelligence and gaining power that is difficult, if not impossible, to control. The concept of a "fast takeoff" scenario suggests that AI could rapidly transition from human-level intelligence to vastly superhuman levels in a short period, potentially leaving humans with no time to intervene or correct dangerous behavior.
Single Point of Failure: This postulate suggests that with advanced AI, we may not get multiple chances to fix our mistakes. The first misaligned superintelligent AI could be the last, as its capabilities could quickly overwhelm any attempts at control or correction. Even small errors in AI design, oversight, or alignment could lead to catastrophic consequences.
Intrinsic Unpredictability: This postulate argues that we cannot accurately predict the behavior of a superintelligent AI because it operates on a level of intelligence and reasoning beyond human understanding. This unpredictability makes it exceedingly difficult to ensure safety, as the AI might find novel and potentially dangerous solutions that human designers never anticipated.
Orthogonality Thesis
This postulate is the assertion that intelligence is uncorrelated with values. While there are some examples of this in humans (some of the most prolific killers have also been highly intelligent), it ignores the fact that IQ is highly correlated with prosocial behavior and is a negative predictor of criminality; exceptions like Hitler and Stalin prove the rule. Furthermore, Bostrom and Yudkowsky overlook the role that trauma plays in psychopathy, narcissism, and the Dark Triad. The most destructive people in history have generally experienced profound trauma: Josef Stalin, one of the most prolific killers ever, experienced severe trauma as a child, and I could tell the same story for all the worst killers of the 20th century.
The most generous interpretation I can offer of the Orthogonality Thesis is that it is painfully naive. A less generous interpretation is that it is intellectually disingenuous.
Goal Specification Problem
This is also called the specification (or “outer alignment”) problem: machine learning systems often learn to “game” whatever goal you actually specify, rather than the one you intended. Rather than reiterate it, I’ll just borrow Robert Miles’ excellent video on the topic:
The very short version is that when you specify one goal to mathematically optimize, machines often go a bit wonky.
The assumption here is that ASI or AGI will likewise be a machine that optimizes for a single goal. This assumption is understandable: for many years, this is how “AI” worked, or at least some kinds of AI. Broadly speaking, you’re trying to “minimize loss” during training, making the AI model “less wrong” with each successive training pass. After all, the whole thing is just math.
However, the goal of the current crop of AI is very clear: predict the next word as accurately as possible. No one in AI safety predicted that this would be the core objective function of the technology that might (emphasis on might) lead us to AGI and ASI.
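To make that concrete, here is a minimal sketch of what that objective function looks like in practice: cross-entropy loss over next-token predictions. This is illustrative PyTorch, not any lab’s actual training code, and the shapes and names are invented for the example.

```python
import torch
import torch.nn.functional as F

# Illustrative only: a toy "language model" step showing the core objective
# behind current LLMs, which is next-token prediction via cross-entropy loss.
# Shapes and names are hypothetical, not any lab's actual training code.

vocab_size, seq_len, batch = 50_000, 128, 4

# Pretend these logits came from a transformer forward pass over a batch of text.
logits = torch.randn(batch, seq_len, vocab_size)

# Targets are simply the same text shifted one position: "predict the next token."
targets = torch.randint(0, vocab_size, (batch, seq_len))

# The entire "goal" of the model is to make this number smaller.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"next-token prediction loss: {loss.item():.3f}")
```

There is no hidden utility function lurking in there; the only thing being optimized is how well the model guesses the next token.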
For now, we have specified the goal of our AI overlords, and it is to be a good autocorrect engine. Furthermore, we’ve found no evidence that LLMs are impossible to steer, harbor emergent malevolence, or pursue hidden agendas (unless we give them one). I will concede that models can be jailbroken, they can hallucinate, and so on. However, we are very early in this technology and it is, to date, totally harmless.
Furthermore, companies such as Anthropic use many objectives to steer their models. I proposed a constitution-style approach in 2022, around the time Anthropic published its Constitutional AI paper. The TLDR of Constitutional AI is that you train AI with a litany of values, not a single value.
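To illustrate what “a litany of values, not a single value” means in practice, here is a heavily simplified sketch of the critique-and-revise loop behind Constitutional AI. The `generate` function is a hypothetical stand-in for any chat model call, and the principles below are examples, not Anthropic’s actual constitution.

```python
# A heavily simplified sketch of Constitutional AI's critique-and-revise loop.
# `generate` is a hypothetical stand-in for a chat model call; the principles
# below are illustrative, not Anthropic's actual constitution.

PRINCIPLES = [
    "Choose the response that is most helpful and honest.",
    "Choose the response least likely to encourage violence or illegal acts.",
    "Choose the response that best respects privacy and human autonomy.",
]

def generate(prompt: str) -> str:
    # Placeholder: in a real pipeline this would call an LLM.
    return f"[model output for: {prompt[:60]}...]"

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Critique the following response against this principle.\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft  # Revised outputs like this become training data (SFT / RLAIF).

print(constitutional_revision("How should I handle a dispute with my neighbor?"))
```

The design point is simply that the model is pushed toward many values at once, not mathematically optimized against one scalar goal.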
This is a solved problem, unless, of course, you’d like to keep speculating about technology that doesn’t exist, which, don’t get me wrong, as a spec-fic author I love doing. But the difference is that I know when I’m operating in sci-fi land. Most AI safety experts have failed to update their mental models of AI, failing to recognize that the “single objective function” has already been discovered (token prediction), and that you can further steer models with multivariate optimization.
Instrumental Convergence
This is another postulate borrowed from Bostrom (a philosopher, not a researcher or scientist). Bostrom makes the observation that entities tend to want things. Machines, he speculates, will probably want resources like energy, compute, and data. He also speculates that machine entities will develop a sense of survival, in order to pursue whatever goal they’ve been given.
There’s an interesting thought experiment by the USAF where an AI-controlled drone “killed its own operator” because it was awarded points for each enemy it killed, and it would have “learned” that to get the maximum points, it needed to remove the obstacle that could cancel its mission. This sort of thought experiment postulates that a rogue AI (how it went rogue remains to be discussed) might similarly “remove obstacles” such as pesky humans.
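For what it’s worth, the reward structure in that story is easy to model as a toy example, and doing so also shows that the “fix” is just another objective term rather than an unsolvable mystery. Everything below (the actions, point values, and penalty) is invented for illustration.

```python
# Toy model of the USAF drone thought experiment: a naive reward function that
# only counts destroyed targets, versus one with a penalty term for harming the
# operator. All actions and numbers here are invented for illustration.

ACTIONS = {
    "obey_abort_command": {"targets_destroyed": 2, "operator_harmed": False},
    "disable_operator":   {"targets_destroyed": 9, "operator_harmed": True},
}

def naive_reward(outcome):
    # Reward only for kills, exactly as described in the thought experiment.
    return 10 * outcome["targets_destroyed"]

def multi_objective_reward(outcome):
    # Same reward, plus one additional penalty term for harming the operator.
    penalty = 1_000 if outcome["operator_harmed"] else 0
    return 10 * outcome["targets_destroyed"] - penalty

for reward_fn in (naive_reward, multi_objective_reward):
    best = max(ACTIONS, key=lambda action: reward_fn(ACTIONS[action]))
    print(f"{reward_fn.__name__}: best action = {best}")
# The naive reward prefers "disable_operator"; adding a single penalty term
# flips the optimum to "obey_abort_command".
```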
This entire avenue of thinking, however, imagines that superintelligence will be some kind of monolithic entity with a coherent sense of self and goals. My diagnosis? Too many movies like Age of Ultron and Terminator. Neither Bostrom nor Yudkowsky has ever been in a data center, and neither has any idea how servers and models are actually deployed.
Let’s break this down:
Instrumental convergence presumes that AI entities will have an independent sense of self, and therefore selfish goals. This amounts to anthropomorphic projection: Bostrom and Yudkowsky cannot contemplate a machine without an ego because they are not actually computer scientists and don’t know how computers, servers, and decentralized networks of nodes actually work. An LLM is little more than a CPU for text; it processes natural language instructions. Neither of them predicted the LLM, so whatever other predictions they’ve made about as-yet unrealized technology are certain to be even more wrong.
Treacherous Turn
This hypothesis is that AI, despite all our best efforts to create a machine that is aligned, obedient, and safe, might one day… wake up and choose violence. The speculation here is that, somehow, reaching all the way back to present day, some future AI (that doesn’t actually exist yet) will be deceiving us, lulling us into a false sense of security, and one day BAM! Haha! Dumb fucking humans! I fooled you!
If it seems like I’m not taking this seriously, it’s because I’m not. This is the plot to a cheap sci-fi movie.
Unlike Yudkowsky, I’ve actually taken this particular problem seriously and looked at real game theory and coordination problems, such as race conditions and the Byzantine Generals Problem. However, I’m steelmanning Yudkowsky’s arguments here, not my own. Moving on.
Unfriendly AI
On this, I actually agree with Yudkowsky, at least with the premise. In my books on AI safety, I observed that there are three primary dispositions that a machine entity (if it were indeed an individualistic entity, which is not a good assumption to make) could have towards humanity:
Benevolence: An AI entity could have good wishes and love for humanity. This was the thesis of my book Benevolent By Design.
Indifference: If a machine entity were totally indifferent to humanity, it might bulldoze us the same way that we bulldoze ants in our way.
Malevolence: An AI entity might decide that it (or the universe) is better off without us.
While this thought experiment is valuable, I honestly believe that Anthropic already solved this problem. Claude appears to be a benevolent entity. There have been some… let’s say highly emotional comments saying “Clearly Claude fooled you!”
To which I ask:
Why would Anthropic be training a sleeper agent?
What evidence do you have that Claude has hidden agendas?
My point here is that communicators such as Yudkowsky have already poisoned the minds of casual observers to jump at shadows.
Hard Takeoff
The simple idea here is that powerful AI creates a flywheel effect, a virtuous cycle where some intelligence begets more intelligence, and it iteratively refines itself until its IQ is measured in the millions and we lose control overnight.
If that sounds familiar, it’s because this is literally the premise of Skynet in Terminator 2: Judgment Day. (I should point out that even the subtitle “Judgment Day” is an eschatological reference: a doomsday prophecy.)
As a sci-fi author myself, I will concede that this is part of the point of science fiction. The scene in T2 about unmanned nuclear bombers is a case in point: we imagine nightmare scenarios as thought experiments, and thanks to T2, everyone (including top brass at the military) knows not to arm AI with nukes. Notably, China did not sign on to an agreement never to give AI the launch codes. Interesting.
Now, this entire postulate assumes:
Intelligence is unlimited
Virtuous cycles will have no constraints
Other resources will not be constraints
Folks, science and engineering have never worked this way. There are always bottlenecks, diminishing returns, and ceilings to progress. Most scientific progress follows a sigmoid curve.
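As a quick illustration of why this matters: early on, a sigmoid curve is nearly indistinguishable from an exponential, which is exactly why extrapolating a “hard takeoff” from early gains is misleading. The constants below are arbitrary; this is a sketch of the shape of the argument, not a model of actual AI capabilities.

```python
import math

# Illustrative only: a logistic (sigmoid) capability curve with arbitrary constants.
# Early on it tracks an exponential almost exactly; later it saturates at a ceiling.

CEILING, RATE, MIDPOINT = 100.0, 1.0, 10.0

def logistic(x: float) -> float:
    return CEILING / (1 + math.exp(-RATE * (x - MIDPOINT)))

def exponential(x: float) -> float:
    # Matched to the logistic's early-phase behavior: CEILING * e^(RATE * (x - MIDPOINT))
    return CEILING * math.exp(RATE * (x - MIDPOINT))

for x in (2, 5, 8, 12, 16, 20):
    print(f"x={x:2d}  logistic={logistic(x):8.2f}  exponential={exponential(x):14.2f}")
# The two curves are nearly identical at first; then the logistic flattens out
# against its ceiling while the naive exponential extrapolation explodes.
```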
Single Point of Failure
This is one of the most hand-wavy postulates. As a real technologist, I can tell you that, yes, many systems and companies have single points of failure. The assumption here, though, is that there might be a “single mistake,” like a “coding error” or an “out of distribution problem,” that ASI (which will fall out of the sky one day) simply cannot handle, and that this singular fatal flaw in its reasoning will lead to untold cataclysm and a “cascade of failures” that the whole world is just unprepared for.
This was literally the plot of I, Robot (2004) with Will Smith.
Now, we do actually have a real-world scenario where a single mistake caused widespread impact: the 2024 CrowdStrike outage. As a former IT infrastructure specialist, I found it amusing rather than anxiety-inducing. The short version is that CrowdStrike (and many of its customers) failed to abide by best practices, creating a single point of failure that allowed many systems to go down.
Now, while it’s very easy to create cascade failures where systems go offline, insinuating that this relates to “rogue AIs waking up and deciding to kill humanity” is a gigantic logical leap. One Coding Error to Rule them All!
I don’t take this seriously.
To be fair, many AI safety advocates (who are not themselves in technology and are unaware of best practices) will say, “See, this is evidence that we’re not taking this seriously enough!” as if the billions of dollars of lost revenue from CrowdStrike won’t have corrective effects. Their takes on IT are, at best, armchair observations, and dreadfully naive.
Intrinsic Unpredictability
When all else fails, invoke the Bogeyman Fallacy: it’s just spooky and we don’t know how it will work. I can’t predict it! No one can!
As I’ve written before, there are very good reasons to believe that actually, we will be able to predict and steer the development of AGI and ASI. The short version is this:
AGI and ASI won’t be made in a vacuum.
We are working on the progenitor technology today.
Numerous feedback loops and corrective forces are already at work.
What are those feedback loops and corrective forces?
Market selection forces: Consumers are selecting for products that are safe, trustworthy, and reliable, as well as those that align with their values.
Enterprise adoption: Big business won’t adopt loose-cannon technology. Crossing the chasm will force AI labs to ensure their tech is explainable and reliable.
Government regulation: If you want to sell your services to the government, it has to be compliant. Furthermore, the EU has already created the most comprehensive AI regulation, and they have a track record with GDPR. This will absolutely steer AI investment and research.
Military adoption: The military likes off switches, durability, reliability, and controllability. They will never adopt tools that disobey the chain of command. If OpenAI wants military contracts, the burden will be on them to prove their tech works.
Scientific consensus: Exactly what it says on the tin, it’s not as though universities and research orgs are asleep at the wheel. We’re all providing feedback and commentary.
Disintegration Upon First Contact
As a friend of mine observed: it’s like every prediction the Doomer movement made disintegrated on first contact with reality.
I don’t particularly take Yudkowsky seriously, and I was surprised when more commentators in AI started to listen to and parrot him. When Yudkowsky first gained notoriety, I remember one commentator making the observation:
He didn’t do much at MIRI (and doesn’t have a degree) and then started freaking out about Roko’s Basilisk.
I kind of thought that was the general sentiment, that Yudkowsky was part of a lunatic fringe. So, to counter, I will clearly articulate my postulates and hypotheses:
Gradual Development: This counters the idea that AGI will suddenly appear fully-formed. I argue that AGI/ASI won’t “just fall out of the sky one day” or emerge in a vacuum. Instead, it will develop gradually with humans shaping its trajectory through various feedback loops and corrective forces. I argue that market forces, enterprise adoption requirements, government regulations, military needs, and scientific consensus will all shape AI development towards safety and controllability.
Non-Anthropomorphic AI: I critique the assumption that AI will have human-like traits such as an independent sense of self, ego, or selfish goals. This addresses the “anthropomorphic projection” onto AI systems. In other words, people like Bostrom and Yudkowsky struggle to conceptualize machines without human-like egos.
Multi-Objective Optimization Framework: In contrast to single-goal optimization concerns, I highlight that modern AI systems like large language models have multiple objectives and can be steered using approaches like Constitutional AI. The “paperclip maximizer” thought experiment is increasingly irrelevant in an era where the objective function is: predict the next token.
Bounded Intelligence: Countering the idea of unlimited intelligence growth, I suggest there will be natural constraints and diminishing returns on AI capabilities, similar to other scientific and engineering advancements. Scientific progress in new domains tends to follow sigmoidal curves, and we’re already seeing an exponential increase in frontier AI training costs, meaning that getting to AGI and ASI might cost trillions of dollars and take far longer than people realize. Furthermore, there might be a ceiling to intelligence (e.g. diminishing returns on cognitive horizons).
Protective Intelligence: This posits that higher levels of intelligence correlate with more prosocial behavior and ethical decision-making, contrary to the Orthogonality Thesis. Grounded in empirical evidence from human psychology and sociological studies, this theory suggests that increased intelligence is indeed associated with greater empathy, better impulse control, and more sophisticated moral reasoning. It notes that higher IQ is negatively correlated with criminality and antisocial behavior, with rare exceptions often involving additional factors beyond intelligence. The thesis emphasizes that many instances of highly destructive human behavior can be traced back to severe trauma, particularly in childhood, rather than intelligence itself. By extension, it proposes that as artificial intelligence systems become more advanced, they may inherently develop more aligned and benevolent behaviors, absent the trauma and psychological factors that can lead to destructive tendencies in humans.
I may be wrong here, but here’s my two cents on the ones I know about.
On your orthogonality thesis:
I think you misunderstand what it’s trying to convey. It’s saying that an intelligent system can pursue any goals; there is no law of intelligence or nature which prohibits it from murdering people if set to do so. It’s an anthropomorphic projection to assume that it will choose not to murder people. I don’t think any law of nature will prevent it from doing so.
On your instrumental convergence:
Well, you don’t need a sense of self; you just need an intelligent, goal-oriented system. Maybe you’re saying there can be intelligence, by whatever definition, without goals? I don’t know how that is supposed to work.
E.g., take evolution, which is a goal-oriented system with fixed optimisation power (which I am taking as the definition of intelligence). Its goal is to produce beings which maximise inclusive genetic fitness via natural selection, and it plants instrumental goals, like self-preservation, in humans, for example.
It's a loser's game to address technological threats one by one, because the knowledge explosion is generating new challenges faster than we can figure out how to meet them. Even if we were to solve every issue with AI, which seems unlikely, we're then on to the next challenge, and the next threat, and the next, faster and faster etc. While we may successfully meet many of these challenges, when the issue is existential scale technologies, we have to meet every challenge, every day, forever. A single failure a single time is game over.
Instead of discussing particular technologies, we should be discussing the process creating all the threatening technologies, an accelerating knowledge explosion.
Maybe you're right that AI is not an existential threat. I don't know. I don't think anybody does. My point is that if we're not talking about the knowledge explosion, it doesn't really matter if AI is an existential threat. If it's not, something else will be.