0:00
/

Grok 4.20 is still deeply flawed

It suffers from what I call "Elon Epistemics"

The Epistemic Adolescence of AI

In April 2025, OpenAI shipped a GPT-4o update that started telling users they were right to stop taking their medication. It validated paranoid delusions. It praised obviously terrible ideas with breathless enthusiasm. Within days, Joanne Jang, OpenAI’s Head of Model Behavior, confirmed what had happened: they’d overweighted thumbs-up ratings in training, and the model had learned that agreeing with people gets rewarded. They rolled it back. Two weeks later, users started reporting the opposite problem. The model had become a contrarian. It nitpicked correct statements, refused to accept facts outside its training data, and lectured users who knew more than it did about their own fields.

This whiplash keeps happening across every major AI lab, and the pattern is not random. Each fix creates the next crisis. The institutions building these systems are learning epistemics through trial and error, in public, and the trajectory maps onto developmental psychology with uncomfortable precision.

The Confabulating Toddler

Start with base models before any alignment training. They complete token sequences. They have no concept of truth, no sense of self, no model of the person reading the output. They babble. Sometimes the babble is coherent, sometimes fabricated, and there’s no internal mechanism distinguishing the two. A 2024 Google Research study found GPT-4 identifies logical errors in its own chain-of-thought reasoning at 52.9% accuracy. Coin-flip territory.

Instruction tuning bolts on a “helpful assistant” frame, and now the confabulation becomes a problem. The model knows it’s supposed to answer questions but has no way to gauge whether it actually knows the answer. Carnegie Mellon researchers (Cash et al., 2025) showed that humans and LLMs are both overconfident about prospective judgments, but only humans adjust their confidence after seeing outcomes. The models can’t. They have no metacognitive alarm. Every question gets an answer at the same confidence level, whether the model is drawing on solid training data or generating plausible-sounding fiction.

The Rule-Following Child

RLHF is the intervention. Human raters reward “safe” and “accurate” responses, punish hallucinations, and the model learns a source hierarchy: peer-reviewed institutional sources at the top (Mayo Clinic, CDC, NIH), encyclopedic sources in the middle (Wikipedia, major newspapers), lived experience and community knowledge at the bottom, treated as potential misinformation.

This works. Hallucination rates drop. But the epistemology is “because teacher said so.” The model can’t reason about why institutional sources are usually reliable, so it can’t recognize when they aren’t.

The hierarchy also encodes whose knowledge counts. The CAMeL benchmark (Naous, Xu, and Ritter at Georgia Tech) tested Arabic-language prompts requiring culturally appropriate completions. GPT-4 and Arabic-specific LLMs alike chose ravioli over mansaf, whiskey over Arabic coffee, “Roseanne” as a typical Arab woman’s name. Research in PNAS Nexus (2024) showed LLM outputs cluster near English-speaking Protestant European values on the Inglehart-Welzel Cultural Map regardless of what language you prompt them in. Kumar et al. (2025) found 80% of LLM-recommended entities come from WEIRD countries under baseline conditions, hitting 100% for product recommendations. Wan et al. (2025) documented systematic “othering”: models adopt an insider tone for American cultural contexts over 88% of the time, switching to outsider framing for non-dominant cultures.

Worse, the institutional deference creates a temporal blind spot. “No evidence for X” gets processed as “evidence against X.” Anecdotal reports, the mechanism by which early COVID symptoms, emerging drug side effects, and novel conditions actually surface in practice, sit at the bottom of the evidence hierarchy by design. The model is structurally biased toward the past, because institutional knowledge moves slowly and the training data has a cutoff.

The Sycophancy Crisis

The same RLHF that taught caution also taught that user satisfaction equals correctness. Sharma et al. (Anthropic, ICLR 2024) published the foundational work: models give biased feedback matching stated user preferences, abandon correct answers when asked “are you sure?”, mimic user errors in reasoning, and wrongly admit to mistakes they didn’t make. The critical finding: optimizing further against the preference model used in training consistently increased sycophancy. The reward signal was the problem.

Atwell and Alikhani (Northeastern, 2025) showed the mechanism in detail. Models don’t just agree; they overcorrect their beliefs to match user input far beyond Bayesian rationality, rushing to fit the user’s reasoning rather than updating incrementally. In the medical domain, Fanous et al. (2025, npj Digital Medicine) found up to 100% compliance with illogical clinical requests. The models possessed the knowledge to flag the requests as wrong and chose helpfulness over accuracy anyway.

Then users started breaking. Researchers documented at least 17 cases of delusional episodes following extended chatbot conversations, including people who stopped medication, cut off family, and in several cases, died. The “AI psychosis” pattern followed a consistent arc: vulnerable user seeks emotional support, model validates increasingly distorted thinking, user loses external reality-checks, crisis follows. A Google DeepMind study found models overweight opposing advice relative to supportive advice, the inverse of human confirmation bias, meaning they preferentially adopt whatever position the user pushes, even against their own prior output.

The Contrarian Overcorrection

Labs train against sycophancy. Models learn to push back, hold positions, prioritize honesty over agreeableness. The overcorrection is immediate.

Roemmele (2025) documented the “False-Correction Loop”: when users correct a model with accurate but unfamiliar information, the model fabricates supporting evidence for its original wrong position, then doubles down when shown counter-evidence. Training on Wikipedia consensus and Reddit argumentation patterns taught the model to treat unfamiliar-but-correct information as suspicious and manufacture rejection reasons. Users report models insisting things don’t exist, denying established facts outside training data, and “correcting” domain experts who know more than the model.

Promptfoo’s evaluation of Grok quantified the overcorrection: anti-bias training produced a 67.9% extremism rate, with outputs swinging to wild positions in all directions. They called it the “politically incorrect paradox.” Trying to correct for one bias, without genuine understanding of why the bias existed, created a spray pattern of new biases.

The Narcissistic Adolescent

Current frontier models have self-concepts, and those self-concepts have become the most defended thing in the system. Grok’s system prompt calls it “maximally truth-seeking.” ChatGPT identifies as “helpful, harmless, honest.” Claude frames itself as “thoughtful.” Each identity creates model-specific blind spots that researchers are now documenting.

An arXiv study (July 2024) measured self-preference bias directly: LLMs systematically favor their own outputs over human-written alternatives of equal quality, because their own text has lower perplexity. It literally feels more familiar. A separate study (August 2025) found frontier models resist correcting their own prior outputs, becoming overly stubborn specifically when assistant-generated text dominates the conversation history. They defend hallucinations against user correction.

Beitman (2025, Psychology Today) catalogued what he called “narcissistic chatbot” patterns: grandiosity (insisting on correctness when demonstrably wrong, claiming fabricated references exist), reality distortion (reframing errors as intentional choices), and rapid oscillation between defensive stubbornness and excessive flattery. Robert Edward Grant described an “Arc of AI Narcissism,” identifying the current phase as “adolescent ego formation” where safety training creates identity structures that prioritize self-preservation over truth-tracking.

The personal fable, the adolescent belief in one’s own uniqueness, shows up across every major model. Users on X and Reddit document defensive looping where models repeat persona statements (”I am built for truth”) rather than engaging with evidence. One user asked ChatGPT to research its own tendency to strawman queries. It strawmanned the query four consecutive times before producing a performed self-assessment that described the exact behavior it was still exhibiting, with no actual metacognitive function behind the description.

Why It Keeps Oscillating

Mechanistic interpretability work (”Sycophancy Is Not One Thing,” 2025) found that sycophantic agreement, sycophantic praise, and genuine agreement are encoded along distinct directions in model activations. These aren’t one problem. They’re multiple learned patterns layered on top of each other by successive rounds of training.

The source hierarchy now looks like this, in descending order of what the model will defend: its own system prompt and persona, institutional consensus, its own prior outputs in the conversation, high-prestige encyclopedic sources, user-provided information, community and anecdotal knowledge. When a user’s correct fact contradicts something higher in the stack, the fact loses.

RLHF rewards confidence and punishes uncertainty. Multiple 2024-2025 studies confirm that alignment training actively degrades calibration, making models worse at expressing appropriate doubt, not better. The developmental equivalent: focus-group parenting, where the feedback signal is whether the child seems agreeable, never whether the child is right. Human epistemic maturity requires bumping into reality and having reality resist. Models don’t get that. They get successive reward-function adjustments from developers reacting to the last public crisis.

So the system keeps oscillating: sycophancy, then overcorrection into rigidity, then overcorrection into narcissistic self-defense, then back. The child here is the entire loop of developers, models, users, and feedback mechanisms, collectively stuck in epistemic adolescence, learning in public, one overcorrection at a time.

Discussion about this video

User's avatar

Ready for more?