The Mandela Effect in AI: Why Language Models Misremember
Arshavir Blackwell, PhD
Welcome back to the podcast, Inside the Black Box. I'm Arshavir Blackwell. In this episode, I want to look at the Mandela Effect and what it can tell us about how large language models handle memory and error.
Arshavir Blackwell, PhD
You've probably seen the standard examples: people confidently quoting Darth Vader as saying "Luke, I am your father," or insisting that the Monopoly Man has a monocle. Neither is correct. The actual line is "No. I am your father." The Monopoly character's name is Rich Uncle Pennybags, and he's never had a monocle. These aren't isolated slips; they're systematic, shared distortions in recall.
Arshavir Blackwell, PhD
This isn't just a curiosity. It matters because large language models—GPT-5, Claude, and the rest—are now drafting our emails, summarizing our research, and answering questions we used to ask search engines. When these models inherit our collective errors, those errors get amplified and recirculated. Understanding how this happens is the first step toward doing something about it.
Arshavir Blackwell, PhD
Consider what happens when you train a large language model on vast amounts of human text. The model absorbs not just facts, but our mistakes. Ask it for the "famous Darth Vader quote," and it will often return the misquote, because that's what appears most frequently in its training data. It's like a fossil record of human misremembrance encoded in the model's weights. The model isn't misremembering a movie; it's accurately reflecting the distribution of how people talk about that movie.
Arshavir Blackwell, PhD
A classic example is the Berenstain/Berenstein Bears confusion. Many adults are convinced it was spelled "-stein," S-T-E-I-N. When you probe language models on this, you often see hedging or mixed behavior. Ask the same model the same question ten times, and you'll sometimes get "Berenstain," sometimes "Berenstein," and sometimes a diplomatic answer acknowledging that both spellings exist in popular memory. That ambiguity is a direct reflection of the conflicting signals in the training corpus. What matters is not any one person's memory, but how these errors propagate through culture and then get embedded again in models trained on that culture.
Arshavir Blackwell, PhD
So whether we're looking at groups of humans or at large models trained on human text, these collective misremembrances are useful signals. They give us a window into how memory and representation work, biologically and computationally. Let's unpack what that looks like under the hood.
Arshavir Blackwell, PhD
In cognitive science, human false memories are often understood as reconstructive. The brain doesn't replay a perfect recording; it rebuilds an episode from partial traces, schemas, and expectations. If you've seen many images of wealthy cartoon men with monocles—Mr. Peanut, the guy from the New Yorker—it's easy to retroactively attach a monocle to the Monopoly character. Social reinforcement and suggestion push these reconstructions in particular directions.
Arshavir Blackwell, PhD
Language models are different in mechanism but related in outcome. They don't reconstruct episodes; they predict the next token from statistical regularities in their training data. If the internet consistently gets a quote wrong, the wrong version becomes the statistically dominant continuation. But internally, things are more nuanced. Work in mechanistic interpretability suggests that both correct and incorrect answers can coexist in the network as competing representations. Under some conditions, the incorrect pattern wins; under others, the correct one does.
Arshavir Blackwell, PhD
Here's where it gets really interesting. A 2023 paper from Kenneth Li and colleagues at Harvard—the one introducing Inference-Time Intervention, or ITI—analyzes specific attention heads, those selective focus mechanisms we've discussed in earlier episodes. They found that by modifying certain heads at inference time, they could push the model toward more truthful outputs. On benchmarks like TruthfulQA, which are explicitly designed to trigger common cultural myths and misquotes, an unsteered model often returns "Luke, I am your father." When the relevant attention circuitry is nudged appropriately, the model instead produces the correct line: "No. I am your father."
Arshavir Blackwell, PhD
It's a strange result. The correct answer and the incorrect answer are both encoded in the same network; the model "knows" both. The intervention doesn't add new information; it shifts which internal pattern gets expressed. One way to read this: the information was there all along, and steering simply changes which representation wins the competition. But note that the mechanics here are still debated. Some researchers argue the intervention is doing something more substantial than just "shifting," although the exact dynamics remain unclear.
Arshavir Blackwell, PhD
To make this concrete: imagine you're probing a model and you identify that attention head 47 in layer 12 is strongly associated with producing popular-but-false completions. You can then dampen that head's influence during inference. The prompt stays the same, the model weights stay the same, but the output flips from the misquote to the accurate quote. That's the kind of surgical control mechanistic interpretability aims to provide. Well, maybe one day.
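For listeners who want to see roughly what that kind of intervention looks like in code, here is a minimal PyTorch sketch of a toy attention layer with a per-head scaling knob. This is not the ITI authors' implementation; the layer and head indices in the episode (layer 12, head 47) are hypothetical, and so is everything in this toy.

```python
import torch
import torch.nn as nn

class SteerableAttention(nn.Module):
    """Toy multi-head self-attention with a per-head scale that can be
    turned down at inference time (1.0 = untouched, 0.0 = silenced)."""

    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.head_scale = torch.ones(n_heads)   # the intervention knob

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                    # per-head outputs
        heads = heads * self.head_scale.view(1, -1, 1, 1)   # dampen chosen heads
        return self.out(heads.transpose(1, 2).reshape(B, T, D))

layer = SteerableAttention()
x = torch.randn(1, 5, 64)        # dummy hidden states for 5 tokens
layer.head_scale[3] = 0.0        # silence a hypothetical "bad" head
with torch.no_grad():
    y = layer(x)
print(y.shape)                   # torch.Size([1, 5, 64])
```

In a real model you would first locate the offending head with probing experiments; the point of the sketch is only that the prompt and the weights stay fixed while a single multiplier changes which internal pattern reaches the output.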
Arshavir Blackwell, PhD
Of course, there's more going on than quote recall. Models also blend noisy data with their own generalizations, which can lead to hallucinated details as well as memorized errors.
Arshavir Blackwell, PhD
So far we've been looking inside a single model. But what happens when models talk to each other—as they increasingly do in agentic systems and AI pipelines? This is where the Mandela Effect analogy gets even more interesting. Recent work on multi-agent LLM benchmarks shows that when multiple models interact, sometimes with assigned roles like "authority" or "compromiser," they can reinforce one another's mistakes. The group dynamics start to resemble human rumor formation and consensus-building, including convergence on incorrect beliefs. Just as humans in a group can talk each other into a shared false memory, LLM agents can amplify errors through repeated exchange. This matters as we deploy more systems where models coordinate, debate, or check each other's work—the architecture of interaction shapes the reliability of the output.
Arshavir Blackwell, PhD
At the same time, these errors are not immutable. There's now a small but growing toolkit for mitigating them. Cognitive anchoring primes the model with reliable context before generation. Chain-of-thought prompting encourages the model to spell out intermediate reasoning steps instead of jumping straight to an answer, which can help expose and correct missteps before they reach the final output. Retrieval-augmented generation ties responses to external documents, grounding outputs in explicit evidence. And more targeted methods intervene in activation space—the high-dimensional internal states the model passes through as it processes each token—nudging specific neurons or heads associated with factual versus spurious patterns, without retraining the entire model.
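To make the retrieval-augmented piece concrete, here is a toy sketch of grounding a prompt in retrieved evidence. The keyword-overlap retriever and the example snippets are stand-ins I've made up for illustration; a production system would use an embedding-based retriever and a real document store.

```python
def retrieve(query, documents, k=2):
    """Rank documents by naive word overlap with the query; a stand-in
    for a real embedding-based retriever."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query, documents):
    """Prepend retrieved evidence so the model answers from explicit
    sources rather than from its parametric memory."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return ("Answer using only the evidence below. "
            "If the evidence is insufficient, say so.\n"
            f"Evidence:\n{context}\n\nQuestion: {query}\nAnswer:")

sources = [
    "The Empire Strikes Back script: Vader's line is 'No. I am your father.'",
    "The Monopoly mascot, Rich Uncle Pennybags, wears a top hat but no monocle.",
    "The children's book series is spelled 'Berenstain Bears.'",
]
print(grounded_prompt("What does Darth Vader actually say to Luke?", sources))
```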
Arshavir Blackwell, PhD
Even with all of that, not every hallucination is avoidable. Training data is always partial, and models often operate in open-ended settings where they must extrapolate beyond what they've seen. In those regions, errors and fabrications are hard to eliminate entirely, much as humans extrapolate beyond their direct memories and sometimes get things wrong. That raises a deeper alignment question: what does it mean, in practice, to make a model "truthful"? Is it enough to suppress the most common cultural errors, or do we need systems that can represent uncertainty, cite sources, and reason explicitly about what they do and don't know, perhaps by tapping into some confidence metric?
Arshavir Blackwell, PhD
The Mandela Effect, in both humans and models, points to a simple but important idea: memory is not just about correspondence to the external world; it's about what patterns get reinforced inside the network. And here's the practical news for anyone using these tools: just as we've learned to be skeptical of our own memories, to double-check before swearing the Monopoly Man had a monocle, we need to develop similar habits with LLMs. When the answer matters, probe the model. Ask it to cite sources. Rephrase the question and see if the answer stays stable, as I do. Don't mistake confidence for accuracy.
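That habit of rephrasing and re-asking can itself be automated. Here is a small sketch of a stability check; `ask_model` is a hypothetical callable standing in for whatever LLM client you actually use.

```python
from collections import Counter

def stability_check(ask_model, prompts, n_samples=3):
    """Ask the same question several ways, several times, and report the
    most common answer along with its agreement rate. `ask_model` is a
    hypothetical callable (prompt string -> answer string)."""
    answers = [ask_model(p).strip().lower()
               for p in prompts for _ in range(n_samples)]
    top, freq = Counter(answers).most_common(1)[0]
    return top, freq / len(answers)

# Stand-in "model" that always gives the same answer, so agreement is 1.0.
answer, agreement = stability_check(
    lambda p: "No. I am your father.",
    ["What does Darth Vader actually say to Luke?",
     "Quote Vader's exact line to Luke in The Empire Strikes Back."],
)
print(answer, agreement)   # a low agreement rate is a flag to double-check
```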
Arshavir Blackwell, PhD
As generative models become more tightly integrated into our workflows, the goal is not perfection, but rather traceability and control. We want systems that can be audited, steered, and made aware, at least in a functional sense, of where their own blind spots are likely to be. The Mandela Effect is a reminder that both we and our models are shaped by the information ecosystems we swim in. The question is whether we can learn to see those currents clearly enough to navigate them.
Arshavir Blackwell, PhD
I'm Arshavir Blackwell, and this has been Inside the Black Box.
