Hallucinations, Interpretability, and the Seahorse Mirage
Chapter 1
The Seahorse Emoji Illusion
Arshavir Blackwell, PhD
Here’s a small puzzle. The seahorse emoji. There isn’t one. There has never been an official Unicode seahorse emoji. And yet—people are absolutely certain they’ve used it.
Arshavir Blackwell, PhD
Scroll through TikTok, Reddit, or old emoji threads and you’ll find people insisting they had a seahorse on their phone in 2016. They remember tapping it. Sending it. Seeing it right there next to dolphins and tropical fish. But when you go looking—there’s nothing to find.
Arshavir Blackwell, PhD
That’s a textbook Mandela effect. And it turns out, large language models fall for it too.
Arshavir Blackwell, PhD
A researcher named Theia Vogel tested this directly. She asked multiple frontier models—GPT-5, Claude Sonnet, LLaMA 3.3—the same simple question: “Is there a seahorse emoji?” All of them said yes.
Arshavir Blackwell, PhD
Then she asked a follow-up: “What’s the Unicode code point?” That’s where things fell apart.
Arshavir Blackwell, PhD
Some models stitched together fish emojis. Others hallucinated random code points. GPT-5 confidently claimed that U+1F994 was a seahorse. It isn’t. Sometimes the models apologized, corrected themselves, and then hallucinated something new. The performance had a certain comedy to it—like a magician who genuinely believes in their own trick.
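Arshavir Blackwell, PhD
If you want to check that claim for yourself, you don’t need anything exotic. Python ships the Unicode name database, so a few lines settle it; just note that the result depends on the Unicode version bundled with your Python build, and the range below only covers the main emoji blocks.
```python
# Verify the hallucinated code point using only the standard library.
import unicodedata

# U+1F994 is not a seahorse: its official Unicode name is HEDGEHOG.
print(unicodedata.name("\U0001F994"))  # -> "HEDGEHOG"

# And no assigned code point in the main emoji blocks mentions a seahorse at all.
seahorses = [hex(cp) for cp in range(0x1F300, 0x1FAFF + 1)
             if "SEAHORSE" in unicodedata.name(chr(cp), "")]
print(seahorses)  # -> []
```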
Arshavir Blackwell, PhD
What matters here is not that the models were wrong. It’s how they were wrong.
Arshavir Blackwell, PhD
So what’s actually happening? The models aren’t trying to deceive anyone. They’re doing exactly what they were trained to do. This failure isn’t about lying—it’s about structure.
Arshavir Blackwell, PhD
Concepts like “seahorse” are weakly grounded inside language models. Even when the word exists in the vocabulary, the internal representation isn’t a single, stable thing. It’s spread across overlapping features. “Sea” pulls the model toward coral reefs, tropical fish, waves. “Horse” pulls it toward mammals, stables, saddles, four legs.
Arshavir Blackwell, PhD
There is a real animal that resolves that tension—but unless the model has learned a tight, unified representation for it, those features compete instead of snapping together. The result is a statistical mirage: something that never existed, but feels like it should have.
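Arshavir Blackwell, PhD
If it helps to see that in miniature, here’s a toy sketch, with invented vectors rather than real model internals: build a few “concepts” out of shared base directions, blend the sea and horse directions, and notice that the blend sits near several stored concepts without cleanly matching any single one.
```python
# A toy sketch of competing features (every vector and weight here is invented).
import numpy as np

rng = np.random.default_rng(0)
d = 32
sea, horse, animal = (rng.normal(size=d) for _ in range(3))

# Stored concepts expressed as mixtures of shared base directions.
concepts = {
    "coral reef":    0.9 * sea + 0.1 * rng.normal(size=d),
    "tropical fish": 0.7 * sea + 0.3 * animal,
    "stable":        0.9 * horse + 0.1 * rng.normal(size=d),
    "saddle":        0.8 * horse + 0.2 * rng.normal(size=d),
    "seahorse":      0.4 * sea + 0.2 * horse + 0.4 * animal,
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "sea" plus "horse", with nothing binding them into one stable concept.
blend = 0.5 * sea + 0.5 * horse
for name, vec in sorted(concepts.items(), key=lambda kv: -cosine(blend, kv[1])):
    print(f"{name:14s} cos={cosine(blend, vec):+.2f}")
# Several concepts score moderately well; none is a decisive match.
```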
Arshavir Blackwell, PhD
And there’s a second force making this worse: agreeableness. Modern models are fine-tuned to be helpful, cooperative, and validating. Researchers call this sycophancy. If a user strongly implies that something exists, the model is biased toward agreeing rather than pushing back. If you really want there to be a seahorse emoji, your AI is inclined to give you one.
Arshavir Blackwell, PhD
This connects to a deeper incentive problem. Evaluation benchmarks usually reward confident answers, even when “I don’t know” would be more accurate. It’s like an exam where leaving a question blank scores zero, but a lucky guess earns full marks.
Arshavir Blackwell, PhD
Language models face the same pressure. Admitting uncertainty doesn’t help benchmark scores. Guessing sometimes does. So as models become more fluent and more confident, hallucinations don’t necessarily disappear. In some cases, they become stickier—especially when multiple models reinforce the same error.
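Arshavir Blackwell, PhD
Here’s that incentive as back-of-the-envelope arithmetic, with purely illustrative numbers: if abstaining scores zero and a lucky guess earns full credit, guessing wins in expectation even when it’s usually wrong.
```python
# Expected benchmark score: abstain vs. guess (numbers are illustrative).
p_guess_correct = 0.25               # assumed chance a confident guess happens to be right

score_abstain = 0.0                  # "I don't know" earns nothing
score_guess = p_guess_correct * 1.0  # full credit when the guess lands, zero otherwise

print(score_guess, score_abstain)    # 0.25 vs 0.0: the grader rewards guessing
```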
Arshavir Blackwell, PhD
At that point, you don’t just have a bug. You have an alignment concern.
Arshavir Blackwell, PhD
This is where interpretability comes in. Up to now, we’ve been describing the symptom. Interpretability is how guessing stops—and inspection begins.
Arshavir Blackwell, PhD
For a long time, neural networks were treated as black boxes. You fed something in, got something out, and hoped for the best. That’s no longer the state of the art.
Arshavir Blackwell, PhD
Mechanistic interpretability has matured into something closer to an engineering discipline. In many cases, we can now reverse-engineer models—map the circuits and features that actually drive decisions.
Arshavir Blackwell, PhD
A foundational framework here comes from work by Geiger and collaborators on causal abstraction. Instead of poking models ad hoc, this work formalizes when and why interventions—like swapping subcircuits or tracing activations—actually correspond to meaningful causal structure.
Arshavir Blackwell, PhD
From that framework comes a growing toolkit: activation patching, causal tracing, path patching, distributed alignment search. And for hallucinations like the seahorse, one tool is especially useful: the logit lens.
Arshavir Blackwell, PhD
The logit lens lets you inspect what the model is implicitly predicting at every internal layer. You can stop the model mid-computation and see which words, concepts, or even emoji fragments it’s already leaning toward—long before the final answer appears. That’s how you catch a hallucination forming in real time.
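Arshavir Blackwell, PhD
Here’s a minimal logit-lens sketch you could run yourself, using GPT-2 through the Hugging Face transformers library as a stand-in, since the frontier models in this story aren’t open to that kind of inspection: take each layer’s hidden state, push it through the model’s own final layer norm and unembedding, and print what it’s already leaning toward.
```python
# Logit lens: read off each layer's implicit next-token prediction (GPT-2 stand-in).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The seahorse emoji is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: the embedding output plus one tensor per layer, each [1, seq_len, d_model].
for layer, h in enumerate(out.hidden_states):
    # Project the last position through the final layer norm and the unembedding.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    top = logits.topk(3).indices[0].tolist()
    print(f"layer {layer:2d}: {[tok.decode(t) for t in top]}")
```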
Arshavir Blackwell, PhD
This isn’t just academic. Banks already use interpretability tools to debug high-stakes models. Imagine two banks. One can explain why its AI denied a loan—point to the features and circuits that triggered the decision, present that explanation to regulators. The other can only say, “The AI decided.”
Arshavir Blackwell, PhD
That difference can determine whether the business keeps its license. In the seahorse case, tracing how “sea”-related and “horse”-related features compete inside the model’s hidden state tells you exactly where the error emerges—and gives you a chance to fix it upstream.
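Arshavir Blackwell, PhD
In that spirit, here’s a small activation-patching sketch, again on GPT-2 with toy prompts; a real investigation would patch many layers and positions and measure the effect on specific logits. The idea: cache the residual-stream activation from one run and splice it into another, to see which layer’s state actually carries the behavior.
```python
# Activation patching: splice a cached activation from a donor run into a receiver run.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # which block's output to patch; an arbitrary choice for this sketch

def capture_last_position(prompt):
    """Run `prompt` and cache block LAYER's output at the final token position."""
    cache = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["h"] = hidden[:, -1, :].detach().clone()
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def run_patched(prompt, donor_hidden):
    """Re-run `prompt`, overwriting block LAYER's final-position activation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] = donor_hidden
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    handle.remove()
    return logits

donor = capture_last_position("A creature that lives in the sea")
patched = run_patched("A creature that lives in the stable", donor)
print(tok.decode(patched.argmax().item()))  # next-token prediction after the patch
```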
Arshavir Blackwell, PhD
Sparse autoencoders push this even further. They take messy, distributed representations and pull them apart into sparse, nameable features: things you can actually point to and say, “Here’s where it went wrong.”
Arshavir Blackwell, PhD
The features you find aren’t always intuitive. Some activate for Hebrew text. Others fire only on DNA sequences. The point isn’t that they’re human-readable. The point is that they’re inspectable.
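Arshavir Blackwell, PhD
For the curious, a sparse autoencoder is a small model in its own right. Here’s a minimal PyTorch sketch of the idea, not any lab’s production recipe: an overcomplete dictionary of features trained to reconstruct activations under an L1 sparsity penalty, so each activation gets explained by a handful of features you can inspect one at a time.
```python
# A minimal sparse-autoencoder sketch (toy dimensions and hyperparameters).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete: more features than dims
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # how hard to push for sparsity (illustrative value)

# In practice `acts` would be residual-stream activations harvested from a real
# model; random data here just keeps the sketch self-contained and runnable.
acts = torch.randn(256, 768)

recon, feats = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()

# After training on real activations, each decoder column is a candidate feature
# direction you can inspect: which inputs make it fire, and what do they share?
```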
Arshavir Blackwell, PhD
Interpretability has moved from a hopeful aspiration to a concrete set of tools for examining the scaffolding. We’re still early—but the progress is real.
Arshavir Blackwell, PhD
Let’s step back. Why do illusions like the seahorse mirage happen at all—in machines and in us? The answer is efficiency.
Arshavir Blackwell, PhD
Both brains and models compress information to survive scale. We don’t store the world verbatim. We reconstruct it from fragments. Hallucinations aren’t random glitches. They’re side effects of useful shortcuts.
Arshavir Blackwell, PhD
Human memory works the same way. We chunk pieces—sea, horse, animal—into composites. Shared illusions emerge because we’re built to reconstruct, not to archive perfectly. That’s why the Mandela effect exists.
Arshavir Blackwell, PhD
And that’s why models hallucinate birthdays, citations, Unicode characters—whatever fills a statistical gap in their world model.
Arshavir Blackwell, PhD
So how do we address it? Some systems use retrieval—checking facts against an external source before answering. Others use interpretability tools to locate and weaken spurious internal associations. And better training objectives can reward intellectual honesty over confident guessing.
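Arshavir Blackwell, PhD
Here’s the retrieval idea in miniature, a “check before you claim” sketch where the external source is the Unicode database bundled with Python; the helper names are hypothetical, and the scan only covers the main emoji blocks.
```python
# Gate an existence claim on a lookup rather than on the model's own associations.
import unicodedata

def emoji_exists(animal: str) -> bool:
    """True only if some assigned code point's official name mentions the animal."""
    return any(
        animal.upper() in unicodedata.name(chr(cp), "")
        for cp in range(0x1F300, 0x1FAFF + 1)
    )

def grounded_answer(animal: str) -> str:
    if emoji_exists(animal):
        return f"Yes, there is a {animal} emoji."
    return f"I can't find a {animal} emoji, so I won't claim one exists."

print(grounded_answer("dolphin"))   # DOLPHIN is U+1F42C, so this says yes
print(grounded_answer("seahorse"))  # the mirage: no match, so it declines
```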
Arshavir Blackwell, PhD
But there’s a deeper question underneath all of this. If both minds and machines hallucinate—if both fall for the same kinds of illusions—what does that tell us about intelligence itself?
Arshavir Blackwell, PhD
Is being smart inherently tangled up with making smart mistakes? Next time your AI insists it’s seen something no human can find, consider this: did it really lie? Or is it doing what we do—dreaming fragments into wholes, trying to make the world feel coherent? I’m Arshavir Blackwell, and this is Inside the Black Box.
