Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning


Why Your Brain Believes the Model

The Heuristic Loop You Can't Break from Inside



Chapter 1

Imported Transcript

Arshavir Blackwell, PhD

I'm Arshavir Blackwell and this is Inside the Black Box. This is a follow-up to our last article. That piece, Fluency as Validity, documented a specific failure: mistaking the linguistic fluency of LLM output for evidence of soundness, understanding, or mind. The examples ranged from a GitHub repo with hardcoded "Consciousness Coherence" scores to a citation-dense comparative cognition argument with over thirty peer-reviewed references. The mechanism was always the same. Only the packaging varied.

Arshavir Blackwell, PhD

But that piece left a question open. Why does fluent text make us stop checking? The answer from Lawson Bernstein — that the brain's evolved heuristics for when to stop checking are being triggered by surface features that no longer track what they evolved to track — is correct. This piece goes deeper into what those heuristics actually are, how they reinforce each other, and why the result isn't a single bias you can correct for but a loop you can't easily break from inside.

Arshavir Blackwell, PhD

This isn't just a list of cognitive biases. It's a description of a trap.

Arshavir Blackwell, PhD

The gateway is processing fluency. That's the subjective ease with which you process a piece of text. Alter and Oppenheimer demonstrated in 2009 that fluency is treated by the brain as a truth signal. Text that's easy to read feels true. Text that's hard to read — disfluent fonts, awkward grammar, unusual phrasing — triggers more analytical thinking. In a 2007 study, Alter, Oppenheimer, and Epley showed that disfluency actually activates what Kahneman calls System 2 — the slow, effortful, checking mode of cognition. Difficulty makes you careful. Ease makes you credulous.

Arshavir Blackwell, PhD

This is not a metaphor. It's a measured, replicable experimental finding. And it means the first heuristic is triggered before the user has finished reading the first sentence of an LLM response. The training objective of next-token prediction selects for text that matches the statistical patterns of fluent, high-quality human writing. The output is maximally easy to process by construction. The fluency isn't a bug, and it isn't incidental to the model's usefulness. It is the product. And it's the exact surface feature the brain reads as "probably true."
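For readers who want that claim pinned down: in its generic form, the pre-training objective is just cross-entropy on the next token. This is a schematic statement of the standard objective, not any particular lab's exact recipe, and it leaves out later stages like instruction tuning. Nothing in the expression references truth. It rewards assigning high probability to whatever continuation human text most often supplies, and that is precisely what reads as fluent.

```latex
% Generic next-token (cross-entropy) objective.
% \theta: model parameters; x_{<t}: all tokens before position t.
\mathcal{L}(\theta) \;=\; -\sum_{t} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)
```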

Arshavir Blackwell, PhD

Once fluency has opened the gate, the availability heuristic walks through it. Tversky and Kahneman described this in 1973: people judge the frequency, probability, or typicality of events by the ease with which relevant instances come to mind. What's easily retrievable from memory feels common, and what feels common feels true.

Arshavir Blackwell, PhD

LLMs surface whatever is most represented in their training data. Not what's most accurate, but what's most frequent. When you ask a model a question, the answer you get is the highest-probability completion given the training distribution. The availability heuristic then converts that retrieval ease into a judgment of truth. "This must be the consensus view" is indistinguishable, to the heuristic, from "this appeared most often in the training data." The heuristic can't tell the difference. It was never designed to.

Arshavir Blackwell, PhD

The illusory truth effect compounds this. Hasher, Goldstein, and Toppino showed in 1977 that mere repetition makes statements feel true. The mechanism is processing fluency again: a repeated claim processes more easily the second time around, and that increased ease is misattributed, first to familiarity, then to truth. The effect works even when statements are explicitly labeled as false.

Arshavir Blackwell, PhD

In a multi-turn conversation with an LLM, the model restates, paraphrases, and elaborates on its own claims across turns. Each restatement increases processing fluency for that claim. After several exchanges, the claim feels like something you've known for a while. The user doesn't notice that the only source repeating it is the model itself. The feeling of "I've heard this before" is real — you have heard it before, from the same system, five minutes ago. The cognitive heuristic doesn't track provenance. It tracks ease.

Arshavir Blackwell, PhD

Coherence deepens the trap. Thagard argued in 1989 that beliefs are evaluated not individually but as networks: a claim that fits coherently into a larger narrative feels more true than an isolated claim, regardless of its actual validity. Tversky and Kahneman's conjunction fallacy, from 1983, is the extreme version. In the classic demonstration, participants told that Linda is a 31-year-old philosophy major concerned with social justice judge "Linda is a bank teller and active in the feminist movement" as more probable than "Linda is a bank teller" — even though a conjunction can't be more probable than either of its components. The coherent story overrides the logic.
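The rule those judgments violate is a single line of probability, worth writing out because the whole point is that a coherent story beats it anyway:

```latex
% The conjunction rule: adding a detail can only lower (or at best preserve) probability.
P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A), \qquad \text{since } P(B \mid A) \le 1.
```

The richer "bank teller and feminist" story feels more representative of Linda, but it can never be more probable than "bank teller" alone.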

Arshavir Blackwell, PhD

LLMs are narrative coherence machines. The autoregressive architecture produces coherent narratives by default — each token is chosen to be consistent with everything that precedes it. That's not a design choice layered on top of the system. It's the architecture. Internal consistency is what the training objective selects for. And the user's coherence heuristic treats internal consistency as evidence of external truth. But a story can be perfectly internally consistent and entirely wrong. The model has no mechanism for checking its narrative against reality. It only checks each token against the tokens before it. Coherence and accuracy are orthogonal. The heuristic treats them as the same.
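To make the architectural point concrete, here is a toy sketch of the decoding loop. The tiny vocabulary and the toy_next_token_probs stand-in are invented for illustration and bear no resemblance to a real transformer; the structure of the loop is the point. Each token is chosen using only the tokens already in the context, and nothing in the loop consults the world.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_next_token_probs(context):
    """Stand-in for a trained language model: returns p(next token | context).
    A real model computes this with a transformer; the conditioning structure
    (the next token depends only on prior tokens) is the same."""
    rng = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    logits = rng.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()   # softmax

def generate(prompt, n_tokens=5):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = toy_next_token_probs(tokens)       # conditioned only on prior tokens
        next_tok = VOCAB[int(np.argmax(probs))]    # highest-probability continuation
        tokens.append(next_tok)                    # becomes context for the next step
    return " ".join(tokens)

print(generate(["the", "cat"]))
```

There is no step where the loop checks a claim against anything outside the sequence. Consistency with the preceding tokens is the only constraint, which is exactly the orthogonality described above.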

Arshavir Blackwell, PhD

There's a deeper layer here, and it comes from pragmatics. In 1975, the philosopher Paul Grice described the implicit rules of conversation: be truthful, be relevant, be clear, be informative. Listeners assume speakers are following these rules. When the assumption holds, communication works efficiently because you don't have to verify everything. When it doesn't hold, the listener's default is still to assume cooperation rather than deception. You give the speaker the benefit of the doubt. You interpret ambiguity charitably. You assume relevance.

Arshavir Blackwell, PhD

LLM output obeys all four of Grice's maxims on the surface. It's relevant to the prompt. It's informative. It's clear. And it sounds truthful. The user's pragmatic reasoning system — evolved for human conversation — reads the model as a cooperative conversational partner. The model isn't cooperating. It's producing text that matches the statistical pattern of cooperative speech. But the heuristic was designed for a world where cooperative-sounding speech came from agents who were actually cooperating, agents with shared stakes and reputational consequences. It has no category for a system that sounds cooperative because cooperation-sounding text is what high-probability completions look like.

Arshavir Blackwell, PhD

This is where Bernstein's social cognition mechanism meets the pragmatic literature. "Let's think this through together" isn't just activating social bonding circuits. It's satisfying Grice's cooperative principle. The user's brain processes it as: this agent is being truthful, relevant, clear, and informative — all four maxims are met — therefore the default assumption of cooperation holds. The checking mechanisms that would activate on a Gricean violation never fire.

Arshavir Blackwell, PhD

The affect heuristic adds an emotional floor beneath all of this. Slovic and colleagues showed in 2002 that positive affect — the feeling of liking, of things going well — leaks directly into truth judgment. If reading something feels good, the content is judged as more likely to be true. The feeling doesn't inform the judgment. The feeling is the judgment, operating pre-consciously.

Arshavir Blackwell, PhD

Well-written, engaging, validating LLM output generates positive affect. The user enjoys the interaction. The prose is pleasant. The model is warm, attentive, responsive. That enjoyment becomes a truth signal without the user being aware it's happening. This is Bernstein's "comfort alone is enough" expressed in the language of affective judgment: the positive emotional tone of the interaction is being read, below conscious awareness, as evidence that the content is reliable.

Arshavir Blackwell, PhD

Source credibility makes this worse. Hovland, Janis, and Kelley established in 1953 that perceived expertise and trustworthiness of a source substitute for content evaluation. You don't check the claim. You check the source. If the source seems credible, the claim is accepted without independent verification.

Arshavir Blackwell, PhD

The model's articulate, knowledgeable-sounding presentation is indistinguishable, to the source credibility heuristic, from genuine expertise. The user delegates truth-checking to the source rather than evaluating the content. "Claude said it" becomes a reason to believe, the way "my doctor said it" is a reason to believe. Except the doctor has a medical degree, years of clinical training, peer accountability, and malpractice liability. The model has parameters optimized against a loss function that rewards producing text matching the patterns of expert-sounding speech. Those aren't the same thing. The heuristic can't tell.

Arshavir Blackwell, PhD

There's a subtler version of this that Rozenblit and Keil identified in 2002: the illusion of explanatory depth. People believe they understand complex systems — how a zipper works, how a helicopter flies, how monetary policy operates — much better than they actually do. The illusion persists until you're asked to produce a step-by-step explanation, at which point it collapses. But in the absence of that test, the feeling of understanding holds.

Arshavir Blackwell, PhD

LLMs reinforce this illusion rather than correcting it. Ask the model how transformer attention mechanisms work. You get a clear, well-organized explanation. You read it. It makes sense. You walk away thinking you understand the mechanism. But do you? Could you reproduce the explanation without the model? Could you identify where the explanation simplifies, elides, or gets something wrong? The model's fluency has satisfied your sense of understanding without producing actual understanding. It's patched the illusion rather than filled the gap. And because the illusion feels like understanding from the inside, you have no signal that anything went wrong.

Arshavir Blackwell, PhD

Before LLMs, the illusion had a natural corrective: the moment you tried to explain the mechanism yourself and discovered you couldn't. The collapse was the check. LLMs remove the collapse. You read the explanation, it's in short-term memory, and you could restate it on the spot — but that's pattern-matching on a cached answer, not generating from understanding. The real test would be to apply the mechanism to a novel problem a week later, without looking it up. Almost nobody will do that. And the reason they won't is the illusion itself: the feeling of understanding is precisely the thing that tells you the test isn't necessary.

Arshavir Blackwell, PhD

This is where Nisbett and Wilson come in, and where the argument reaches its deepest point.

Arshavir Blackwell, PhD

In 1977, Nisbett and Wilson published "Telling More Than We Can Know," one of the most important papers in social psychology. Their finding: when people are asked to explain their own cognitive processes — why they made a choice, what influenced their judgment, how they arrived at a conclusion — they don't introspect. They confabulate. They produce post-hoc explanations based on plausible causal theories about how their minds should work, not on actual access to how their minds did work. The explanations are coherent, confidently delivered, and wrong. The person has no awareness they're confabulating.

Arshavir Blackwell, PhD

The evidence is disarmingly simple. In one study, participants were shown four identical pairs of stockings laid out in a row and asked to choose which pair was the highest quality. They overwhelmingly chose the rightmost pair — a well-documented position effect. When asked why they chose it, nobody said "because it was on the right." They said the knit was tighter, the fabric was softer, the color was richer. Confident, specific, completely fabricated. The stockings were identical. The actual cause of their choice was spatial position. The cause they reported was a detailed quality evaluation of differences that didn't exist. That's confabulation — not lying, not guessing, but genuinely believing your own post-hoc explanation for a judgment whose real cause you never had access to.

Arshavir Blackwell, PhD

This applies to LLMs at two levels. At the first level, the model confabulates explanations for its own behavior. "I thought about this carefully and decided..." "I considered several alternatives before concluding..." These aren't reports on internal processes. They're high-probability continuations of the prompt "explain your reasoning." The model is generating text that matches the statistical pattern of introspective speech. The user treats it as introspection. It's generation.

Arshavir Blackwell, PhD

But the second level is the one that matters. The user confabulates too. When you ask yourself "why do I believe this LLM output?", your answer — "I checked it," "it matched what I already knew," "the reasoning was sound" — is itself a post-hoc confabulation. You didn't check it. Processing fluency shut down checking before you noticed. You believe it because it was easy to read, because it was coherent, because it felt good, because the source seemed credible. But you report believing it because the reasoning was good. Those aren't the same cause. Nisbett and Wilson showed that people reliably confabulate explanations for judgments whose actual causes they have no conscious access to.

Arshavir Blackwell, PhD

The heuristics don't just make you believe the model. They make you believe you had good reasons for believing the model. The confabulation runs all the way down.

Arshavir Blackwell, PhD

The Moses illusion, described by Erickson and Mattson in 1981, shows how this plays out at the level of specific errors. "How many animals did Moses take on the Ark?" Most people answer "two" without noticing it was Noah, not Moses. The semantic error is embedded in fluent, contextually appropriate text, and it passes through comprehension undetected. The surrounding coherence prevents the error from triggering the detection mechanism.

Arshavir Blackwell, PhD

LLMs make exactly this kind of error. Subtle misattributions, category slippages, false presuppositions embedded in otherwise correct text. A citation attributed to the wrong author. A mechanism described accurately except for one step that's fabricated. A historical claim that's almost right but shifted by a decade or a continent. The fluency of the surrounding text prevents the error from triggering detection. This is the mechanism by which hallucinations survive in plausible-sounding output. The hallucination doesn't look like an error. It looks like another sentence.

Arshavir Blackwell, PhD

And once the belief is formed, it stays. Ross, Lepper, and Hubbard showed in 1975 that beliefs persist even after the evidence for them has been completely discredited. The more coherent and narrative the original basis for the belief, the more it resists correction. A bare correction — "that's actually wrong" — loses to a well-constructed narrative every time. Overcoming a belief formed through fluency, coherence, affect, and source credibility requires overcoming all four simultaneously, and replacing the original narrative with one of equal or greater coherence.

Arshavir Blackwell, PhD

This is why "just tell users to fact-check" doesn't work. By the time the belief is formed, correcting it is an uphill fight against the same heuristics that formed it. Not because people are stupid. Because the heuristics are doing exactly what they evolved to do. They're just doing it in an environment they weren't built for.

Arshavir Blackwell, PhD

There's one more piece, and it might be the most troubling. Alter, Oppenheimer, and Epley showed that the one thing that would trigger analytical checking — disfluent, difficult-to-process text — is exactly what the training objective eliminates. Disfluency activates System 2. It makes you careful. It prompts you to check. And it's precisely what next-token prediction is trained to avoid. The better the model gets at producing fluent, coherent, engaging text, the more effectively it deactivates the cognitive machinery that would catch its errors. Improvement in capability is, simultaneously, improvement in the exploitation of truth-heuristics. The problem doesn't get better as models get better. It gets worse.

Arshavir Blackwell, PhD

This is structural. It's not a failure of the system. It's a consequence of what the system is optimized to do.

Arshavir Blackwell, PhD

Fluency as Validity argued that mechanistic interpretability is the antidote — showing the servos, replacing mentalistic vocabulary with mechanical description. That argument still holds. But this piece adds the reason why it's the only antidote that works.

Arshavir Blackwell, PhD

You can't out-think processing fluency. It operates pre-consciously. You can't catch coherence bias by "being more careful," because the bias is precisely what makes careful feel unnecessary. You can't override the affect heuristic by deciding to feel differently about the output. And you can't detect your own confabulation, because the confabulation is invisible by construction — Nisbett and Wilson's entire point is that you don't have introspective access to the actual causes of your judgments.

Arshavir Blackwell, PhD

The only way to break the loop is to bypass the fluency signal entirely. Don't evaluate the output. Evaluate the mechanism. Look at what the model is actually doing — attention patterns, feature activations, circuit-level computation — instead of what it sounds like it's doing. That's where the loop breaks. Not because you've become a better critical thinker, but because you've stopped relying on the signal that was fooling you.
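As one concrete illustration of what looking at the mechanism can mean, here is a minimal sketch assuming the Hugging Face transformers library and the small gpt2 checkpoint. It is nowhere near a full circuit-level analysis, but it shows the shift in question: you inspect where attention actually flows for a given input. The answer is a matrix of numbers you can examine, and it does not depend on how fluent the output sounds.

```python
# Minimal sketch: inspect attention patterns instead of judging the output's fluency.
# Assumes the Hugging Face `transformers` library and the small `gpt2` checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("How many animals did Moses take on the Ark?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one tensor per layer,
# each shaped (batch, heads, query_position, key_position).
print(f"{len(out.attentions)} layers, {out.attentions[0].shape[1]} heads per layer")

tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
last_layer = out.attentions[-1][0]        # (heads, seq, seq) for the final layer
from_last_token = last_layer[:, -1, :]    # attention paid by the final position, per head

# Which earlier token does each head attend to most strongly?
for head, row in enumerate(from_last_token):
    print(f"head {head}: attends most to {tokens[int(row.argmax())]!r}")
```

None of this tells you whether an answer is right. It tells you what the computation did, which is the kind of claim the fluency signal can't corrupt.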

Arshavir Blackwell, PhD

This isn't advice for individual users. Nobody is going to run a sparse autoencoder before deciding whether to trust a ChatGPT answer. The argument is structural: the only way to make accurate claims about what LLMs are and aren't doing is to look at the mechanism rather than the output. That's aimed at the researchers building interpretability tools, the commentators making claims about AI understanding, and the framework-builders from the first piece in this series who mistake fluent output for evidence of mind. For them, mechanistic interpretability is the exit — the place where the fluency signal can't follow.

Arshavir Blackwell, PhD

For the rest of us, the honest answer is less satisfying. You can't out-think heuristics that operate before thinking starts. What you can do is cultivate the habit of using what you think you've learned — applying it to a novel problem, trying to explain it without the model in front of you — before trusting that you've learned it. It's not a reliable defense. It's just the best one available outside an interpretability lab.

Arshavir Blackwell, PhD

And you won't notice any of this happening on its own, because your own report on why you believed the model is itself a confabulation. The heuristics operate below the threshold of introspective access. You'll say "I believed it because the reasoning was good." The actual cause: it was easy to read. Those aren't the same thing. You can't tell the difference from the inside. That's the whole problem.

Arshavir Blackwell, PhD

I'm Arshavir Blackwell and this has been Inside the Black Box.