Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

When Knowledge Battles Noise in GPT Models

Explore how factual recall in GPT-2 surfaces mid-network, then loses an internal competition to generic responses at the output. Discover parallels with human cognition, and how larger models recover suppressed knowledge through indirect recall.

Chapter 1

Inside the Black Box: The Fate of Factual Knowledge in GPT-2

Arshavir Blackwell, PhD

Welcome back to Inside the Black Box. I'm Arshavir Blackwell. Today we're looking at a specific puzzle: what happens to factual knowledge inside GPT-2 when the model seems to "know" something but fails to say it? If you caught our previous episode on neural circuits, this builds directly on that work.

Arshavir Blackwell, PhD

Give GPT-2 a simple prompt: "The capital of Finland is…" and it answers "the." Not "Helsinki." The obvious interpretation is that it never learned the fact. But that's not what's happening. Using the logit lens—an interpretability tool that lets you examine predictions at each layer, not just the final output—you can track what the model is "thinking" at each step. At layer 11 out of 12, GPT-2 assigns "Helsinki" a 46% probability. It's the model's top prediction. The right answer is right there. Then at layer 12, that probability collapses and "the" takes over. The model had the answer and then lost it.
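
If you want to poke at this yourself, here's a minimal sketch of that logit-lens probe using the Hugging Face transformers API. The helper name layerwise_answer_prob is mine, and the exact probabilities you see will depend on the model snapshot and tokenization, but the rise-then-collapse shape should be visible.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def layerwise_answer_prob(model_name, prompt, answer):
    """Logit lens: project each layer's residual stream through the
    final layer norm and unembedding, then report the probability of
    the answer's first sub-token at the last position."""
    tok = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    target = tok.encode(answer)[0]  # first sub-token of the answer
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    for layer, h in enumerate(hidden):
        # What would the model predict if this layer were the last one?
        logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
        probs = torch.softmax(logits, dim=-1)
        top = tok.decode([probs.argmax(dim=-1).item()])
        print(f"layer {layer:2d}: top={top!r:>12} "
              f"P({answer!r}) = {probs[0, target].item():.3f}")

layerwise_answer_prob("gpt2", "The capital of Finland is", " Helsinki")
```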

Arshavir Blackwell, PhD

This isn't a one-off glitch. Track other facts and you see the same pattern: "Paris" for "The capital of France is…" rises to the top mid-network, then drops in the final layer. "Blue" for "The color of the sky is…"—same thing. Knowledge emerges in middle layers and gets suppressed at the end.
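
Checking the other prompts is a one-liner each, assuming the helper from the sketch above is loaded. The leading spaces matter for GPT-2's tokenizer.

```python
layerwise_answer_prob("gpt2", "The capital of France is", " Paris")
layerwise_answer_prob("gpt2", "The color of the sky is", " blue")
```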

Arshavir Blackwell, PhD

Decompose that final layer into its two components, attention and MLP, and both turn out to contribute negatively to the correct answer's logit. The MLP contribution is especially strong in the wrong direction.
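
One rough way to see that decomposition is direct-logit attribution: hook the final block's attention and MLP outputs and project each write to the residual stream through the unembedding. This is a sketch, not a full analysis; it omits the final layer norm, and the module paths follow the GPT-2 implementation in Hugging Face transformers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tok("The capital of Finland is", return_tensors="pt").input_ids
target = tok.encode(" Helsinki")[0]

captured = {}
block = model.transformer.h[-1]  # final transformer block
# GPT-2's attention forward returns a tuple; its output tensor is o[0]
block.attn.register_forward_hook(lambda m, i, o: captured.update(attn=o[0]))
block.mlp.register_forward_hook(lambda m, i, o: captured.update(mlp=o))

with torch.no_grad():
    model(ids)

# Project each sublayer's residual-stream write onto the unembedding
# row for the target token (final layer norm omitted for simplicity).
W_U = model.lm_head.weight  # [vocab, d_model]
for name, vec in captured.items():
    contrib = (vec[0, -1] @ W_U[target]).item()
    print(f"{name}: {contrib:+.2f} toward ' Helsinki'")
```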

Arshavir Blackwell, PhD

So you get this trajectory: factual answers rise in probability, peak in middle layers, then get suppressed by generic tokens as the model approaches output. The model has the knowledge; it loses an internal competition just before generating text.

Arshavir Blackwell, PhD

This internal competition maps onto something from psycholinguistics that I find genuinely interesting. In the 1980s, Elizabeth Bates—one of my advisors at UCSD—proposed the Competition Model. The idea is that when people process language, multiple possible interpretations activate simultaneously. Different cues—grammar, word order, semantics—each contribute weight to candidates, and the most supported one wins by the time you understand the sentence.

Arshavir Blackwell, PhD

Inside GPT-2, you're seeing something structurally similar. Multiple candidate tokens—"Paris," "the"—get activated in parallel. Attention heads in middle layers read the prompt and boost candidates based on context. At layer 10, for instance, attention heads that focus on "France" and "capital" increase the probability for "Paris."
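
You can eyeball that behavior by dumping attention patterns for the tenth block (index 9, zero-indexed). This only shows where each head looks from the final position; tying specific heads to the boost for "Paris" takes the kind of attribution analysis sketched earlier.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True)

tokens = [tok.decode([t]) for t in ids[0]]
attn = out.attentions[9][0]  # tenth block: [heads, seq, seq]
for head in range(attn.shape[0]):
    weights = attn[head, -1]        # where the final position looks
    src = weights.argmax().item()
    print(f"head {head:2d}: strongest on {tokens[src]!r} "
          f"({weights[src].item():.2f})")
```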

Arshavir Blackwell, PhD

But in the final layers, the model's learned priors reassert themselves. Generic tokens like "the" get a systematic boost. The model has probably learned that after phrases ending with "is," outputting "the" is statistically safe across many contexts. So there's a genuine competition: specific factual knowledge versus generic statistical priors.

Arshavir Blackwell, PhD

What's striking is that the transformer isn't doing symbolic lookup—it's activating options based on cues and letting them compete, with the output determined by which candidate accumulates the most weighted evidence. That's not a metaphor for what the brain does; it's a similar computational strategy.

Arshavir Blackwell, PhD

This raises a question: Is this competition a deep feature of how any system processes sequential information? Not a locked vault of knowledge, but ongoing competition where the winner is sometimes the best answer and sometimes just the safest default.

Arshavir Blackwell, PhD

What happens at larger scale? You might expect bigger models to solve the problem. Not quite. Run the same analysis on GPT-2 XL—more parameters, 48 layers. For "The capital of Finland is…" it reaches nearly 100% confidence on "Helsinki" at layer 41. That high probability holds for about ten layers. But even in this larger model, the final layers chip away at it, and "the" still edges ahead at output. The factual signal is stronger and persists longer, but still gets suppressed.
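
The earlier helper runs on the larger model unchanged; the loop now prints 49 rows, the embedding layer plus 48 transformer blocks.

```python
layerwise_answer_prob("gpt2-xl", "The capital of Finland is", " Helsinki")
```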

Arshavir Blackwell, PhD

Here's what's interesting though: if you let GPT-2 XL continue generating, it often finds an indirect route to the fact. For "The capital of France is…" it outputs "the," then continues with "city of Paris." The knowledge wasn't erased—it got rerouted through a more hedged structure that eventually expresses the right answer. Suppression acts more like a detour than deletion in that case.
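
You can watch that detour by simply letting the model generate. This uses greedy decoding; whether it lands on exactly "the city of Paris" can vary with the model snapshot.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()
ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0]))
```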

Arshavir Blackwell, PhD

That pattern has a human analog. When direct recall fails, you often retrieve information indirectly—fumbling through associations until the right memory surfaces. GPT-2 XL is doing something similar. The candidate that lost the first competition can still win a later one.

Arshavir Blackwell, PhD

Looking at logit lens data across longer outputs, the picture is clear: factual knowledge isn't static or uniformly accessible. In bigger models, there's more signal behind correct answers, and sometimes enough to express them—even if only through indirection. This reframes what we mean by "capability." The model often isn't failing because it never learned a fact; it's failing because its architecture suppresses confident recall in favor of generic output—until context gives the fact another chance.

Arshavir Blackwell, PhD

That's where we'll leave it for today. Thanks for listening. I'm Arshavir Blackwell and I will see you next time on Inside the Black Box.