Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Cracking the Code of AI Interpretation

Dive into how we naturally explain neural networks with folk interpretability and why these simple stories fall short. Discover the journey toward mechanistic interpretability in AI and what that means for how we talk about and trust large language models.

Chapter 1

Introduction

Arshavir Blackwell, PhD

I'm Arshavir Blackwell, and welcome back to Inside the Black Box. Today I want to talk about something I'll call folk interpretability. On this show, we spend a lot of time on formal tools—activation patching, sparse autoencoders, circuit tracing. These are the methods that let us open the box and inspect what models are actually doing. But alongside all of that, there's another mode of explanation that we use constantly, whether we mean to or not. It requires no GPUs. No specialized tooling. Just stories.

Chapter 2

Folk Interpretability

Arshavir Blackwell, PhD

By folk interpretability, I mean the natural language explanations we reach for when talking about models: "This neuron detects dogs." "This attention head tracks syntax." "This layer resolves ambiguity." We talk about models the way we talk about people—or maybe little workers with assigned jobs. You can hear this most clearly when we explain transformers to non-specialists: "Picture different neurons looking for different things—one for negation, one for sentence boundaries, one for sentiment." These explanations are intuitive. They're accessible. And importantly, they're not optional. Humans explain systems by telling stories. If we didn't do this, we wouldn't explain anything at all. The problem isn't that folk explanations exist. The problem is mistaking them for ground truth.

Arshavir Blackwell, PhD

When you ask a model why it chose one word over another, the answer you get is not a readout of internal computation. The model isn't consulting an internal log. It's doing what it was trained to do: predicting what a helpful explanation sounds like. These are narratives we elicit through behavior, not telemetry. They can be coherent. They can be persuasive. They can even be useful. But they are not direct evidence of mechanism.

Arshavir Blackwell, PhD

Still, there's real value here. Anyone with an internet connection can probe a model this way. Folk interpretability democratizes the first pass at understanding. It gives us hypotheses. The question is how seriously to take what comes back. Because when you look more closely, these simple stories start to break down. Take the idea that "this neuron cares about dogs." That assumes features are discrete, stable, and localized. But what mechanistic work actually finds is something messier: overlapping, distributed representations that shift with context. Early analyses of BERT attention made this clear. Many attention heads don't cleanly encode semantics. Instead, they show structural behaviors—fixed positional offsets, separator tokens acting as attention sinks, or multiple heads in the same layer behaving almost identically. Attention can correlate with meaning, but it's rarely a neat semantic role assignment. The folk story is tempting. The underlying reality is higher-dimensional and less cooperative.
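To make that concrete, here is a minimal sketch of the kind of check those early BERT analyses ran, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint. It measures how much of each attention head's mass lands on the [SEP] token; the 0.5 threshold for flagging a head as an "attention sink" is purely illustrative.

```python
# A minimal sketch, assuming Hugging Face `transformers` and the public
# `bert-base-uncased` checkpoint. For each attention head, it measures how
# much attention mass lands on the [SEP] token rather than on content words.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The dog that chased the cat did not catch it."
inputs = tokenizer(sentence, return_tensors="pt")
sep_index = inputs["input_ids"][0].tolist().index(tokenizer.sep_token_id)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped [batch, heads, query_position, key_position].
for layer_idx, layer_attn in enumerate(outputs.attentions):
    # Average over query positions: the fraction of each head's attention on [SEP].
    sep_mass = layer_attn[0, :, :, sep_index].mean(dim=-1)
    for head_idx, mass in enumerate(sep_mass.tolist()):
        if mass > 0.5:  # illustrative threshold for "attention sink" behavior
            print(f"layer {layer_idx}, head {head_idx}: "
                  f"{mass:.2f} of its attention goes to [SEP]")
```

Heads flagged this way are doing something structural rather than semantic, and that gap between the folk story ("this head tracks X") and the measured behavior is exactly the point.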

Arshavir Blackwell, PhD

So when models produce introspective-sounding language—"I felt a pull between two interpretations," "I stayed ambiguous because that was statistically favored"—we should be careful. These are not reports of internal computation. They're confabulations learned from human language. Plausible hypotheses, not evidence. Which raises a deeper question: Can we ever develop an accessible language for systems that operate in high-dimensional geometry? Or will every metaphor necessarily leave something essential out? There's precedent for this problem.

Chapter 3

Analogies and Insights

Arshavir Blackwell, PhD

Consider pain research. Pain is fundamentally subjective. There's no direct sensor for it. Self-report is indispensable. And yet, decades of psychophysics show that verbal reports are shaped by expectation, habituation, and response bias. What's interesting is that pain reports correlate best with changes—differences across conditions or intensities—rather than absolute levels. Self-report is informative, but it's not a transparent window into sensation. It tells us something real. Just not everything.

Arshavir Blackwell, PhD

Implicit learning shows something similar. In classic artificial grammar learning experiments—first studied by Arthur Reber in the 1960s—participants are exposed to strings generated by a hidden grammar. They're never told the rules. Later, they can reliably classify new strings as grammatical or not. Crucially, when you ask these participants to explain what they've learned, or to write instructions for new learners, those explanations are incomplete. They might state, for example, that two A's are always followed by an X, but fail to verbalize the more complicated relationships. New participants using the written rules do better than chance—but not as well as the original learners. There's knowledge in the system that doesn't fully translate into language. This pattern shows up throughout cognitive psychology: partial leakage of mechanism into awareness, without full access. Competence exceeds reportability. People know more than they can say.
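For a sense of what participants face, here is a toy sketch of a hidden finite-state grammar of the kind used in these experiments. The transition table is illustrative, not Reber's original grammar; the point is that the generated strings look patterned while the rules that produced them are never shown.

```python
# A toy sketch of a hidden finite-state grammar for artificial grammar
# learning. The transition table is illustrative, not Reber's original.
import random

# Each state maps to (emitted letter, next state) options;
# None marks an accepting (end) transition.
GRAMMAR = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 3)],
    3: [("X", 2), ("S", None), ("V", None)],
}

def generate_string(rng=random):
    """Walk the grammar from the start state until an accepting transition."""
    state, letters = 0, []
    while state is not None:
        letter, state = rng.choice(GRAMMAR[state])
        letters.append(letter)
    return "".join(letters)

# Participants study strings like these, then classify new strings as
# grammatical or not, without ever seeing the transition table.
for _ in range(5):
    print(generate_string())
```

Classifying new strings from a grammar like this turns out to be learnable; writing down the full rule system tends to be much harder, which is the gap between competence and reportability described above.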

Chapter 4

Mechanistic vs Folk Interpretability

Arshavir Blackwell, PhD

Models may have the same property—internal structure that exceeds what any explanation, folk or otherwise, can fully capture. So maybe we need a similar framework for explanation here. Folk interpretability isn't useless—but it isn't sufficient. Folk stories generate hypotheses. Mechanistic tools test them. This is where methods like sparse autoencoders, transcoders, and activation patching matter. They let us ask whether a folk explanation corresponds to anything causal inside the system. If a model claims it "felt a pull between two meanings" for an ambiguous sentence, we can test that. We can look at which features in the residual stream change between ambiguous and unambiguous cases. Sometimes the mechanistic evidence lines up with the story. Sometimes it doesn't. And that discrepancy is informative.
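Here's a stripped-down sketch of that comparison, assuming the Hugging Face transformers library and the public gpt2 checkpoint. It uses raw hidden states as a stand-in for sparse-autoencoder features, comparing the residual stream at the final token of an ambiguous sentence against an unambiguous paraphrase; a real analysis would project these states through a trained SAE and track individual features.

```python
# A minimal sketch, assuming Hugging Face `transformers` and the public
# `gpt2` checkpoint. Raw hidden states stand in for SAE features here;
# the goal is just to see where the two readings diverge, layer by layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Both sentences end in the same word, so the final-token comparison is aligned.
ambiguous = "Flying planes can be dangerous"            # two readings
unambiguous = "Planes that are flying can be dangerous"  # one reading

def final_token_states(text):
    """Return one hidden-state vector per layer, taken at the last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return [layer[0, -1] for layer in out.hidden_states]

states_a = final_token_states(ambiguous)
states_b = final_token_states(unambiguous)

# Low cosine similarity marks layers where the representations of the two
# readings diverge most: candidate places to look for "ambiguity" features.
for layer_idx, (a, b) in enumerate(zip(states_a, states_b)):
    sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer_idx:2d}: cosine similarity {sim:.3f}")
```

Whether a high-divergence layer actually carries an "ambiguity" feature is then a question for patching or SAE analysis; the sketch only narrows down where to look.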

Arshavir Blackwell, PhD

This may be where the field is heading: multiple windows into the same system, each with different strengths. Folk interpretability for hypothesis generation. Mechanistic interpretability for validation. Iteration between the two to refine what kinds of explanations we should trust. If we could develop a richer folk language—one constrained by what mechanistic study actually shows—it might change more than research. It could reshape public understanding. Imagine people talking about circuit-level dynamics with the same fluency they now talk about "bias" or "training data." Would that change how we think about trust, safety, or alignment? It just might.

Chapter 5

Closing Thoughts

Arshavir Blackwell, PhD

I'll leave you with this. The philosophers Paul and Patricia Churchland famously speak of talking to each other in ways that more closely reflect their neurochemical states. Their example is of Patricia after a hectic meeting, to wit: "Paul, don't speak to me, my serotonin levels have hit bottom, my brain is awash in glucocorticoids, my blood vessels are full of adrenaline, and if it weren't for my endogenous opiates I'd have driven the car into a tree on the way home. My dopamine levels need lifting." End quote. The question for interpretability is whether we could develop something analogous: instead of "Claude is confused" or "the model believes X," something like "there's interference between the factual-recall circuit and the instruction-following circuit" or "the feature for uncertainty is co-activating with the feature for confident assertion." The Churchlands' version never caught on—glucocorticoids and endogenous opiates didn't become folk vocabulary. Whether circuit-level interpretability language could fare better is an open question. Perhaps one day we will talk about our own subjective experience, and that of large language models, with that level of precision.

Arshavir Blackwell, PhD

Thanks for listening. See you next episode. I'm Arshavir Blackwell, and this has been Inside the Black Box.