Peering Inside the Black Box
Chapter 1
Arshavir Blackwell, PhD
Welcome to the first episode of Inside the Black Box. I’m your host, Arshavir Blackwell. Today, I want to crack open a major shift in how we peer inside large language models. If you’ve seen those saliency maps or heatmaps people used to show where the model “looked” in a sentence, you know they’re fine for getting a rough idea. But that’s like reading an MRI scan through frosted glass. You get the general outline, but not the function, not the logic, and definitely not the why.
Arshavir Blackwell, PhD
For years, these kinds of tools were all we had. But as these models started writing essays and debugging code, people realized we were outmatched. These tools just painted over the surface, telling us which words were hot, but not how the model made its decision in the first place.
Arshavir Blackwell, PhD
This is where mechanistic interpretability comes into play. The approach took shape around 2020, when Chris Olah and his collaborators, first at OpenAI and then at Anthropic, decided: let’s take transformers apart like laptops, circuit by circuit. Their hunch was that if the network shows structured behavior on the outside, there has to be engineering on the inside. That’s kind of radical, actually: treating these models less like black-box crystal balls and more like really complicated clockwork.
Arshavir Blackwell, PhD
What did that approach unlock? One classic example is the “induction head.” This is a special attention head that basically learns to spot repeated patterns and continue them. Now we weren’t just pointing at colors on a heatmap; we could say: here’s the mechanism that copies and extends patterns in text, step by step.
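To make that mechanism concrete, here is a toy Python sketch of the rule an induction head implements: if the current token appeared earlier in the context, look at what followed it last time and predict that same continuation. This is plain Python standing in for an attention head, not a real transformer component, and the sequence and function name are purely illustrative.

```python
# Toy illustration of the algorithm an induction head implements:
# "if the current token appeared earlier, predict whatever followed it last time."
# Plain Python, not a real transformer; the example sequence is made up.

def induction_predict(tokens):
    """Return the induction-style guess for the next token.

    Rule: find the most recent earlier occurrence of the last token,
    and predict the token that came right after it. Returns None if
    the last token has not been seen before.
    """
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence of `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy forward the continuation
    return None

# "the cat sat ... the cat" -> an induction head pushes probability onto "sat".
sequence = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "cat"]
print(induction_predict(sequence))  # -> "sat"
```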
Arshavir Blackwell, PhD
One of the most fascinating discoveries has been the emergence of coreference circuits—tiny subnetworks inside large language models that figure out who’s being talked about. Take a sentence like “Alice went to the market because she was hungry.” The model somehow knows that she refers to Alice. What’s remarkable is that no one programmed this explicitly; these circuits just appeared during training, echoing the same strategies humans use to keep track of people and things in conversation. Remember, all the model is trained to do is to predict the next word. It’s as if the model has independently rediscovered part of the machinery of language understanding.
Arshavir Blackwell, PhD
And that’s just one example. Researchers are now mapping out attention heads that specialize in negation, or that track topic shifts as a discussion moves from one idea to another. Tools like sparse autoencoders help disentangle the overlapping representations inside these networks, so individual neurons and features become cleaner, more interpretable. And with techniques such as causal scrubbing, developed at Redwood Research, we can finally go beyond just observing correlations. We can intervene—testing whether a particular circuit actually causes a behavior, or if it’s just riding along. Step by step, the inner logic of these models is starting to come into focus.
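As a rough illustration of the sparse-autoencoder idea, here is a minimal PyTorch sketch: an overcomplete dictionary of features is trained to reconstruct model activations under an L1 sparsity penalty, so each activation gets explained by a handful of cleaner features. The dimensions, penalty weight, and random stand-in “activations” below are placeholders; this is not any lab’s actual training code.

```python
# Minimal sparse-autoencoder sketch for dictionary learning on model activations.
# Dimensions, penalty weight, and data are illustrative placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        # Overcomplete dictionary: more latent features than activation dimensions.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the activations;
    # the L1 term pushes most features to zero, encouraging cleaner, sparser features.
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Toy training loop on random vectors standing in for a transformer's residual stream.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, 512)
for step in range(100):
    feats, recon = sae(activations)
    loss = sae_loss(activations, feats, recon)
    opt.zero_grad()
    loss.backward()
    opt.step()
```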
Arshavir Blackwell, PhD
The field is feeling less like peering into a void and more like charting the circuitry of some alien brain—except it’s a brain we built ourselves.
Arshavir Blackwell, PhD
But interpreting circuits is really just one angle. There’s another revolution happening in parallel—what Sean Trott dubbed LLM-ology, or artificial psycholinguistics. This is close to home for me, because my PhD work was basically a prototype of this, just with smaller models and made-up languages instead of giant LLMs.
Arshavir Blackwell, PhD
LLM-ology says: let’s treat language models as cognitive subjects, the way you might treat people in a linguistics lab. You give them a carefully controlled set of prompts, change one detail at a time, and run minimal-pair studies. There’s the classic “banana phone” test: if you ask the model to describe “making a banana phone,” does it get the joke? Or does it give you a straight-faced engineering schematic for banana-shaped electronics? (It usually does the latter.)
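Here is what a minimal-pair study can look like in practice, sketched with the Hugging Face transformers library and GPT-2 purely as a stand-in model: score two sentences that differ in a single word and compare how surprising the model finds each one. The specific pair below is illustrative.

```python
# Sketch of a minimal-pair study: compare how likely a model finds two sentences
# that differ in one detail. GPT-2 is a stand-in; any causal LM would do.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence):
    """Total log-probability the model assigns to a sentence (higher = less surprising)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_scores = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

# A minimal pair: one word changes, everything else is held constant.
pair = ("Alice went to the market because she was hungry.",
        "Alice went to the market because he was hungry.")
for s in pair:
    print(f"{sentence_log_prob(s):8.2f}  {s}")
```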
Arshavir Blackwell, PhD
What comes out of these studies is that models consistently get the syntax before they get anywhere close to handling meaning. They’re great at the form—putting words in legal order—but understanding the deeper concept, like the difference between joke and instruction, is still a work in progress. Instruction tuning helps with appearances, but sometimes that only encourages surface mimicry, not true understanding underneath.
Arshavir Blackwell, PhD
The field’s split into two sub-areas here. There’s the psycholinguistic side: playing endless games of prompt tweaking and measuring whether outputs change reliably. And then there’s the neurolinguistic side, where you crack open the model’s activations and use diagnostic classifiers: simple probes trained to read off what information is encoded in the neuron outputs.
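A diagnostic classifier is, at heart, a simple probe trained on a model’s hidden activations. The sketch below uses scikit-learn and synthetic activations with a planted signal, just to show the workflow; in a real study the activations would come from an actual model layer and the labels from annotated text.

```python
# Sketch of a diagnostic (probing) classifier: a linear model trained to read a
# property off hidden activations. Data here is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these are mid-layer activations for 2,000 tokens, each labeled for some
# linguistic property (say, "inside a negated clause" = 1, otherwise 0).
activations = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)
# Plant a weak linear signal so the probe has something to find in this toy data.
activations[labels == 1, :5] += 0.5

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# If the probe beats chance, the property is linearly decodable from this layer:
# evidence the information is represented there, though not proof the model uses it.
print("probe accuracy:", probe.score(X_test, y_test))
```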
Arshavir Blackwell, PhD
These two traditions—mechanistic interpretability and LLM-ology—are less rivals and more like dance partners. Think of it this way: when LLM-ology runs an experiment and stumbles on some weird output—a model hallucinating, say, Theresa May becoming the French President—mechanistic researchers can trace the circuits or neurons that might be “entangling” France, UK, and Theresa May, and start to untangle it. Conversely, if MI finds a suspicious module that mixes concepts, behavioral people can devise new prompts to see where things break in action. It’s an iterative cycle—an endless ping-pong between what the model does and what it really knows.
Arshavir Blackwell, PhD
There are huge challenges here. The parameter counts on these models are, frankly, enormous. Seventy billion parameters. Thousands of attention heads. It’s like trying to map every synapse in a blue whale’s brain with tweezers. The solution may lie in automating many of these analyses, using methods such as sparse autoencoders, activation clustering, and whole feature atlases. Going through the network manually, neuron by neuron, does not scale.
Arshavir Blackwell, PhD
Another major challenge is polysemanticity, or superposition—the phenomenon where a single neuron or attention head encodes multiple, often unrelated features. It’s like receiving overlapping signals through the same channel, where distinct concepts interfere with one another. Mechanistic interpretability tools such as sparse autoencoders can help separate these signals by disentangling the underlying feature space, but the process remains complex. Determining which features are active—and in what contexts—requires careful analysis. And when it comes to establishing causal relationships, the bar is even higher: demonstrating that editing or ablating a component truly causes a behavioral change demands rigorous experimental controls and becomes increasingly difficult as model scale grows.
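A small numpy demo can make superposition tangible: pack more feature directions than there are dimensions into one activation vector, and reading any single feature back out picks up interference from the others. The feature counts and random directions below are arbitrary choices for illustration.

```python
# Toy illustration of superposition: more "features" than dimensions share one vector.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 50, 20   # more concepts than neurons

# Each feature gets a random unit direction in activation space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
true_activity = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
true_activity[active] = 1.0

# The hidden state is the superposition (sum) of the active feature directions.
hidden = true_activity @ directions

# Reading each feature back out by projection roughly recovers the active ones,
# but inactive features pick up nonzero "interference" because 50 directions
# cannot all be orthogonal in 20 dimensions.
readout = directions @ hidden
print("active features:   ", sorted(active.tolist()))
print("top readout scores:", np.argsort(readout)[-3:][::-1].tolist())
print("max interference on an inactive feature:",
      round(float(np.max(np.abs(np.delete(readout, active)))), 3))
```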
Arshavir Blackwell, PhD
Let's walk through that Theresa May hallucination case. Suppose a model says, “Theresa May won the French election”, which, you know, is not true. An LLM-ology prompt variation might catch that the model’s confusing office titles or nationalities. That signals to the MI researcher: go hunt down the tangled circuits or neurons blending political geography with names. You then bring in tools like causal scrubbing to verify: is this really where things break? And with editing approaches like ROME or MEMIT, you don't have to retrain the model; you can surgically correct just that factual association, say, adjusting the link between “Theresa May” and “France”, while leaving everything else intact. Suddenly, a hallucination flagged by behavior can be traced, verified, and edited at the root, all with tools from both fields.
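To give a flavor of the editing idea, here is a deliberately simplified numpy sketch of a rank-one weight update: nudge one matrix so that a chosen “key” vector (standing in for the subject representation) maps to a corrected “value”, while directions orthogonal to the key are left untouched. The real ROME and MEMIT algorithms locate the right layer and use covariance-weighted updates over many examples; this sketch only shows a minimal-norm version of the core trick, with made-up vectors.

```python
# Simplified sketch of the idea behind rank-one model edits: make W map key k to
# a new value v_new while changing W as little as possible. Not the real ROME/MEMIT.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))          # stand-in for an MLP weight matrix

k = rng.normal(size=d)               # key: representation of the subject
v_new = rng.normal(size=d)           # desired value: the corrected association

# Minimal-norm rank-one update that makes W @ k equal v_new exactly.
delta = np.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta

print("edited key now maps to target:", np.allclose(W_edited @ k, v_new))

# Directions orthogonal to k are untouched, so unrelated associations are preserved.
k_orth = rng.normal(size=d)
k_orth -= (k_orth @ k) / (k @ k) * k
print("orthogonal directions unchanged:", np.allclose(W @ k_orth, W_edited @ k_orth))
```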
Arshavir Blackwell, PhD
That synergy might be the best shot we have at real transparency and control. MI can alert us to hidden biases or risky internal goals before they show up in public, while LLM-ology can spot when a model has learned misleading patterns or, say, takes a figurative prompt dead literally. Both are essential for making these models trustworthy.
Arshavir Blackwell, PhD
The black box is still opaque, but these tools are making it less so, turning inscrutable matrices into something we can reason about, something almost familiar. We’re not all the way there, by any means, but every year the tools get sharper. And because we appear to be heading toward giving these systems more and more control over our lives, in areas such as medical diagnosis, scientific research, and even law and government, we need a much better understanding of what these models are doing and how to fix them when they go astray. This has been Inside the Black Box. I'm Arshavir Blackwell.
