Bridging Circuits and Concepts in Large Language Models
Chapter 1
Introduction
Arshavir Blackwell, PhD
Welcome back to Inside the Black Box, where we take apart large language models and look directly at the machinery—what it's doing, why it works, and where our understanding still has blind spots. I'm your host, Arshavir Blackwell.
Arshavir Blackwell, PhD
In the last few episodes, we talked about mechanistic interpretability—reverse-engineering neural networks into circuits and features—and about what some call "LLM-ology," or Artificial Psycholinguistics, the cognitive-science-style analysis of these systems as behavioral entities.
Arshavir Blackwell, PhD
Today, we're tackling the core question that everything else points to: How do low-level mathematical mechanisms become high-level thought?
Arshavir Blackwell, PhD
Knowing which attention head copies a name across a sentence is useful. But it doesn't explain how the model writes a coherent paragraph about French history, complete with the Eiffel Tower, the French Revolution, and the Napoleonic Wars. Somewhere between the artificial neurons and the ideas is a missing middle.
Arshavir Blackwell, PhD
Mechanistic interpretability can tell us which head tracks subject–verb agreement or which neurons light up around negation. But those are fragments. The real puzzle is how they combine into something that feels like structured reasoning.
Chapter 2
Philosophical Context
Arshavir Blackwell, PhD
This brings to mind the Chinese Room, a thought experiment by philosopher John Searle meant to separate symbol manipulation from understanding. Imagine a person in a room who doesn't speak Chinese. They receive Chinese characters through a slot, consult a massive rulebook, and return perfectly appropriate Chinese responses—despite having no idea what any of the symbols mean.
Arshavir Blackwell, PhD
Searle's point wasn't that the outputs are wrong; they're indistinguishable from a fluent speaker's. His argument is that syntax alone doesn't yield semantics—manipulating symbols according to rules isn't the same thing as grasping meaning. With modern language models, we've essentially built a vastly more complex version of the room, which forces us to revisit the question: at what point, if any, does sophisticated pattern-matching become understanding?
Arshavir Blackwell, PhD
Searle's Chinese Room is unavoidable here: if a system is only manipulating symbols, does anything like "understanding" emerge? Over four decades later, that question is no longer purely philosophical—it's empirical. And the empirical story is complicated.
Arshavir Blackwell, PhD
We know early layers encode low-level patterns. We know deeper layers represent abstractions. But the step from abstraction to meaning—how that happens—is still wide open.
Chapter 3
Approaches to Understanding Meaning
Arshavir Blackwell, PhD
Researchers are approaching this gap from three directions: representational geometry, circuit composition, and compression and concept formation. Each one explains a piece of how mechanisms become meaning.
Arshavir Blackwell, PhD
Let's start with geometry—the shape of concepts inside activation space.
Arshavir Blackwell, PhD
Imagine the word "puppy." In human semantic judgments, "puppy" sits close to "dog" and "pet," far from "rock" or "pillow." When you compute similarity in high-dimensional model activations, you see the same structure. That's representational similarity analysis, or RSA.
Arshavir Blackwell, PhD
RSA tells us where abstract distinctions become explicit. It shows that "puppy" and "dog" live near each other in activation space, and that semantically related words tend to cluster together.
Arshavir Blackwell, PhD
But RSA is correlational. It reveals structure—it does not tell us what computations produced that structure. It gives us the map, not the generative law.
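To make that concrete, here is a minimal sketch of what an RSA computation looks like in code. The word list, the human similarity ratings, and the random stand-in for model activations are all illustrative placeholders, not real data:

```python
# Minimal RSA sketch; the words, human ratings, and "activations" are placeholders.
import numpy as np
from scipy.stats import spearmanr

words = ["puppy", "dog", "pet", "rock", "pillow"]

# Hypothetical human similarity judgments (1 = near-synonyms, 0 = unrelated)
human_sim = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.8, 0.1, 0.1],
    [0.8, 0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.2, 0.2, 1.0],
])

# Stand-in for real model activations (one hidden-state vector per word)
rng = np.random.default_rng(0)
acts = rng.standard_normal((len(words), 768))
acts /= np.linalg.norm(acts, axis=1, keepdims=True)
model_sim = acts @ acts.T  # cosine similarity matrix in activation space

# RSA: correlate the two similarity structures over the unique word pairs
pairs = np.triu_indices(len(words), k=1)
rho, _ = spearmanr(model_sim[pairs], human_sim[pairs])
print(f"representational similarity (Spearman rho): {rho:.2f}")
```

Notice that the only output is a single correlation between two similarity matrices, which is exactly why RSA is descriptive rather than causal.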
Arshavir Blackwell, PhD
Still, there's an intriguing pattern here. Some recent work—sometimes called the Platonic Representation Hypothesis—suggests that different architectures trained on different data may converge on similar geometric layouts for concepts.
Arshavir Blackwell, PhD
The evidence is contested and the claims are bold, but if the hypothesis holds up, it would hint that certain conceptual organizations are necessary for language modeling. That's not "understanding," but it is a pattern worth paying attention to.
Chapter 4
Circuit Composition
Arshavir Blackwell, PhD
Now let's zoom down from geometry to circuits.
Arshavir Blackwell, PhD
Mechanistic interpretability identifies specific computational motifs: induction heads, name-mover heads, heads involved in pronoun resolution. In small models, we can trace chains of these heads, end-to-end, using tools like ACDC.
Arshavir Blackwell, PhD
ACDC—Automatic Circuit Discovery. ACDC works by systematically ablating or replacing parts of the network and measuring how those interventions change the model's behavior.
Arshavir Blackwell, PhD
Instead of guessing which heads matter, ACDC isolates the minimal set of components that causally contribute to a specific capability, then maps the pathways through which information flows.
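ACDC automates that search over thousands of components, but the intervention it repeats is simple: silence one piece of the network and measure how much the behavior degrades. Here is a rough sketch of a single such intervention, written with the TransformerLens library. The prompt and the choice of head (layer 9, head 9, a name-mover head from the indirect-object-identification work on GPT-2 small) are illustrative; this is one probe, not a full ACDC run:

```python
# A single head-ablation, the building block ACDC repeats many times.
# Assumes the TransformerLens library (pip install transformer_lens).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def logit_diff(logits: torch.Tensor) -> float:
    # How strongly the model prefers the correct name over the distractor
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

clean_logits = model(tokens)

def zero_head(z, hook):
    # hook_z is [batch, seq, n_heads, d_head]; silence head 9 in this layer
    z[:, :, 9, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.9.attn.hook_z", zero_head)]
)

print(f"clean logit diff:   {logit_diff(clean_logits):.3f}")
print(f"ablated logit diff: {logit_diff(ablated_logits):.3f}")
```

If the preference collapses when the head is silenced, that head is causally implicated; ACDC runs tests of this kind across the network and keeps only the components that matter.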
Arshavir Blackwell, PhD
Take pronoun resolution in a sentence like "The teacher told the student that she might fail the exam." In small GPT-2 variants, researchers can map exactly which heads carry which pieces of information.
Arshavir Blackwell, PhD
But scale is the problem. Frontier models have hundreds of billions of parameters. Their heads are polysemantic—one head handles possession in one context and clause boundaries in another. Mechanistic decomposition doesn't scale cleanly.
Arshavir Blackwell, PhD
The field currently looks a bit like early biology: we've identified the cell types, but not yet the organs or the systems.
Chapter 5
Information-Theoretic Approach
Arshavir Blackwell, PhD
The third approach is information-theoretic.
Arshavir Blackwell, PhD
Why does the idea of "gravity" function as a concept in a neural network? One hypothesis is that compressing thousands of gravity-related contexts into a single, stable representation reduces prediction error.
Arshavir Blackwell, PhD
Sparse Autoencoders—SAEs—reveal feature directions that look like conceptual atoms. Some seem to capture genuine abstractions like city names, quotation marks, negation.
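For the curious, the core of an SAE is small enough to sketch: reconstruct a batch of activations through a wider hidden layer that is penalized for activating too many features at once. The dimensions, the L1 coefficient, and the random stand-in for real activations below are placeholder choices, not a recipe from any particular paper:

```python
# Minimal sparse-autoencoder sketch in PyTorch; data and hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through a wider, sparsity-encouraged code."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # mostly-zero feature activations
        recon = self.decoder(features)
        return recon, features

d_model, d_dict, l1_coeff = 768, 8 * 768, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of real residual-stream activations from a language model
acts = torch.randn(256, d_model)

recon, features = sae(acts)
reconstruction_loss = ((recon - acts) ** 2).mean()
sparsity_loss = features.abs().sum(dim=-1).mean()
loss = reconstruction_loss + l1_coeff * sparsity_loss

opt.zero_grad()
loss.backward()
opt.step()
```

The tension in that loss, reconstruct faithfully but with as few active features as possible, is exactly the expressiveness-versus-compression balance the hypothesis is about.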
Arshavir Blackwell, PhD
The theory is elegant: meaning emerges as the equilibrium between expressiveness and compression. But this remains more theoretical speculation than established result. We still don't know which compressed representations stabilize into robust concepts and which remain convenient shortcuts.
Arshavir Blackwell, PhD
This distinction matters. If "gravity" is a true abstraction, it generalizes. If it's just a statistical patch, it breaks when you push it outside training data.
Arshavir Blackwell, PhD
We're still sorting out which is which—and we don't yet have reliable methods to tell them apart.
Chapter 6
Challenges Across Approaches
Arshavir Blackwell, PhD
Across all three approaches—geometry, circuits, compression—there's a shared challenge: representations evolve as they move through the network.
Arshavir Blackwell, PhD
"Negation" in layer 8 is not "negation" in layer 12. Features change character. They're dynamic, not static.
Arshavir Blackwell, PhD
The logit lens makes this visible. The logit lens, as you may recall, is a technique that projects intermediate model activations directly into vocabulary space so we can see, at each layer, which tokens the model is already implicitly predicting before the final output head. Early layers track local context; middle layers build structure; late layers assemble full semantic predictions.
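Here is roughly what that looks like in practice for GPT-2, using the Hugging Face transformers library: take each layer's hidden state, pass it through the final layer norm and the unembedding matrix, and read off the top token. The prompt is just an example:

```python
# Logit-lens sketch for GPT-2 via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states holds the residual stream after the embedding and after each block.
# Project each one through the final LayerNorm and the unembedding matrix.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    guess = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: current top guess for the next token = {guess!r}")
```

Watching where the final answer first shows up as the top guess gives a rough, layer-by-layer picture of the progression just described.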
Arshavir Blackwell, PhD
Tracking a feature across layers is tracking a process, not a location. And that makes interpretability fundamentally harder.
Arshavir Blackwell, PhD
To bridge circuits and concepts, we need tools that operate across scales.
Chapter 7
Promising Research Programs
Arshavir Blackwell, PhD
Building on these three foundational approaches, several research programs look promising: representational geometry at scale, causal composition studies, and tracking representational dynamics.
Arshavir Blackwell, PhD
Representational Geometry at Scale: Platforms like Neuronpedia let us explore SAE-extracted features and examine whether semantic boundaries—between "causation" and "correlation," for example—are stable or distorted.
Arshavir Blackwell, PhD
SAEBench provides standardized evaluations for SAE quality, helping us understand which extraction methods produce the most interpretable features. Together, these tools let us ask whether geometric distortions relate to model failures.
Arshavir Blackwell, PhD
Causal Composition Studies: This is where researchers don't just observe circuits—they test them. Ablate a head, and see if the behavior collapses.
Arshavir Blackwell, PhD
In small models, we can do full causal reconstructions, like the IOI circuit work. IOI, short for indirect object identification, is the task of completing a sentence like "When Mary and John went to the store, John gave a drink to..." with the right name. Researchers traced the full circuit GPT-2 small uses to do it: duplicate-token heads, inhibition heads, and name-mover heads coordinating to route the correct name to the output. In larger models, that kind of end-to-end reconstruction becomes intractable, so we'll need hierarchical approaches.
Arshavir Blackwell, PhD
Tracking Representational Dynamics: An emerging direction—not yet a formalized research program, but a growing area of interest—involves tracking how representations evolve across layers, especially for ambiguous or polysemous words.
Arshavir Blackwell, PhD
This approach fits the transformer architecture more naturally than static diagrams, and several research groups are developing methods here.
Arshavir Blackwell, PhD
None of these solve the missing middle, but they give us footholds.
Chapter 8
Final Reflections
Arshavir Blackwell, PhD
All of this brings us back to the philosophical question we can no longer dodge.
Arshavir Blackwell, PhD
If a model's internal geometry mirrors human semantic structure—does that count as understanding? Or is it sophisticated mimicry?
Arshavir Blackwell, PhD
Mechanistic interpretability gives us one answer: we understand the model when we can predict and explain its behavior across contexts.
Arshavir Blackwell, PhD
LLM-ology gives another: we understand models by comparing their cognitive patterns to ours—where they align and where they diverge.
Arshavir Blackwell, PhD
These approaches aren't competitors. Together, they let us study synthetic cognition with the precision of engineering and the curiosity of cognitive science.
Arshavir Blackwell, PhD
Before we get too optimistic, the obstacles are real:
Arshavir Blackwell, PhD
Scale: Frontier models are too large for end-to-end circuit mapping.
Polysemanticity: Features do more than one thing, depending on context.
Validation: We have no ground truth for "what the model is really doing."
Stability: Representations shift across training checkpoints and vary with context.
The Moving Target: Interpretability must evolve as rapidly as capabilities do.
Arshavir Blackwell, PhD
These aren't minor details. They define the core difficulty of the field.
Arshavir Blackwell, PhD
Right now, we're in a position somewhat analogous to where neuroscience was about a century ago. We can record. We can map. We can stimulate. But meaning is still elusive.
Arshavir Blackwell, PhD
The analogy has limits, though. Unlike neuroscientists studying the brain, we have complete observational access to every weight and activation in these systems. The challenge isn't access—it's interpretation. We can see everything, but we don't yet know what we're looking at.
Arshavir Blackwell, PhD
Mechanistic interpretability gives us the microscope. Representational geometry gives us the staining techniques. Artificial Psycholinguistics gives us comparative anatomy.
Arshavir Blackwell, PhD
If we can integrate these, we might eventually build a genuine science of synthetic cognition—one that illuminates artificial intelligence and human intelligence at the same time.
Arshavir Blackwell, PhD
But the missing middle remains real. It won't be solved quickly. And overstating our progress only sets the field back.
Arshavir Blackwell, PhD
Still—every new method, every mapped circuit, every clarified abstraction moves us closer to understanding how large language models think, or whatever word we ultimately end up using for what they do.
Arshavir Blackwell, PhD
If you want to keep exploring these questions—the ones that sit between engineering, cognitive science, and philosophy—subscribe to Inside the Black Box, at insidetheblackbox.ai. And feel free to share the episode with someone who works in this space and wants deeper discussions about how these systems actually work.
Arshavir Blackwell, PhD
Thanks for listening. I'm Arshavir Blackwell, and I'll see you next time inside the black box.
