Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education

Bridging Circuits and Concepts in Large Language Models

How do millions of computations inside large language models add up to something like understanding? This episode explores the latest breakthroughs in mechanistic interpretability, showing how tools like representational geometry, circuit decomposition, and compression theory illuminate the missing middle between circuits and meaning. Join Arshavir Blackwell as he opens the black box and challenges what we really mean by 'understanding' in machines.


Chapter 1

Introduction

Arshavir Blackwell, PhD

Welcome back to Inside the Black Box, where we take apart large language models and look directly at the machinery—what it's doing, why it works, and where our understanding still has blind spots. I'm your host, Arshavir Blackwell.

Arshavir Blackwell, PhD

In the last few episodes, we talked about mechanistic interpretability—reverse-engineering neural networks into circuits and features—and about what some call "LLM-ology," or Artificial Psycholinguistics, the cognitive-science-style analysis of these systems as behavioral entities.

Arshavir Blackwell, PhD

Today, we're tackling the core question that everything else points to: How do low-level mathematical mechanisms become high-level thought?

Arshavir Blackwell, PhD

Knowing which attention head copies a name across a sentence is useful. But it doesn't explain how the model writes a coherent paragraph about French history, complete with the Eiffel Tower, the French Revolution, and the Napoleonic Wars. Somewhere between the artificial neurons and the ideas is a missing middle.

Arshavir Blackwell, PhD

Mechanistic interpretability can tell us which head tracks subject–verb agreement or which neurons light up around negation. But those are fragments. The real puzzle is how they combine into something that feels like structured reasoning.

Chapter 2

Philosophical Context

Arshavir Blackwell, PhD

This brings to mind the Chinese Room, a thought experiment by philosopher John Searle meant to separate symbol manipulation from understanding. Imagine a person in a room who doesn't speak Chinese. They receive Chinese characters through a slot, consult a massive rulebook, and return perfectly appropriate Chinese responses—despite having no idea what any of the symbols mean.

Arshavir Blackwell, PhD

Searle's point wasn't that the outputs are wrong; they're indistinguishable from a fluent speaker's. His argument is that syntax alone doesn't yield semantics—manipulating symbols according to rules isn't the same thing as grasping meaning. With modern language models, we've essentially built a vastly more complex version of the room, which forces us to revisit the question: at what point, if any, does sophisticated pattern-matching become understanding?

Arshavir Blackwell, PhD

Searle's Chinese Room is unavoidable here: if a system is only manipulating symbols, does anything like "understanding" emerge? Over four decades later, that question is no longer purely philosophical—it's empirical. And the empirical story is complicated.

Arshavir Blackwell, PhD

We know early layers encode low-level patterns. We know deeper layers represent abstractions. But the step from abstraction to meaning—how that happens—is still wide open.

Chapter 3

Approaches to Understanding Meaning

Arshavir Blackwell, PhD

Researchers are approaching this gap from three directions: representational geometry, circuit composition, and compression and concept formation. Each one explains a piece of how mechanisms become meaning.

Arshavir Blackwell, PhD

Let's start with geometry—the shape of concepts inside activation space.

Arshavir Blackwell, PhD

Imagine the word "puppy." In human semantic judgments, "puppy" sits close to "dog" and "pet," far from "rock" or "pillow." When you compute similarity in high-dimensional model activations, you see the same structure. That's representational similarity analysis, or RSA.

Arshavir Blackwell, PhD

RSA tells us where abstract distinctions become explicit. It shows that "puppy" and "dog" live near each other in activation space, and that semantically related words tend to cluster together.
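
As a concrete illustration, here is a minimal RSA sketch in Python using NumPy and SciPy. The word list, the random "activations," and the human similarity matrix are hypothetical placeholders standing in for real model activations and behavioral ratings; the point is only to show the shape of the computation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

words = ["puppy", "dog", "pet", "rock", "pillow"]

# Stand-in for activation vectors pulled from one layer of a model,
# one 768-dimensional vector per word (real RSA would use actual activations).
activations = np.random.randn(len(words), 768)

# Hypothetical human similarity ratings (1.0 = identical), illustrative only.
human_similarity = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.1, 0.2],
    [0.8, 0.8, 1.0, 0.1, 0.3],
    [0.1, 0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 0.3, 1.0],
])

# Build representational dissimilarity matrices (RDMs) for both spaces.
model_rdm = squareform(pdist(activations, metric="cosine"))
human_rdm = 1.0 - human_similarity

# RSA: correlate the upper triangles of the two RDMs.
iu = np.triu_indices(len(words), k=1)
rho, _ = spearmanr(model_rdm[iu], human_rdm[iu])
print(f"RSA (Spearman rho between model and human geometry): {rho:.3f}")
```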

Arshavir Blackwell, PhD

But RSA is correlational. It reveals structure—it does not tell us what computations produced that structure. It gives us the map, not the generative law.

Arshavir Blackwell, PhD

Still, there's an intriguing pattern here. Some recent work—sometimes called the Platonic Representation Hypothesis—suggests that different architectures trained on different data may converge on similar geometric layouts for concepts.

Arshavir Blackwell, PhD

The claims are strong and the evidence is still contested, but if the hypothesis holds up, it would hint that certain conceptual organizations are necessary for language modeling. That's not "understanding," but it is a pattern worth paying attention to.

Chapter 4

Circuit Composition

Arshavir Blackwell, PhD

Now let's zoom down from geometry to circuits.

Arshavir Blackwell, PhD

Mechanistic interpretability identifies specific computational motifs: induction heads, name-mover heads, heads involved in pronoun resolution. In small models, we can trace chains of these heads, end-to-end, using tools like ACDC.

Arshavir Blackwell, PhD

ACDC stands for Automatic Circuit DisCovery. It works by systematically ablating or replacing parts of the network and measuring how those interventions change the model's behavior.

Arshavir Blackwell, PhD

Instead of guessing which heads matter, ACDC isolates the minimal set of components that causally contribute to a specific capability, then maps the pathways through which information flows.
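
To make the procedure concrete, here is a heavily simplified greedy sketch of the ablate-and-measure idea. Real ACDC recursively prunes edges in the model's computational graph; here, `run_with_ablation` and the component list are hypothetical stand-ins for model-specific hooks.

```python
# Heavily simplified greedy sketch of ablate-and-measure circuit discovery.
# `run_with_ablation` is a hypothetical hook: it runs the model on the prompt
# with the given components knocked out and returns a task metric.

def discover_circuit(model, prompt, components, run_with_ablation, threshold=0.05):
    baseline = run_with_ablation(model, prompt, ablated=set())
    kept = set(components)
    for comp in components:
        candidate = kept - {comp}
        # Ablate everything outside the candidate circuit and re-measure the task.
        score = run_with_ablation(model, prompt, ablated=set(components) - candidate)
        if abs(baseline - score) < threshold:
            kept = candidate  # removing this component barely matters; prune it
    return kept  # approximate minimal set of causally necessary components
```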

Arshavir Blackwell, PhD

Take pronoun resolution in a sentence like: "The teacher told the student that she might fail the exam." In small GPT-2 variants, researchers can map exactly which heads carry which pieces of information needed to resolve "she."

Arshavir Blackwell, PhD

But scale is the problem. Frontier models have hundreds of billions of parameters. Their heads are polysemantic—one head handles possession in one context and clause boundaries in another. Mechanistic decomposition doesn't scale cleanly.

Arshavir Blackwell, PhD

The field currently looks a bit like early biology: we've identified the cell types, but not yet the organs or the systems.

Chapter 5

Information-Theoretic Approach

Arshavir Blackwell, PhD

The third approach is information-theoretic.

Arshavir Blackwell, PhD

Why does the idea of "gravity" function as a concept in a neural network? One hypothesis is that compressing thousands of gravity-related contexts into a single, stable representation reduces prediction error.

Arshavir Blackwell, PhD

Sparse Autoencoders—SAEs—reveal feature directions that look like conceptual atoms. Some seem to capture genuine abstractions like city names, quotation marks, negation.
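
For readers who like to see the mechanics, here is a minimal sparse autoencoder sketch in PyTorch. The dimensions, the ReLU encoder, and the L1 penalty weight are illustrative choices, not the setup of any particular published SAE.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete, sparse code."""
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse "feature" activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(32, 768)  # stand-in for real residual-stream activations
recon, features = sae(acts)

# Training objective: reconstruct well while keeping features sparse (L1 penalty).
l1_weight = 1e-3  # illustrative value
loss = ((recon - acts) ** 2).mean() + l1_weight * features.abs().mean()
```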

Arshavir Blackwell, PhD

The theory is elegant: meaning emerges as the equilibrium between expressiveness and compression. But this remains more theoretical speculation than established result. We still don't know which compressed representations stabilize into robust concepts and which remain convenient shortcuts.

Arshavir Blackwell, PhD

This distinction matters. If "gravity" is a true abstraction, it generalizes. If it's just a statistical patch, it breaks when you push it outside training data.

Arshavir Blackwell, PhD

We're still sorting out which is which—and we don't yet have reliable methods to tell them apart.

Chapter 6

Challenges Across Approaches

Arshavir Blackwell, PhD

Across all three approaches—geometry, circuits, compression—there's a shared challenge: representations evolve as they move through the network.

Arshavir Blackwell, PhD

"Negation" in layer 8 is not "negation" in layer 12. Features change character. They're dynamic, not static.

Arshavir Blackwell, PhD

The logit lens makes this visible. The logit lens, as you may recall, is a technique that projects intermediate model activations directly into vocabulary space so we can see, at each layer, which tokens the model is already implicitly predicting before the final output head. Early layers track local context; middle layers build structure; late layers assemble full semantic predictions.
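
A minimal logit-lens sketch, assuming a GPT-2 checkpoint loaded through Hugging Face transformers. Applying the final layer norm before the unembedding follows the common recipe, though implementations differ in the details.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# For the last token position, project each layer's hidden state through the
# final layer norm and the unembedding to see what it already "predicts".
for layer, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(h)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer:2d}: {top_token!r}")
```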

Arshavir Blackwell, PhD

Tracking a feature across layers is tracking a process, not a location. And that makes interpretability fundamentally harder.

Arshavir Blackwell, PhD

To bridge circuits and concepts, we need tools that operate across scales.

Chapter 7

Promising Research Programs

Arshavir Blackwell, PhD

Building on these three foundational approaches, several research programs look promising: representational geometry at scale, causal composition studies, and tracking representational dynamics.

Arshavir Blackwell, PhD

Representational Geometry at Scale: Platforms like Neuronpedia let us explore SAE-extracted features and examine whether semantic boundaries—between "causation" and "correlation," for example—are stable or malformed.

Arshavir Blackwell, PhD

SAEBench provides standardized evaluations for SAE quality, helping us understand which extraction methods produce the most interpretable features. Together, these tools let us ask whether geometric distortions relate to model failures.

Arshavir Blackwell, PhD

Causal Composition Studies: This is where researchers don't just observe circuits—they test them. Ablate a head, and see if the behavior collapses.

Arshavir Blackwell, PhD

In small models, we can do full causal reconstructions, like the IOI circuit work. IOI—Indirect Object Identification—you may recall, is the task of completing a sentence like "When Mary and John went to the store, John gave a drink to..." with the indirect object, "Mary." Researchers mapped the circuit of attention heads and MLP components in GPT-2 small that coordinate to pull that off. In larger models, that kind of end-to-end reconstruction becomes intractable, so we'll need hierarchical approaches.
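
As a rough illustration of the metric used in that line of work, here is a sketch that measures the logit difference between the indirect object and the repeated name on an IOI-style prompt. It reuses the GPT-2 loading pattern from the logit-lens example and is illustrative, not a reproduction of the original experimental setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# IOI metric: how strongly the model prefers the indirect object ("Mary")
# over the repeated subject ("John") as the next token.
mary_id = tokenizer.encode(" Mary")[0]
john_id = tokenizer.encode(" John")[0]
logit_diff = (next_token_logits[mary_id] - next_token_logits[john_id]).item()
print(f"logit difference (Mary - John): {logit_diff:.3f}")
```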

Arshavir Blackwell, PhD

Tracking Representational Dynamics: An emerging direction—not yet a formalized research program, but a growing area of interest—involves tracking how representations evolve across layers, especially for ambiguous or polysemous words.

Arshavir Blackwell, PhD

This approach fits the transformer architecture more naturally than static diagrams, and several research groups are developing methods here.

Arshavir Blackwell, PhD

None of these solve the missing middle, but they give us footholds.

Chapter 8

Final Reflections

Arshavir Blackwell, PhD

All of this brings us back to the philosophical question we can no longer dodge.

Arshavir Blackwell, PhD

If a model's internal geometry mirrors human semantic structure—does that count as understanding? Or is it sophisticated mimicry?

Arshavir Blackwell, PhD

Mechanistic interpretability gives us one answer: we understand the model when we can predict and explain its behavior across contexts.

Arshavir Blackwell, PhD

LLM-ology gives another: we understand models by comparing their cognitive patterns to ours—where they align and where they diverge.

Arshavir Blackwell, PhD

These approaches aren't competitors. Together, they let us study synthetic cognition with the precision of engineering and the curiosity of cognitive science.

Arshavir Blackwell, PhD

Before we get too optimistic, the obstacles are real:

Arshavir Blackwell, PhD

Scale: Frontier models are too large for end-to-end circuit mapping.
Polysemanticity: Features do more than one thing, depending on context.
Validation: We have no ground truth for "what the model is really doing."
Stability: Representations shift across training checkpoints and vary with context.
The Moving Target: Interpretability must evolve as rapidly as capabilities do.

Arshavir Blackwell, PhD

These aren't minor details. They define the core difficulty of the field.

Arshavir Blackwell, PhD

Right now, we're in a position somewhat analogous to where neuroscience was about a century ago. We can record. We can map. We can stimulate. But meaning is still elusive.

Arshavir Blackwell, PhD

The analogy has limits, though. Unlike neuroscientists studying the brain, we have complete observational access to every weight and activation in these systems. The challenge isn't access—it's interpretation. We can see everything, but we don't yet know what we're looking at.

Arshavir Blackwell, PhD

Mechanistic interpretability gives us the microscope. Representational geometry gives us the staining techniques. Artificial Psycholinguistics gives us comparative anatomy.

Arshavir Blackwell, PhD

If we can integrate these, we might eventually build a genuine science of synthetic cognition—one that illuminates artificial intelligence and human intelligence at the same time.

Arshavir Blackwell, PhD

But the missing middle remains real. It won't be solved quickly. And overstating our progress only sets the field back.

Arshavir Blackwell, PhD

Still—every new method, every mapped circuit, every clarified abstraction moves us closer to understanding how large language models think, or whatever word we ultimately end up using for what they do.

Arshavir Blackwell, PhD

If you want to keep exploring these questions—the ones that sit between engineering, cognitive science, and philosophy—subscribe to Inside the Black Box, at insidetheblackbox.ai. And feel free to share the episode with someone who works in this space and wants deeper discussions about how these systems actually work.

Arshavir Blackwell, PhD

Thanks for listening. I'm Arshavir Blackwell, and I'll see you next time inside the black box.