Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning


Inside Circuits: How Large Language Models Understand

Dive into the world of neural circuits within large language models. In this episode, Arshavir Blackwell unpacks how transformer circuits, attention mechanisms, and high-dimensional geometry combine to create the magic—and limits—of modern AI language systems.


Chapter 1

Circuits Inside Large Language Models

Arshavir Blackwell, PhD

Welcome back to Inside the Black Box. Today we're looking at circuits inside large language models. When we say "circuits," it sounds almost anachronistic, as though we're talking about wires inside a robot. But in the context of language models, a circuit is a pattern: a reusable subnetwork of neurons, attention heads, and layers that specializes in a specific task. Not general intelligence, but something narrow. Matching parentheses in a long equation. Tracking who did what to whom in a complex sentence.

Arshavir Blackwell, PhD

Researchers have actually reverse-engineered these circuits. Instead of treating the model as an undifferentiated tangle of weights, you can trace how a particular set of operations—maybe a few attention heads in certain layers—performs one function, almost like a mini-program that emerged over the course of training. Some heads focus on quotes or brackets. Others link pronouns to their antecedents. The specialization is often clearer than you'd expect—analogous to finding dedicated brain regions, but inside a digital network.

Arshavir Blackwell, PhD

This connects to old debates in cognitive science about modularity. The question was whether you need rigid, discrete modules for sophisticated behavior, or whether specialization can emerge from more distributed systems. What we're finding in transformers suggests you can get both: no hard-coded modules, but clear pockets of functional specialization.

Arshavir Blackwell, PhD

How do we find these circuits? There's no indicator light that says "circuit detected." Instead, we have a toolkit: activation patching, ablations, synthetic input probing. Activation patching is particularly useful. You swap activation values between runs to identify which components carry specific information, like the memory of a subject's name across a sentence. You're essentially debugging by substitution.
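
To make that concrete, here is a minimal sketch of the cache-then-substitute pattern on a toy two-layer PyTorch network. TinyNet, its layer names, and the random inputs are illustrative stand-ins, not components of any real language model; in practice you would patch a single attention head or token position rather than a whole layer.

```python
# A minimal sketch of activation patching, assuming only PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Toy stand-in for a deep model with an internal component we can patch."""
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = TinyNet()
clean_input = torch.randn(1, 8)    # run where the model has the information
corrupt_input = torch.randn(1, 8)  # run where that information is disturbed

# 1. Cache the clean run's activation at the component we care about.
cache = {}
def save_hook(module, inp, out):
    cache["layer1"] = out.detach()

handle = model.layer1.register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, but substitute the clean activation.
def patch_hook(module, inp, out):
    return cache["layer1"]  # returning a tensor replaces the module's output

handle = model.layer1.register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

corrupt_out = model(corrupt_input)

# If patching this component moves the corrupted output back toward the clean
# output, that component is carrying the information we swapped in.
print("clean vs corrupt distance:", torch.dist(clean_out, corrupt_out).item())
print("clean vs patched distance:", torch.dist(clean_out, patched_out).item())
```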

Arshavir Blackwell, PhD

Ablations take a different approach: you knock out certain attention heads or neurons and observe whether the model fails at a task. Turn off what you suspect is the bracket-matching head—does it start producing unbalanced parentheses? Sometimes yes. Sometimes, surprisingly, no, because redundancy is pervasive in these systems.
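
A zero-ablation is even simpler to sketch. The tensor below is a generic per-head attention output of shape (seq_len, n_heads, d_head), the kind of view interpretability tooling typically exposes; the shapes and the choice of head are illustrative assumptions, not GPT-2's actual layout.

```python
# A minimal sketch of zero-ablating one attention head's contribution.
import torch

torch.manual_seed(0)
seq_len, n_heads, d_head = 10, 4, 16

# Pretend this is the per-head output for one sequence: (seq_len, n_heads, d_head).
head_outputs = torch.randn(seq_len, n_heads, d_head)

def ablate_head(head_outputs, head_idx):
    """Zero-ablate a single head: remove its contribution at every position."""
    out = head_outputs.clone()
    out[:, head_idx, :] = 0.0
    return out

ablated = ablate_head(head_outputs, head_idx=2)

# Downstream, you would recombine the heads, finish the forward pass, and
# compare task performance (say, bracket matching) with and without the
# ablation. Redundancy shows up as little or no change in the task metric.
print("head 2 mass before:", head_outputs[:, 2, :].abs().sum().item())
print("head 2 mass after: ", ablated[:, 2, :].abs().sum().item())
```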

Arshavir Blackwell, PhD

One well-documented example is the IOI circuit in GPT-2. IOI stands for Indirect Object Identification—the circuit that lets the model track "who did what to whom." Not just spotting names, but following relationships, a kind of proto-grammar built from the ground up. Researchers dissected this by tracing activations, severing connections, feeding the model hand-crafted sentences, and observing what activated.
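
The probing side of that workflow is easy to illustrate. Below is a hedged sketch of scoring one hand-crafted IOI-style prompt with the public GPT-2 weights via the Hugging Face transformers library; the specific sentence and the logit-difference metric are illustrative, and this only measures the model's behavior, it does not by itself locate the circuit.

```python
# A sketch of synthetic-prompt probing for indirect object identification.
# Assumes the `transformers` package and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "When John and Mary went to the store, John gave a drink to"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the last position

# Logit difference between the indirect object ("Mary") and the subject ("John").
# A large positive gap means the model is tracking "who did what to whom."
mary_id = tok(" Mary")["input_ids"][0]
john_id = tok(" John")["input_ids"][0]
print("logit(Mary) - logit(John) =", (logits[mary_id] - logits[john_id]).item())
```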

Arshavir Blackwell, PhD

The harder question is whether these circuits can be steered or edited. Sometimes you can adjust a head or modify an input and the model's output changes predictably. But often things get complicated quickly. Circuits overlap, reuse components, and shift their roles depending on the input. And if you compare across models—is the bracket-matching circuit in GPT-2 the same as in GPT-3?—the answer varies. Sometimes there's clear correspondence. Sometimes the architecture has reorganized entirely. It's evolutionary, but with gradients instead of DNA.
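
When steering does work, it often amounts to adding a vector to an intermediate activation and watching the output shift. The sketch below does that on the same kind of toy network as before; TinyNet and the steering direction are stand-ins chosen for illustration, not a recipe for editing a production model.

```python
# A minimal sketch of steering by activation addition, assuming only PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = TinyNet()
x = torch.randn(1, 8)
steer = 3.0 * torch.randn(1, 8)  # direction we hypothesize the circuit uses

def steer_hook(module, inp, out):
    return out + steer  # nudge the intermediate representation along that direction

baseline = model(x)
handle = model.layer1.register_forward_hook(steer_hook)
steered = model(x)
handle.remove()

# If the hypothesized direction really drives a behavior, the output should
# move in a predictable way; overlapping circuits show up as side effects.
print("output shift:", torch.dist(baseline, steered).item())
```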

Arshavir Blackwell, PhD

These circuits do clever, sometimes human-like things. But there's nothing mystical about them. They're emergent, not magic: they appear as a product of optimization, not intention. Training nudges the model toward statistically useful structures, and over millions of updates, these subnetworks get shaped by the data. The geometry does most of the work. If you examine the model's internal space, you find regions where certain tasks align—not because the system intends it, but because that configuration minimizes loss.

Arshavir Blackwell, PhD

Here's where it gets interesting. High-dimensional geometry isn't perfect. A circuit that performs well in 99 percent of cases can fail completely on edge cases. The IOI circuit, for instance—feed it a sentence with unusual word order or a rare construction, and the activations may go wrong. The model's internal geometry hasn't carved out a clean region for that input, and the output might slip into nonsense or confabulation. This connects back to the hallucination problem we discussed in a previous episode, as well as the Mandela Effect. The same compression that enables generalization also creates failure modes.

Arshavir Blackwell, PhD

There is a visual resemblance between brain imaging data and transformer activation plots: sparse bursts of activity, clustered regions lighting up, structure emerging from apparent noise. In both cases, these patterns are shaped in part by how high-dimensional systems are projected into interpretable visual space. But visible organization should not be mistaken for intent. Purposeful-looking structure can arise without intrinsic goals or agency, much as the brain readily perceives faces in cloud patterns.

Arshavir Blackwell, PhD

In artificial systems, these patterns emerge from mathematics coupled to optimization constraints—high-dimensional geometry shaped by architecture, data distributions, and objective functions. The resulting structure is neither arbitrary nor illusory, but it is not intentional. What we observe is teleonomy: apparent purpose produced by optimization rather than by minds.

Arshavir Blackwell, PhD

This raises a deeper question. Is intelligence best understood not as an inner symbolic faculty or animating spark, but as causally active structure—stable, reusable patterns carved into high-dimensional space by learning dynamics? We do not yet have a definitive answer. But each careful analysis of these digital circuits sharpens our understanding of how complex behavior can arise from simple rules, and in doing so, offers a clearer lens on cognition itself.

Arshavir Blackwell, PhD

I’m Arshavir Blackwell, and this has been Inside the Black Box.