Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Decoding GPT's Hidden Circuits

Explore how sparse autoencoders and transcoders unveil the inner workings of GPT-2 by revealing functional features and computational circuits. Discover breakthrough methods that shift from observing raw network activations to mapping the model's actual computation, making AI behavior more interpretable than ever.

Chapter 1

Arshavir Blackwell, PhD

Welcome back to Inside the Black Box. I'm Arshavir Blackwell, and today I want to dig into how we move from knowing what information lives inside a neural network to understanding what the model actually does with that information—at the level of computation, not just storage.

Arshavir Blackwell, PhD

Let's start with sparse autoencoders, or SAEs. If you've listened to earlier episodes, you'll remember SAEs as one of the first real tools for interpretability work. They take high-dimensional activation vectors and decompose them into sparse features—forcing most activations to zero, surfacing the critical few that matter for a given input. Instead of "neuron 347 fired," you get something like "the feature corresponding to legal terminology activated" or "the feature for Python syntax lit up."

Arshavir Blackwell, PhD

But SAEs have a fundamental limitation: they're observers, not explainers. Their training objective is reconstruction—encode the input activations, then decode back to something as close to the original as possible. So if the model is packaging information about syntax, semantics, context, quirks, maybe even noise—SAEs will catch all of it, whether or not any of it is actually used downstream.
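
To make that concrete, here is a minimal PyTorch sketch of the idea: an encoder, a ReLU, a decoder, and an L1 sparsity penalty, trained to reconstruct the input activations. The class and function names and the hyperparameters are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy SAE: decompose a d_model-dimensional activation into sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = F.relu(self.encoder(activations))   # sparse feature activations
        reconstruction = self.decoder(features)        # map back into activation space
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    recon_loss = F.mse_loss(reconstruction, activations)        # match the *input* activations
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()     # push most features to zero
    return recon_loss + sparsity
```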

Arshavir Blackwell, PhD

Think of it like photographing raw materials on a loading dock. You can catalog what's there—metals, plastics, components—but you have no idea what the factory makes from them. SAEs tell us what's present, not what's actually used in the computation.

Arshavir Blackwell, PhD

So the question becomes: can we build an interpretability tool that captures only what matters for the output? That's the insight behind transcoders.

Arshavir Blackwell, PhD

The architecture is almost identical to an SAE—encoder, nonlinearity, decoder, sparsity penalty. But the training objective is different. Instead of asking "can you reconstruct the input?" you ask "can you predict the output of this component?" It's a small change with significant consequences.
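
As a sketch of how small the change is, you can reuse the same encoder/decoder architecture from the SAE snippet above and simply score it against the MLP's output instead of its input. Again, the names here are illustrative.

```python
import torch.nn.functional as F

def transcoder_loss(mlp_input, mlp_output, transcoder, l1_coeff=1e-3):
    """Same architecture as the SAE sketch; only the prediction target changes."""
    features, prediction = transcoder(mlp_input)
    pred_loss = F.mse_loss(prediction, mlp_output)            # predict the MLP's *output*
    sparsity = l1_coeff * features.abs().sum(dim=-1).mean()   # same sparsity penalty as before
    return pred_loss + sparsity
```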

Arshavir Blackwell, PhD

Changing the target means features that don't drive the computation get filtered out. Going back to the factory analogy: instead of cataloging everything on the dock, you follow materials through the machines and record only what affects the final product. What you get are functionally meaningful features—components that actually participate in the model's computation.

Arshavir Blackwell, PhD

This works particularly well for MLPs. Transformers have many multilayer perceptrons, and each MLP is piecewise linear, or very nearly so, thanks to activations like ReLU or GELU. Think of it as a function that behaves linearly within certain regions of input space, with the nonlinearity acting as a switch between those regions. Transcoders exploit that structure, uncovering a sparse basis for each linear mode. Instead of many features firing at once, you get two or three activated per token, each attached to a specific transformation.
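
If you wanted to check that claim, one rough way is to count nonzero features per token. The snippet below reuses the SparseAutoencoder class from the earlier sketch as a stand-in transcoder, with made-up dimensions; a freshly initialized model will not be sparse, but a trained one should report a small number here.

```python
import torch

transcoder = SparseAutoencoder(d_model=768, n_features=24576)  # stand-in, untrained
mlp_input = torch.randn(4, 128, 768)                           # (batch, seq, d_model), random
features, _ = transcoder(mlp_input)
active_per_token = (features > 0).sum(dim=-1)                  # nonzero features per token
print(active_per_token.float().mean().item())                  # small for a trained transcoder
```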

Arshavir Blackwell, PhD

One useful property of transcoder features is the separation between encoder and decoder vectors. The encoder acts as a pattern detector—it asks "is this input the kind of thing I respond to?" The decoder specifies the transformation—"when I fire, add this direction to the output." The encoder is input-dependent. The decoder is fixed. Before transcoders, these were blended together. Now you can examine a feature and say: it fired because it detected this pattern, and when it fired, it applied this transformation.
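
In code, a single feature's role splits cleanly into those two pieces. A hypothetical helper, assuming the weight shapes noted in the docstring:

```python
import torch
import torch.nn.functional as F

def feature_contribution(x, w_enc, b_enc, w_dec):
    """x: (d_model,) token activation; w_enc, w_dec: (d_model,) vectors; b_enc: scalar."""
    activation = F.relu(x @ w_enc + b_enc)   # pattern detector: "did I see my input pattern?"
    contribution = activation * w_dec        # fixed transformation, scaled by that strength
    return activation, contribution
```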

Arshavir Blackwell, PhD

The original transcoder paper has a nice case study—feature 355, layer 8 of GPT-2 small. The researchers examined its up-circuit: which prior features fed into it, with what weights. From the connection graph alone, they hypothesized it was flagging semicolons inside citations—things like "(Vaswani et al. 2017; Elhage et al. 2021)." Then they checked activations on real data, and the hypothesis held. They read the model's wiring, not just its inputs.

Arshavir Blackwell, PhD

This is where circuit discovery becomes tractable. A transcoder feature in one layer outputs a specific direction. Features in the next layer have encoder vectors sensitive to those directions. Calculate the overlap—the dot product between one feature's decoder and the next feature's encoder—and you get a weighted connection. Layer by layer, these links form a sparse, interpretable graph.
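
The core of that graph-building step is just a matrix of dot products. A simplified sketch, which ignores everything sitting between the two layers (attention, layer norm) that a fuller analysis has to account for:

```python
import torch

def connection_weights(W_dec_early: torch.Tensor, W_enc_later: torch.Tensor) -> torch.Tensor:
    """Input-invariant connection strengths between transcoder features in two layers.

    W_dec_early: (n_early_features, d_model) decoder directions from the earlier layer
    W_enc_later: (n_later_features, d_model) encoder directions from the later layer
    Entry [i, j] is the dot product of early feature i's decoder with later feature j's encoder.
    """
    return W_dec_early @ W_enc_later.T

# Keeping only the strongest incoming weights for a given later-layer feature
# recovers the kind of sparse "up-circuit" described above.
```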

Arshavir Blackwell, PhD

Consider the "greater-than" circuit in GPT-2. Given a prompt like "The war lasted from 1737 to 17…", the model should predict a year greater than 1737. Earlier analyses identified this computation by pointing to dozens of neurons and attention heads—hard to make sense of. Using transcoders, researchers compressed this to under 24 sparse features in layer 10, each with a describable function. That's a significant gain in clarity—not just fewer moving parts, but parts you can actually name.

Arshavir Blackwell, PhD

What's the tradeoff? By predicting output rather than reconstructing input, transcoders capture the functional essence but may miss signals that don't affect the MLP's output. You can't use them to perfectly reconstruct everything passing through the layer. But you get a sparser, more faithful interpretation of what drives the output.

Arshavir Blackwell, PhD

There are still limitations. Transcoders are designed for MLPs specifically, and training is compute-intensive. They approximate rather than replicate the actual MLP, which matters if you want to do targeted edits to model behavior—something I've discussed in earlier episodes on circuit interventions. But for circuit discovery, this approach opens up new ground.

Arshavir Blackwell, PhD

Now, there's a catch buried in the transcoder framing. We've been treating layers as discrete steps—a feature fires in layer 8, connects to a feature in layer 9, and so on. That's a useful simplification, but it's not quite true to how these models work.

Arshavir Blackwell, PhD

Features don't always respect layer boundaries. Some concepts seem to persist across multiple layers, refined or reused as computation unfolds. If you train a separate transcoder for each layer, you end up with separate feature dictionaries—and no guarantee that "feature 42 in layer 6" has any relationship to "feature 42 in layer 7," even if they're tracking the same underlying concept.

Arshavir Blackwell, PhD

That's where crosscoders come in. Instead of training on a single layer's activations, a crosscoder takes input from multiple layers simultaneously—say, the residual stream at layers 6, 8, and 10. The idea is to learn a shared feature dictionary that spans those layers, so you can track a concept as it evolves through the network rather than rediscovering it independently at each step.

Arshavir Blackwell, PhD

The architecture is similar to what we've seen: encoder, sparsity penalty, decoder. But now the encoder reads from a concatenation of layer activations, and the decoder writes back to each of those layers. A single crosscoder feature might activate weakly at layer 6, strongly at layer 8, and fade by layer 10—giving you a trajectory, not just a snapshot.
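
A minimal sketch of that architecture, with illustrative names and the usual caveat that real implementations differ in details such as how the per-layer reconstruction losses are weighted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    """Toy crosscoder: one shared feature dictionary read from, and written to, several layers."""
    def __init__(self, d_model: int, n_features: int, n_layers: int):
        super().__init__()
        # The encoder reads a concatenation of activations from every tracked layer.
        self.encoder = nn.Linear(n_layers * d_model, n_features)
        # One decoder per layer: the same feature writes back a direction at each layer.
        self.decoders = nn.ModuleList(
            [nn.Linear(n_features, d_model) for _ in range(n_layers)]
        )

    def forward(self, layer_acts):
        # layer_acts: list of (batch, d_model) tensors, e.g. residual stream at layers 6, 8, 10
        features = F.relu(self.encoder(torch.cat(layer_acts, dim=-1)))
        reconstructions = [decoder(features) for decoder in self.decoders]
        return features, reconstructions
```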

Arshavir Blackwell, PhD

You're no longer asking "what does layer 8 compute?" but "how does this concept move through the model?"

Arshavir Blackwell, PhD

With transcoders, you trace connections layer by layer, stitching together a graph from local overlaps. With crosscoders, some of that stitching comes for free—the shared dictionary already encodes which features persist and which are transient. Early results suggest this can simplify circuit discovery further, though the technique is newer and less battle-tested than layer-local methods.

Arshavir Blackwell, PhD

The tradeoff is flexibility versus focus. Transcoders give you a precise, layer-specific lens—ideal when you want to understand exactly what an MLP contributes at one point in the computation. Crosscoders sacrifice some of that precision for continuity, which helps when you're chasing a concept that doesn't sit still. Neither replaces the other; they're different tools for different questions.

Arshavir Blackwell, PhD

In summary, imagine you're trying to understand how a factory works. A transcoder watches one workstation and sees exactly what it sends to the next workstation. A crosscoder watches several workstations at once and notices, "oh, they're all working on the same product, just at different stages."

Arshavir Blackwell, PhD

With transcoders and crosscoders, we're shifting from the snapshots of what's present that SAEs give us to blueprints of how computation unfolds. If we want to align or debug these systems, that's the territory we need to map. The next question is whether these techniques scale—whether the sparse circuits we find in GPT-2 have analogs in frontier models, or whether something fundamentally different emerges at scale.

Arshavir Blackwell, PhD

Thanks for listening. I'm Arshavir Blackwell, and I'll see you next time inside the black box.