Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Building a House Without Blueprints: When Interpretability Tools Work — and When They Don’t

This episode of Inside the Black Box: Cracking AI and Deep Learning explores a new theoretical framework that unifies sparse autoencoders (SAEs), transcoders, and crosscoders — and what it tells us about when mechanistic interpretability actually works. We start by demystifying these tools and how they use sparse features to uncover internal concepts and computations in large language models, from DNA detectors to deception circuits in Claude 3 Sonnet. Then we introduce the linear representation hypothesis and the geometry of concepts as directions in activation space, along with the challenge of superposition when thousands of concepts must fit into limited dimensions. Finally, we dive into Tang et al.’s recovery theorems, the compressed sensing roots of their approach, and why these results matter for using SAEs as a reliable “microscope” on model internals, especially in the context of fine-tuning and LoRA experiments. Along the way, we confront the uncomfortable possibility that the linear picture may break down at frontier scales — and what that would mean for the future of interpretability as a safety strategy.



Chapter 1

Arshavir Blackwell, PhD

Welcome back to Inside the Black Box. I’m Arshavir Blackwell. Today I want to talk about a paper that, for me at least, starts to turn interpretability from vibes into something closer to engineering. The image I keep coming back to is: right now we’re trying to build a house without blueprints. We have tools, we have parts of a plan, but not a clean specification of what we’re actually doing. Let me explain what I mean.

Arshavir Blackwell, PhD

Sparse autoencoders, transcoders, crosscoders — you’ve probably heard those names if you follow interpretability work. We use them to find “features” inside models, and sometimes to steer models by poking those features. But we usually don’t know when the features we find are the model’s real internal features, and when they’re just convenient artifacts of the method. Tang and collaborators, in 2025, set out to answer that in a unified way.

Arshavir Blackwell, PhD

Their question is pretty direct: suppose the model’s internal structure is, at least approximately, linear. Under what conditions do these sparse methods actually recover the model’s true linear features, instead of inventing something that just happens to work on the data? And they want to answer that not just for SAEs, but for transcoders and crosscoders too, in one common framework.

Arshavir Blackwell, PhD

Let me start with the most basic object: a sparse autoencoder, or SAE. You take some high-dimensional activation from a model — say an MLP layer, or the output of an attention head. You pass it through a hidden layer, usually wider than the activation itself, where you encourage most units to be zero on any given example. That sparsity is the bottleneck. Then you linearly decode back into the original activation space and try to reconstruct what you started from. There are plenty of diagrams of this online if you need a visual.
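
To make that concrete, here is a minimal sketch of that skeleton in PyTorch. The dimensions, penalty weight, and random stand-in "activations" are all illustrative placeholders, not values from any real model or from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: dense activation in, sparse code in the middle, linear decode back out."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # project into the (wider) dictionary
        self.decoder = nn.Linear(d_dict, d_model)   # linear map back to activation space

    def forward(self, x):
        z = torch.relu(self.encoder(x))  # ReLU keeps most units at exactly zero
        x_hat = self.decoder(z)          # reconstruction of the original activation
        return x_hat, z

# Toy usage: 512-dim activations, a 4096-unit dictionary, an L1 penalty to encourage sparsity.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
acts = torch.randn(64, 512)                        # stand-in for a batch of real activations
x_hat, z = sae(acts)
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * z.abs().mean()
loss.backward()
```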

Arshavir Blackwell, PhD

If training goes well, you end up with a “dictionary” of features — those sparse units — that can be linearly combined to approximate the model’s activations. Input and output are the same space; the only new thing is that interpretable middle layer. People then look at which inputs activate which SAE neurons and try to assign them meanings.

Arshavir Blackwell, PhD

Transcoders are architecturally very similar but wired differently. You still have a sparse middle layer, but you don’t reconstruct the same place you came from. Instead, you encode from a layer’s input and decode into that layer’s output. So the sparse features a transcoder finds aren’t representations sitting inside the layer — they’re the computations the layer performs. The sparse code is explicitly modeling a transformation: what did this layer do to its input?
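
In code, a transcoder can reuse essentially the same module; what changes is only what you feed it and what you ask it to reconstruct. A hedged sketch, reusing the SparseAutoencoder class from the previous snippet, with layer_in and layer_out as placeholders for the captured input and output of some MLP layer:

```python
# Same skeleton as the SAE above, but the target is the layer's *output*,
# not the tensor we encoded. layer_in / layer_out are stand-ins for captured activations.
transcoder = SparseAutoencoder(d_model=512, d_dict=4096)
layer_in = torch.randn(64, 512)    # activations entering the MLP layer
layer_out = torch.randn(64, 512)   # activations leaving that same layer
pred_out, z = transcoder(layer_in)
loss = ((pred_out - layer_out) ** 2).mean() + 1e-3 * z.abs().mean()
```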

Arshavir Blackwell, PhD

Crosscoders push this one step further. You again encode a dense activation into a sparse code, but now you can read from and write to multiple layers at once, or even across different models. So the output isn’t just “the same layer” or “another hidden layer” — it’s activations spanning multiple points in the network. You can think of crosscoder features as shared structure that persists across layers or across models.

Arshavir Blackwell, PhD

So all three methods share the same skeleton: dense in, sparse middle, linear out. Only the choice of input and output space changes. And once you have those sparse features, you can start interpreting and intervening on them.
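 
If you want to see that shared skeleton written down, here is a rough crosscoder-flavored sketch: one shared sparse code that is encoded from, and decoded back into, two layers at once. The two-site setup and all the dimensions are my own illustrative choices, not the architecture of any specific published crosscoder.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """One shared sparse code, read from and written to several sites (layers or models)."""
    def __init__(self, d_model: int, d_dict: int, n_sites: int):
        super().__init__()
        self.encoder = nn.Linear(d_model * n_sites, d_dict)
        self.decoders = nn.ModuleList([nn.Linear(d_dict, d_model) for _ in range(n_sites)])

    def forward(self, acts_per_site):                  # list of [batch, d_model] tensors
        z = torch.relu(self.encoder(torch.cat(acts_per_site, dim=-1)))
        return [dec(z) for dec in self.decoders], z

# Toy usage: shared features across two layers of the same model.
xc = Crosscoder(d_model=512, d_dict=8192, n_sites=2)
layer5, layer10 = torch.randn(64, 512), torch.randn(64, 512)   # stand-in activations
recons, z = xc([layer5, layer10])
loss = sum(((r - a) ** 2).mean() for r, a in zip(recons, [layer5, layer10])) + 1e-3 * z.abs().mean()
```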

Arshavir Blackwell, PhD

We’ve already seen some nice concrete examples. In Anthropic’s work, people found SAE features that fired on DNA sequences, on base64, or on Hebrew text. In their recent Claude 3 Sonnet work, they identified sparse features that lit up for deception, for security vulnerabilities in code, and for abstract concepts like inner conflict.

Arshavir Blackwell, PhD

Once those features are identified, you can dial them up or down and see the effect. Increase the activation of a “deception” feature and the model becomes more prone to deceptive completions; suppress it and that behavior decreases. Do the same for a “sycophancy” feature and the model becomes more or less agreeable.
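
Mechanically, the intervention itself is simple. Here is a sketch using the toy SAE from earlier: decompose the activation, scale one feature, and decode again. The feature index and scale are arbitrary placeholders, and a real experiment would patch the steered activation back into the model's forward pass at the right layer.

```python
# Feature steering with the toy SAE from earlier: decompose, edit one feature, re-decode.
feature_idx, scale = 123, 4.0          # placeholder feature index and strength
with torch.no_grad():
    z = torch.relu(sae.encoder(acts))  # sparse code for a batch of activations
    z[:, feature_idx] *= scale         # dial the chosen feature up (or down, with scale < 1)
    steered = sae.decoder(z)           # edited activation, to be patched back into the model
```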

Arshavir Blackwell, PhD

It’s worth pausing here to note what we don’t understand. We know how to do the intervention — break the activations apart, change a feature, put them back together. And we can see the output change. But the dozens of layers in between, how that modified signal actually propagates and transforms through the rest of the network, that’s mostly uncharted. We understand the first step and the last step. The middle is still a black box. Which is appropriate, given the name of this show.

Arshavir Blackwell, PhD

There’s also the classic GPT‑2 “indirect object identification” circuit. That’s the pattern where the model tracks who receives an action in sentences like “When Mary and John went to the store, John gave a drink to” — and correctly predicts Mary, not John. Researchers isolated the circuit responsible and showed that intervening on it can flip which name the model predicts.

Arshavir Blackwell, PhD

So we’ve got these really tangible knobs in a sparse space, and they seem to correspond to meaningful concepts or behaviors. Tang et al.’s paper asks the uncomfortable but necessary question: when are those knobs actually wired into the real machinery the model uses, and when are we just drawing attractive dials on a blank wall and convincing ourselves they’re connected?

Arshavir Blackwell, PhD

To even talk about “real features” here, we need a mental picture of how models represent information. One popular framing is the linear representation hypothesis. Roughly, it says that many behaviorally relevant concepts are encoded as directions in activation space.

Arshavir Blackwell, PhD

Imagine the activations at some layer as a big cloud of points in a high-dimensional space. Or, if that’s too abstract, think of an orchestra. Each concept the model knows is like an instrument. There’s one for plurality, one for sentiment, one for “is this about code.” Play the plurality instrument louder and cat becomes cats. Play the sentiment instrument and “this is bad” shifts toward “this is great.”

Arshavir Blackwell, PhD

This isn’t just a metaphor. Park and collaborators recently constructed concrete concept directions and showed that if you intervene along those directions, you can reliably change model behavior in the way you’d expect. So in those settings, “concept as direction” isn’t just a story, it’s a handle you can actually grab. Or an instrument you can actually play.
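
One simple way to get such a handle, as a sketch rather than Park et al.'s exact construction, is to take a difference of mean activations between examples that do and don't express the concept, and then nudge activations along that direction. The tensors and the strength below are placeholders.

```python
# Estimate a concept direction as a difference of mean activations, then push along it.
pos_acts = torch.randn(200, 512)   # stand-in activations where the concept is present
neg_acts = torch.randn(200, 512)   # stand-in activations where it is absent
direction = pos_acts.mean(0) - neg_acts.mean(0)
direction = direction / direction.norm()

alpha = 5.0                              # intervention strength (illustrative)
steered_acts = acts + alpha * direction  # nudge the earlier batch of activations toward the concept
```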

Arshavir Blackwell, PhD

But it’s not as simple as one clean axis, or one instrument, per concept, or one dedicated neuron per feature. Elhage et al. introduced the idea of superposition to describe how models pack many more features than they have dimensions. Features share dimensions; they overlap. A single neuron might contribute to several concepts; a single concept might be distributed across many neurons. Sorry, but that's just the way it is.

Arshavir Blackwell, PhD

This is efficient — you can fit lots of linear features into a relatively small space — but it creates interference. If two concepts share similar directions, pushing on one tends to push on the other a bit too. That overlap can show up as characteristic errors: the model occasionally confuses entities, or misfires on rare patterns, not because it never learned the right feature, but because the right feature is entangled with others. So this may be one source of the errors we encounter with these models.
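
You can see the cost of superposition in a few lines: pack far more unit-norm feature directions than you have dimensions, switch on a handful of them, and read every feature back out. The numbers here are arbitrary; the point is the nonzero readout on features that were never turned on.

```python
import torch

# Toy superposition: more unit-norm features than dimensions, so readouts interfere.
d, n_features = 64, 512
W = torch.nn.functional.normalize(torch.randn(n_features, d), dim=1)  # feature directions
x = W[0] + W[1] + W[2]                 # an activation where only features 0-2 are "on"
readout = W @ x                        # dot-product readout for every feature
print(readout[:3])                     # the truly active features read close to 1
print(readout[3:].abs().max())         # inactive features still read clearly nonzero: interference
```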

Arshavir Blackwell, PhD

Tang and colleagues take this geometric picture — linear features, living in superposition — as their starting assumption. Then they ask: if that’s what the model is doing, and you train an SAE or a transcoder or a crosscoder, when do you actually recover the underlying features? And when do you end up with new features that merely reconstruct the observed activations but don’t line up with any single “true” concept?

Arshavir Blackwell, PhD

Their recovery results say, informally, that sparse methods can recover the ground‑truth linear features when a few things are true. First, the true features themselves are sparse in usage: for any given input, only a small subset of features is active. Second, your dictionary — the number of sparse units you allow in the autoencoder — is large enough relative to the actual number of underlying features. Third, those features aren’t too similar to each other; in other words, they have low coherence.
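
Two of those quantities are things you can at least eyeball on a trained SAE: how many features fire per example, and how similar the decoder columns are to one another. A rough diagnostic sketch, again on the toy SAE from earlier, with no claim that these particular numbers mean anything for a randomly initialized model:

```python
# Rough diagnostics for the recovery conditions, on the toy SAE from earlier.
with torch.no_grad():
    z = torch.relu(sae.encoder(acts))
    avg_active = (z > 0).float().sum(dim=1).mean()                # sparsity of usage (L0 per example)
    D = torch.nn.functional.normalize(sae.decoder.weight, dim=0)  # unit-norm dictionary columns
    gram = D.T @ D
    coherence = (gram - torch.eye(gram.shape[0])).abs().max()     # mutual coherence of the dictionary
print(f"avg active features: {avg_active:.1f}, mutual coherence: {coherence:.3f}")
```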

Arshavir Blackwell, PhD

If those conditions hold, then in principle there is a unique sparse explanation of the data, and SAEs, transcoders, and crosscoders are capable of recovering it. But there’s a second layer to the story, which is optimization. Just because a unique, correct sparse code exists doesn’t mean gradient descent will find it. This is sometimes called the learnability problem.

Arshavir Blackwell, PhD

In some regimes, especially when coherence is higher or the dictionary is constrained, training can converge to deceptively good but wrong solutions: sets of sparse features that reconstruct your activations extremely well, but each feature is actually a weird linear mash‑up of multiple underlying concepts. They’re “good” in terms of loss, but they’re not faithful in the sense of matching the model’s internal basis.

Arshavir Blackwell, PhD

So Tang et al. are giving us two kinds of conditions. One is information‑theoretic: is there enough signal, and is the world sparse and incoherent enough, that truthful recovery is even identifiable? The other is about optimization: with the actual training procedure we use, do we reach that truthful code, or do we stop at a convenient illusion that looks accurate on the surface?

Arshavir Blackwell, PhD

If you have a background in signal processing, a lot of this probably sounds familiar. And that’s very much on purpose. Tang and coauthors are explicitly drawing from compressed sensing and classical sparse coding — work by Donoho, Candès, Tao, and others.

Arshavir Blackwell, PhD

In compressed sensing, the standard problem is this: you have a high‑dimensional signal that you believe is sparse in some basis, and you only get a small number of linear measurements. You want to know when you can recover the original sparse signal by solving an optimization problem that prefers sparse explanations. The guarantees there depend on sparsity levels, the size of the dictionary, and properties like incoherence or restricted isometry — conditions that basically say different columns of your measurement matrix shouldn’t be too similar.
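
Here is that classic setup as a small sketch, using orthogonal matching pursuit as the sparse solver. The dictionary is random Gaussian, which is about as incoherent as you can ask for, and the sizes are arbitrary; in this regime the sparse signal is typically recovered essentially exactly.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_meas, n_atoms, k = 100, 400, 5                 # measurements, dictionary size, true sparsity
A = rng.standard_normal((n_meas, n_atoms)) / np.sqrt(n_meas)   # random (incoherent) dictionary
x_true = np.zeros(n_atoms)
x_true[rng.choice(n_atoms, size=k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                   # the observed dense "measurements"

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
print("max recovery error:", np.abs(omp.coef_ - x_true).max())  # tiny when conditions hold
```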

Arshavir Blackwell, PhD

Tang et al. transfer that toolbox into interpretability. The “sensing matrix” becomes the learned dictionary in your SAE, transcoder, or crosscoder. The sparse coefficients are which features are active on each example. And the “measurements” are the model activations you observe — or, in the crosscoder case, the activations across multiple layers or models.

Arshavir Blackwell, PhD

Under analogous sparsity and incoherence conditions, the same style of recovery guarantees apply: if the underlying world is linear enough and sparse enough, and your learned dictionary behaves well, then the sparse features you discover really do align with the model’s internal linear structure. That’s what I meant earlier by blueprints — it’s a first principled description of when our tools are actually measuring what we think they are.

Arshavir Blackwell, PhD

This matters a lot for how we use SAEs in practice. A common framing now is “SAEs as a microscope.” You take a base language model, you train some sparse features on its activations, and then you start doing experiments: how do these features move when I fine‑tune the model, or when I modify part of the architecture? You’re implicitly treating the SAE neurons as readable coordinates on the model’s internal state.

Arshavir Blackwell, PhD

In my own work, I’m really interested in using this for LoRA‑style fine‑tuning. For example, you can take a model and train one adapter that only touches attention layers, and another that only touches MLP layers, both aimed at, say, making the model more robust to some adversarial behavior. Then you train SAEs on the relevant layers and compare: which sparse features changed under attention‑only LoRA, which changed under MLP‑only?
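
As a sketch of that comparison, suppose you have captured activations for the same prompts under the base model and under one of the adapters, and you have an SAE trained on the base model; then you can ask which features changed how often they fire. The tensors below are placeholders for real captured activations, and firing rate is just one of several statistics you might compare.

```python
# Compare how often each SAE feature fires under the base model vs. an adapted model,
# on the same prompts. All tensors here are stand-ins for real captured activations.
base_acts = torch.randn(1024, 512)     # base model activations at the probed layer
lora_acts = torch.randn(1024, 512)     # same prompts under, say, an MLP-only LoRA

with torch.no_grad():
    base_freq = (torch.relu(sae.encoder(base_acts)) > 0).float().mean(0)
    lora_freq = (torch.relu(sae.encoder(lora_acts)) > 0).float().mean(0)

shift = lora_freq - base_freq
top = shift.abs().topk(10).indices     # features whose firing rate moved the most
print(top, shift[top])
```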

Arshavir Blackwell, PhD

The hope is that you can say things like, “this safety‑relevant feature got stronger in the MLP but not in attention,” or “this potential deception feature only moved when we touched attention.” That’s a very appealing way to talk about the internal effects of fine‑tuning. But all of that assumes the microscope is approximately faithful. If a neuron in the SAE spikes on “deception,” you are treating that as evidence that some real deception‑related direction inside the model became more or less prominent.

Arshavir Blackwell, PhD

Tang et al.’s framework gives conditions under which that story is justified, and also highlights when it might fail. If superposition is extreme, or features are highly coherent, or optimization gets stuck in a bad basin, your SAE may still reconstruct activations well but its features may not correspond cleanly to any underlying concept. In that case, comparing feature activations before and after LoRA could mislead you about what actually changed.

Arshavir Blackwell, PhD

And we’re still only talking about the linear slice of the problem. These results are, in a sense, the first blueprints for linear interpretability. They naturally raise questions about what happens outside that regime. What about genuinely non‑linear features — cases where “concept as direction” is just a bad model? What about frontier‑scale models where superposition may be very dense, or mixture‑of‑experts architectures where different experts share parameters but specialize on different inputs?

Arshavir Blackwell, PhD

In those settings, the guarantees from compressed sensing may just stop applying. That’s important not only for science, but also for safety guarantees that say, “we’ll look inside the box, identify the dangerous features, and turn them down.” If our techniques only have solid guarantees in relatively linear, low‑interference regimes, we need to be clear about that when we reason about risk and control.

Arshavir Blackwell, PhD

So we’re still building the house, but now we at least have early blueprints that say: if you use these sparse methods under these conditions, you’re probably measuring something real; outside that, you might not be. In future episodes I’ll dig more into how these ideas play out in concrete experiments. For now, I’m Arshavir Blackwell, and this has been Inside the Black Box.