Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education

Cracking Open the Black Box

We opened the 65%. The features that resisted interpretation one at a time turned out to organize into five co-activation clusters with clear thematic identities and causal effects nearly ten times stronger than any individual feature. Second in a series with John Holman.



Chapter 1


Arshavir Blackwell, PhD

In our first piece, we reported a frustrating result. We fine-tuned a language model on the writings of Marcus Aurelius and used sparse autoencoders to figure out what changed inside it. The features that looked interpretable — Stoic vocabulary, philosophical method, contraction avoidance — turned out to be mostly decorative. Thirteen out of fifteen were causally inert. They represented real concepts, but the model wasn't actually relying on them, at least not that we could see.

Arshavir Blackwell, PhD

The real adaptation, the stuff doing the heavy lifting, lived in the other sixty-five percent. A distributed reweighting of shared features that resisted interpretation at the individual level. We called it a wall.

Arshavir Blackwell, PhD

It turns out it was a door.

Arshavir Blackwell, PhD

This is Inside the Black Box. This edition is brought to you by Arshavir Blackwell and John Holman. Second in a series.

Arshavir Blackwell, PhD

The insight behind this piece is simple. If two features consistently fire on the same passages of text, they're probably doing related work, even if you can't read either one on its own. A feature that lights up on Marcus's passages about mortality and a feature that lights up on his passages about cosmic order — each one might be individually opaque. But if they always fire together, they form a functional unit.

Arshavir Blackwell, PhD

So we took all 204 shared features active at Layer 22, looked at which of the 779 evaluation chunks each one fired on, and clustered them by co-activation pattern using k-means. Importantly, the clustering used only activation patterns. No semantic labels, no information about what the text actually said. Whatever thematic coherence showed up in the clusters would be a finding, not something we put in.
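In code, that step comes down to something like the sketch below. It's a minimal illustration with placeholder data, not our actual pipeline: `acts` stands in for the real SAE activations, and the binarization threshold is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the real data: SAE activations of 204 shared
# features over 779 evaluation chunks (rows = features).
rng = np.random.default_rng(0)
acts = rng.random((204, 779)) * (rng.random((204, 779)) < 0.1)

# Binarize to co-activation patterns: which chunks does each
# feature fire on? No semantic labels go in -- only firing patterns.
fired = (acts > 0).astype(float)

# Cluster features by how their firing overlaps across chunks.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(fired)
labels = kmeans.labels_  # one cluster assignment per feature
```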

Arshavir Blackwell, PhD

We tested three, five, eight, and twelve clusters. Five gave the best separation. And what came out was striking.
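We don't name the separation metric on air; silhouette score is one standard choice, and continuing the sketch above, the sweep would look roughly like this:

```python
from sklearn.metrics import silhouette_score

# Sweep candidate cluster counts; silhouette score (an assumption
# here, not necessarily our exact metric) rewards clusters that are
# internally tight and well separated from each other.
for k in (3, 5, 8, 12):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(fired)
    print(k, round(silhouette_score(fired, labels_k), 3))
```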

Arshavir Blackwell, PhD

We have five clusters. Five groups of features that fire together. And when we patched them as groups — zeroed them all out at once and measured the effect — three of the five hit harder than the best individual feature from the first study. And the strongest cluster, Cluster 4, hit nearly ten times harder.

Arshavir Blackwell, PhD

Let me put a number on that. The best single feature we found in the first study shifted the model's prediction loss by 0.219 when we removed it. Cluster 4, a group of 54 individually unremarkable features, shifted it by 2.127. Nearly ten times the disruption. And the passages it fires on are exactly what you'd expect from a Marcus Aurelius fine-tune: archaic Stoic reasoning about virtue, obligation, and the nature of the universe.
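The patching itself is the standard ablate-and-measure loop. Here's a hedged sketch: `model`, `sae`, and the hook point are placeholders, and the real pipeline (for instance, how it handles the SAE's reconstruction error) may differ.

```python
import torch

def cluster_ablation_delta(model, sae, layer_module, feature_ids, batch):
    """Zero one cluster's SAE features at a layer; return the shift in
    next-token loss. Illustrative only: `model`, `sae`, and the hook
    point stand in for the real objects."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        feats = sae.encode(hidden)       # hidden state -> SAE features
        feats[..., feature_ids] = 0.0    # ablate the whole cluster at once
        patched = sae.decode(feats)      # back to the residual stream
        return (patched,) + output[1:] if isinstance(output, tuple) else patched

    with torch.no_grad():
        base = model(**batch, labels=batch["input_ids"]).loss.item()
        handle = layer_module.register_forward_hook(hook)
        try:
            ablated = model(**batch, labels=batch["input_ids"]).loss.item()
        finally:
            handle.remove()
    # ~0.219 for the best single feature; ~2.127 for Cluster 4's 54 features
    return ablated - base
```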

Arshavir Blackwell, PhD

The sixty-five percent was never empty. It was never noise. We were just looking at the wrong scale.

Arshavir Blackwell, PhD

But the clusters aren't all the same. And this is where the story gets interesting.

Arshavir Blackwell, PhD

We ran principal component analysis on each cluster to ask: how many independent signals does this group actually contain? A cluster of 54 features could contain 54 independent signals, or it could be a handful of signals with a lot of redundancy.

Arshavir Blackwell, PhD

The answer split cleanly into two regimes.

Arshavir Blackwell, PhD

Cluster 0 — the smallest active cluster, just 14 features — is what we call a tight circuit. Only three principal components explain 95 percent of its variance. The first component alone carries 81 percent. These features are moving in near-lockstep. And the passages they fire on are about Stoic duty, industriousness, and honorable living. A narrow, specific philosophical theme encoded by a narrow, specific circuit.

Arshavir Blackwell, PhD

The other three active clusters — C1, C3, and C4 — are loose coalitions. They need 27 to 52 principal components to hit 95 percent. No single component carries more than 27 percent. Each feature is contributing something slightly different. The clusters have coherence, they fire on the same kinds of text, but they're high-dimensional objects. Lots of semi-independent contributors working in parallel.
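Those component counts come straight out of per-cluster PCA. Continuing the earlier sketch (placeholder data, illustrative code):

```python
from sklearn.decomposition import PCA

# For each cluster: how many principal components does it take to
# explain 95% of the variance in the members' activation patterns?
for c in range(5):
    members = acts[labels == c]        # (features_in_cluster, 779)
    if members.shape[0] < 2:
        continue
    pca = PCA().fit(members.T)         # chunks as samples, features as dims
    cum = np.cumsum(pca.explained_variance_ratio_)
    n95 = int(np.searchsorted(cum, 0.95)) + 1
    print(f"cluster {c}: {n95} PCs for 95% variance, "
          f"first PC carries {pca.explained_variance_ratio_[0]:.0%}")
```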

Arshavir Blackwell, PhD

We have uncovered two encoding strategies. The model encodes a narrow, recurring theme like duty as a tight circuit. It encodes broad philosophical reasoning — how to live, what virtue means, the nature of the cosmos — as a loose coalition where many features contribute independently.

Arshavir Blackwell, PhD

The encoding strategy matches the content structure.

Arshavir Blackwell, PhD

Then there's Cluster 2. Ten features, eleven chunks, and the highest average activation strength of any cluster. These features fire the loudest.

Arshavir Blackwell, PhD

And they don't matter.

Arshavir Blackwell, PhD

When we zero them out, the model's predictions barely change. Peak causal delta: 0.082. Compare that to Cluster 4's 2.127. The quietest cluster drives the biggest changes. The loudest cluster drives almost nothing.

Arshavir Blackwell, PhD

This is the same pattern we saw with individual features in the first study. Feature F10347 had perfect specificity, clean interpretability, and zero causal impact. Cluster 2 is that phenomenon scaled up to the group level. Representation without function isn't just a property of individual features. It's a property of feature groups.

Arshavir Blackwell, PhD

And to make sure this wasn't just a quirk of those two clusters, we tested whether activation strength predicts causal importance systematically, across all chunks in all five clusters. It doesn't. The correlation between how loudly a cluster fires on a given chunk and how much the model's output changes when you remove it on that chunk is essentially zero. For the largest cluster, it's actually slightly negative — the louder it fires, the less it matters.
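Concretely, that test is a per-chunk correlation. A sketch, with a stand-in array where the measured ablation deltas would go:

```python
from scipy.stats import pearsonr

# Does a cluster's firing strength on a chunk predict how much
# ablating it hurts on that chunk? (`deltas` is a placeholder for
# the measured per-chunk ablation effects.)
for c in range(5):
    strength = acts[labels == c].sum(axis=0)  # total activation per chunk
    deltas = rng.random(779)                  # stand-in for real measurements
    r, p = pearsonr(strength, deltas)
    print(f"cluster {c}: r = {r:.3f}")  # we find r ~ 0 across clusters
```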

Arshavir Blackwell, PhD

You cannot tell what's important by looking at what's loudest. The only way to know if something matters is to turn it off and see what breaks.

Arshavir Blackwell, PhD

Three takeaways.

Arshavir Blackwell, PhD

First: the results suggest that the right unit of analysis is the group, not the feature. Individual shared features resist interpretation because they're components of a distributed computation. Clustering by co-activation recovers the structure that individual analysis misses. The causal power was always there. It was just invisible at the wrong resolution.

Arshavir Blackwell, PhD

Second: the model uses two encoding strategies, and the choice is content-dependent. Tight circuits for narrow themes, loose coalitions for broad reasoning. This constrains how interpretability tools should be designed. Any method that assumes features operate independently will miss the coalitions. Any method that assumes all circuits are tight will miss the high-dimensional structure where most of the philosophical content lives.

Arshavir Blackwell, PhD

Third: activation strength remains a poor proxy for causal importance. This was true for individual features and it's true for feature groups. Until the field develops better proxies, activation patching — or something comparably causal — remains the only reliable test.

Arshavir Blackwell, PhD

A few honest caveats. Everything in this study comes from a single model fine-tuned on a single corpus, analyzed at a single layer. Preliminary results at Layer 16 suggest the co-activation structure holds, but the full analysis is forthcoming. And before any broader conclusions can be drawn, we need to run similar studies with different stimulus materials. We're in the process of doing exactly that.

Arshavir Blackwell, PhD

The clustering method itself is deliberately simple — k-means on a co-activation matrix. It worked here. Whether it generalizes, and whether more sophisticated methods would find finer structure, are open questions.

Arshavir Blackwell, PhD

And we patched entire clusters simultaneously. We know Cluster 4's 54 features collectively matter. We don't yet know if all 54 carry the load or if a subset does the work while the rest come along for the ride. Finding the minimum coalition that preserves the causal effect is the next experiment.
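One way to run that experiment is a greedy search: grow a subset until ablating just the subset recovers most of the full cluster's effect. A sketch, where `delta_fn` is assumed to wrap the patching loop above:

```python
def greedy_min_coalition(feature_ids, delta_fn, target_frac=0.9):
    """Greedily grow a subset of a cluster's features until ablating
    only the subset recovers `target_frac` of the full-cluster effect.
    `delta_fn(subset)` is a hypothetical wrapper around the patching code."""
    full_effect = delta_fn(list(feature_ids))
    chosen, remaining = [], list(feature_ids)
    while remaining and delta_fn(chosen) < target_frac * full_effect:
        # Add whichever remaining feature boosts the ablation effect most.
        best = max(remaining, key=lambda f: delta_fn(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```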

Arshavir Blackwell, PhD

Our first article ended with sixty-five percent of the signal locked behind a door we couldn't open. This study opens it and finds structure. Five clusters, two encoding regimes, and causal effects that dwarf anything individual features produce.

Arshavir Blackwell, PhD

The model's philosophical adaptation doesn't live in clean, nameable features. It lives in coordinated patterns across dozens of features that each contribute a small, individually unremarkable piece.

Arshavir Blackwell, PhD

The interpretability field has spent the last three years getting good at finding individual features. The next problem is finding the groups, and building tools that treat the group, not the feature, as the primary unit of analysis.

Arshavir Blackwell, PhD

The clusters are where the music is. The individual features were only ever the instruments. I'm Arshavir Blackwell, and this has been Inside the Black Box.