What Seneca Teaches Us that Marcus Couldn't
716 features fire on both Seneca and Marcus Aurelius but stay dark for ad copy. The model learned Stoic philosophy, not just an author's style. Plus: why 'inert' features aren't all the same thing.
Arshavir Blackwell, PhD
I'm Arshavir Blackwell and this is "Inside the Black Box." Today we continue our report on work done with John Holman.
Arshavir Blackwell, PhD
Today's article follows up on last week's, on what mechanistic interpretability can tell us about fine-tuning. If you're new here, the short version is this: we've been using sparse autoencoders to look inside language models after they've been fine-tuned with LoRA, and we've been finding things that nobody expected. Consistent structural patterns. Features that organize into clusters. Clusters that have causal effects on output. And clusters that don't.
Arshavir Blackwell, PhD
In our previous article, we replicated everything we found with Marcus Aurelius on two new corpora: Seneca's Moral Letters, and a corpus of synthetic ad copy. Every structural finding held. Two encoding regimes, cluster-level causal effects, inert clusters, cross-reconstruction gaps. All of it replicated.
Arshavir Blackwell, PhD
That was the replication story. Today is about what comes after replication. Two findings that only became possible because we had a second philosophical corpus to compare against the first.
Arshavir Blackwell, PhD
Finding one. Seven hundred and sixteen features that encode Stoic philosophy.
Arshavir Blackwell, PhD
Here's what we did. We took all three corpora — Marcus, Seneca, and ad copy — and ran them through the same lens: the Marcus LoRA and the Marcus SAE. This gives every text the same feature space, which makes direct comparison possible.
Arshavir Blackwell, PhD
The result: 716 features fire on both Marcus and Seneca, but do not fire on ad copy.
Arshavir Blackwell, PhD
Think about what that means. These features don't care which Stoic wrote the text. Marcus wrote private diary entries to himself. Seneca wrote formal letters to a student. Different authors, different formats, separated by decades. But the model lights up the same features for both of them. And those features stay dark for ad copy.
Arshavir Blackwell, PhD
The model didn't just learn Marcus Aurelius. It learned Stoicism.
Arshavir Blackwell, PhD
Let me give you some numbers for context. Out of 12,000 features in the SAE, almost all of them — 11,947 — fire on both Marcus and Seneca. That sounds as though the two corpora are nearly identical, but that's not what it means. Sparse autoencoders are sparse per token, not per corpus. At any given position in the text, only a handful of features activate. But across thousands of chunks of text, almost every feature finds at least one position where it turns on. So high overlap at the corpus level is expected.
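The per-token versus per-corpus distinction is easy to see with a toy simulation. All numbers here are illustrative stand-ins, not our actual activations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_tokens = 12_000, 1_000
p_fire = 30 / n_features  # assume ~30 of 12,000 features active at any one token

# Toy SAE activations: which features fire at each sampled token position.
active = rng.random((n_tokens, n_features)) < p_fire

per_token = active.sum(axis=1).mean()    # features active at a single position
ever_active = active.any(axis=0).mean()  # fraction that fire somewhere in the corpus

print(f"mean active features per token: {per_token:.0f} of {n_features}")
print(f"fraction ever active across {n_tokens} tokens: {ever_active:.2f}")
```

Even with only a few dozen features firing at any one position, nearly every feature fires somewhere across a thousand positions, which is why corpus-level overlap by itself says very little.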
Arshavir Blackwell, PhD
The interesting number is what happens when you subtract the features that also fire on ad copy. 11,231 features fire on all three corpora. Those are general-purpose, content-independent features. Subtract those, and you're left with 716 that fire on Marcus and Seneca only. The Stoic-specific set.
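The subtraction step is just set arithmetic over per-corpus activation masks. A minimal sketch with invented firing probabilities; the step that builds these masks from real SAE activations is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 12_000

# One boolean per feature per corpus: True if the feature fires on at
# least one token anywhere in that corpus. Probabilities are made up.
marcus = rng.random(n_features) < 0.999
seneca = rng.random(n_features) < 0.999
ad_copy = rng.random(n_features) < 0.94

both_stoic = marcus & seneca        # fires on both philosophical corpora
all_three = both_stoic & ad_copy    # general-purpose, content-independent
stoic_only = both_stoic & ~ad_copy  # the Stoic-specific set

print(both_stoic.sum(), all_three.sum(), stoic_only.sum())
```

The identity to keep in mind: the Stoic-specific count is exactly the both-Stoic count minus the all-three count.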
Arshavir Blackwell, PhD
And here's the part that surprised us. Only one of those 716 had been previously identified as Marcus-specific. We had originally classified features based on how they behaved on Marcus text alone. Almost none of the Stoic features showed up that way. They were hiding in the shared feature pool — features that look generic until you test them against a second philosophical corpus and a non-philosophical control. You need that three-way contrast to reveal them.
Arshavir Blackwell, PhD
Finding two. Not all inert clusters are the same.
Arshavir Blackwell, PhD
In every corpus we've tested, we've found clusters that exist in the model's internal organization but don't affect its output. Remove them, and the model's predictions don't change. We've called these inert clusters and we've been reporting them as a consistent curiosity.
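Mechanically, "removing" a cluster means zeroing its features in the SAE's code before reconstructing the activation the model actually sees. A hedged sketch with made-up feature indices and random weights, not our real SAE:

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, d_model = 12_000, 64

decoder = rng.normal(size=(n_features, d_model))     # toy SAE decoder matrix
cluster = [3, 41, 77, 105, 200, 311, 412, 555, 904]  # hypothetical 9-feature cluster

# SAE code at one token position: mostly zeros, a few active features.
acts = np.zeros(n_features)
acts[[3, 41, 77]] = [0.5, 1.2, 0.8]  # some cluster features happen to fire here
acts[[9_000, 10_500]] = [2.0, 1.1]   # plus some unrelated features

full = acts @ decoder        # ordinary reconstruction
ablated_acts = acts.copy()
ablated_acts[cluster] = 0.0  # ablate: zero the cluster's features
ablated = ablated_acts @ decoder

delta = np.linalg.norm(full - ablated)
print(f"reconstruction change from ablating the cluster: {delta:.3f}")
```

The causal question is then whether feeding the ablated reconstruction back into the model changes its next-token predictions. For an inert cluster, it doesn't.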
Arshavir Blackwell, PhD
Seneca has one. Cluster C1, nine features. Mean causal delta of essentially zero. By the basic measure, it looks like the same thing we saw in Marcus. But when we looked deeper — using tools suggested by reader Michael Jovanovich — we found something completely different underneath.
Arshavir Blackwell, PhD
First, we measured KL divergence: how much does the model's probability distribution over all possible next words change when we remove this cluster? We now have this measurement for all three corpora, and the answer across the board is: barely at all. The distributions are nearly identical with or without the inert cluster.
Arshavir Blackwell, PhD
There's a secondary measure called the argmax change rate. At each position in the text, the model ranks every possible next word. When we remove the inert cluster, the top-ranked word changes at about one in five positions. That sounds like the cluster is doing something. It's not.
Arshavir Blackwell, PhD
Think of a horse race where two horses are neck and neck, separated by a nose. A gust of wind could change which one crosses the finish line first. But that doesn't mean the wind decided the race. It means the race was already too close to call. At about one in five positions, the model's top two word candidates are almost exactly tied. Removing the inert cluster is the gust of wind. It swaps which word is technically ranked first, but the probabilities barely change. And because language models sample from the full probability distribution rather than always picking the top word, the argmax flip is happening entirely inside the noise the model already has.
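You can see the near-tie mechanism directly: inject ties into toy logits, apply a small perturbation standing in for the ablation, and check where the argmax flips. Everything here is synthetic:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
n_pos, vocab = 2_000, 100
logits = rng.normal(size=(n_pos, vocab))

# At ~20% of positions, make the runner-up nearly tie with the top word.
tied = rng.random(n_pos) < 0.2
top = logits.argmax(axis=-1)
logits[tied, (top[tied] + 1) % vocab] = logits[tied, top[tied]] - 1e-4

# A tiny perturbation stands in for ablating the inert cluster.
perturbed = logits + rng.normal(scale=1e-3, size=logits.shape)

flips = logits.argmax(-1) != perturbed.argmax(-1)
probs = np.sort(softmax(logits), axis=-1)
margin = probs[:, -1] - probs[:, -2]  # gap between the top two candidates

print(f"argmax flip rate: {flips.mean():.2f}")
print(f"median top-2 margin where argmax flipped: {np.median(margin[flips]):.1e}")
```

The flips concentrate almost entirely at the positions where the top two candidates were already tied, which is the horse-race point in code form.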
Arshavir Blackwell, PhD
So by every measure, the inert cluster is inert. Now the question is: what happened to it during training?
Arshavir Blackwell, PhD
This is where it gets interesting. The Marcus inert cluster showed a flatline. All nine features sat at or near zero activation throughout the entire training run. The LoRA never engaged with them. They're furniture in a room nobody enters.
Arshavir Blackwell, PhD
Seneca's inert cluster is completely different.
Arshavir Blackwell, PhD
Our first pass at logging used twenty randomly selected probe chunks, and C1 showed zero activation across all of them. We almost wrote it off as the same thing we saw in Marcus. But C1 only fires on about five out of 1,139 chunks. Random probes were almost certain to miss it. When we went back and probed the specific chunks where C1 actually fires — including a meta-philosophical passage about the nature of philosophical method itself — the picture changed completely.
Arshavir Blackwell, PhD
Eight of nine features were uniformly amplified across the full training run. The cluster-level trajectory is a smooth ramp from 1.030 to 1.289, with a coefficient of variation of just 0.020. That's more stable than Seneca's workhorse cluster — the one that actually drives output changes. The model is steadily and deliberately strengthening these features from start to finish.
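Coefficient of variation is just standard deviation divided by the mean, a scale-free measure of how noisy a trajectory is. A quick illustration on hypothetical checkpoint values (illustrative numbers, not the real logged activations, so the exact CVs here won't match the 0.020 figure):

```python
import numpy as np

def coeff_of_variation(x):
    # Scale-free noisiness: standard deviation relative to the mean.
    return x.std() / x.mean()

rng = np.random.default_rng(5)

# Hypothetical cluster-mean activations at 20 training checkpoints.
smooth_ramp = np.linspace(1.0, 1.1, 20)                # steady amplification
noisy = smooth_ramp + rng.normal(scale=0.15, size=20)  # same trend plus noise

print(f"smooth ramp CV: {coeff_of_variation(smooth_ramp):.3f}")
print(f"noisy ramp  CV: {coeff_of_variation(noisy):.3f}")
```

A low CV is what distinguishes a deliberate-looking ramp from a trajectory that merely drifts upward through noise.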
Arshavir Blackwell, PhD
And none of it affects output. Remove the cluster. Nothing changes.
Arshavir Blackwell, PhD
We also looked at what these features point toward in vocabulary space. The logit lens lets you ask: if this feature were the only thing determining the next word, what would it predict? The answer for all nine features is gibberish. One feature's top tokens are "oin, cow, moduleName, owed, pencil." Another's: "ovo, aos, ords." No philosophical vocabulary. No thematic signal. No connection to anything interpretable.
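The logit-lens readout is a single matrix product: project the feature's decoder direction through the model's unembedding and rank tokens. A toy sketch with random weights and a fake vocabulary; real models expose these matrices under different names:

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, vocab_size, k = 64, 200, 5

decoder_dir = rng.normal(size=d_model)            # one SAE feature's write direction
unembed = rng.normal(size=(d_model, vocab_size))  # toy unembedding matrix
vocab = [f"tok_{i}" for i in range(vocab_size)]   # fake vocabulary

# If this feature alone wrote to the residual stream, which tokens
# would it push up the most?
logits = decoder_dir @ unembed
top_idx = np.argsort(logits)[::-1][:k]
print([vocab[i] for i in top_idx])
```

For the nine C1 features, the real top tokens that come out of this procedure are the gibberish strings quoted above.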
Arshavir Blackwell, PhD
So the model spends the entire training run steadily amplifying features that don't affect predictions and don't point toward meaningful vocabulary. Why?
Arshavir Blackwell, PhD
We have three possible explanations. None of them is fully satisfying.
Arshavir Blackwell, PhD
The first: these features respond to very rare, structurally distinctive content — meta-philosophical passages about philosophy itself, as opposed to ordinary philosophical argument. The model might be encoding something about this content that our evaluation methods can't detect yet. We test individual feature projections and next-token prediction. Maybe C1 does something that only shows up at a larger scale or in a context we haven't measured.
Arshavir Blackwell, PhD
The second: gradient descent is blind in the same way natural selection is blind. It follows the loss gradient downhill. If amplifying a feature reduces the training loss even slightly on those five rare chunks, it will keep amplifying it. It doesn't check whether the feature does anything useful at inference time. It only knows the loss went down. These features might be the optimization equivalent of a vestigial structure: the gradient found them, amplified them, and there was no pressure to stop because they don't hurt anything. Systematic doesn't mean purposeful.
Arshavir Blackwell, PhD
The third is the simplest: noise. Training can produce patterns that look deliberate but aren't. We can't rule that out. But a coefficient of variation of 0.020 makes this explanation uncomfortable. Noise doesn't usually look this clean.
Arshavir Blackwell, PhD
So what does it all mean?
Arshavir Blackwell, PhD
The 716 Stoic features tell us that the model doesn't just learn an author's style. It learns the intellectual tradition. When we gave it Seneca — a different author, different format, different century — the model activated nearly the same representational space it built for Marcus. The philosophical content, not the stylistic surface, is what drives the representation. LoRAs trained on intellectual content may be learning something closer to concepts than to patterns.
Arshavir Blackwell, PhD
The inert cluster taxonomy suggests that "representation without function" is not one thing. There are at least two types. Dead weight: features that training never touched. And deliberate amplification without function: features the model actively builds across the entire training run, producing nothing detectable at the end. Both are inert. But the stories underneath are completely different.
Arshavir Blackwell, PhD
The Seneca case is the puzzling one. Why invest the full training run amplifying features on content that appears in five of 1,139 chunks, when none of it shows up in predictions? We don't know yet. But the question matters, because it tells us that inert doesn't mean neglected. The model can be actively building structure that has no detectable function. And that should make anyone studying neural network internals cautious about assuming that unused structure is meaningless.
Arshavir Blackwell, PhD
One important caveat. Everything in this analysis uses one SAE trained on Marcus LoRA activations. The entire feature dictionary is Marcus-derived. When we run Seneca through it, we're seeing Seneca through a Marcus-shaped lens. That's why we find zero Seneca-only features — not because they don't exist, but because the SAE doesn't have the vocabulary to express them. The next step is a joint SAE: train one dictionary on pooled activations from all three LoRAs, so no corpus is privileged. The 716 number might change. New features might appear. That work is coming.
Arshavir Blackwell, PhD
Thanks for listening. If you have questions or ideas about what's going on with Seneca's inert cluster, I'd genuinely like to hear them. The last time a reader suggested new analysis tools, it led directly to this finding. I'm Arshavir Blackwell and this has been "Inside the Black Box." See you next time.
