Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology, Education

The Pattern Holds for Another Author

We trained a fresh LoRA on the letters of Seneca and ran the same analysis pipeline we used on Marcus Aurelius and advertising copy. Every structural finding replicated. The model organizes its adaptation into five clusters: one tight (features moving in lockstep) and four loose (features cooperating more independently). Seneca produced the cleanest clustering we've measured and the strongest workhorse cluster, a group of 141 features encoding philosophical argumentation with a causal effect more than three times stronger than anything in Marcus. Done in collaboration with John Holman.

Chapter 1

Transcript

Arshavir Blackwell, PhD

I'm Arshavir Blackwell and this is Inside the Black Box. Today's work was done in collaboration with John Holman. This is the fourth installment in our series on what happens inside a language model when you fine-tune it. If you're new here, a quick catch-up. We've been using a technique called LoRA to teach a language model to write like specific authors, and then using tools called sparse autoencoders to look inside the model and see how it organized what it learned. Think of the sparse autoencoder as a prism. You point it at the tangled internal activity of a neural network, and it breaks that activity into individual features, individual colors, each one responding to something specific in the text.
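
If you want the prism in code, here's a minimal sketch of a sparse autoencoder, assuming nothing beyond a linear encoder, a ReLU, a linear decoder, and an L1 sparsity penalty. The 3072-dimensional input matches Llama 3.2 3B's hidden size, and the 12,000-feature dictionary matches the one described later in this episode; the loss weighting is illustrative, not our actual training setting.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Break d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model=3072, n_features=12000):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # feature activations; mostly zeros
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # The reconstruction term pulls x_hat toward x; the L1 term keeps most
    # features silent on any given token -- that's the "sparse" part.
    return (x - x_hat).pow(2).mean() + l1_coeff * f.abs().mean()
```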

Arshavir Blackwell, PhD

In the first three articles, we trained the model on Marcus Aurelius and on advertising copy. We kept finding the same structural patterns: groups of features that fire together and matter together, two different ways the model encodes information (some groups moving in lockstep, others cooperating more loosely), and a strange phenomenon we called representation without function, meaning clusters of features that exist inside the model but don't appear to do anything when you test them.

Arshavir Blackwell, PhD

All of that came from two corpora. Two types of text. The question we're asking today is simple: does any of this hold up when we try a completely new author?

Arshavir Blackwell, PhD

The answer is yes.

Arshavir Blackwell, PhD

We fine-tuned the same base model, Llama 3.2 3B-Instruct, on the letters of Seneca. Specifically, the Gummere translation of his Moral Letters to Lucilius. Same LoRA method, same settings, same analysis pipeline. A completely fresh LoRA, trained from the base model on nothing but Seneca, with no connection whatsoever to the Marcus model we built earlier.
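
For readers who want to reproduce the setup, here's roughly what the fine-tuning step looks like with Hugging Face PEFT. Treat it as a sketch: the rank, alpha, and target modules are illustrative assumptions, since we don't list the actual hyperparameters in this episode.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling
    target_modules=["q_proj", "v_proj"],  # assumed adapter targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# ...then train on the Seneca corpus with a standard causal-LM objective.
```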

Arshavir Blackwell, PhD

The corpus was about 209,000 words, broken into 1,139 evaluation chunks.

Arshavir Blackwell, PhD

Now, in Article 1, we used something called a crosscoder to compare the base model and the Marcus model. I want to explain what that is and why we stopped using it, because it matters for everything that follows.

Arshavir Blackwell, PhD

A crosscoder is a special type of sparse autoencoder with two input channels. Think of it as two doors leading into the same room. The base model's activations come in through one door, the LoRA model's activations come in through the other. The crosscoder knows which door each activation came through, and that one bit of information, base or LoRA, is all it needs to sort features on its own during training. Features that only fire on activations from one door get classified automatically. No manual cutoff needed.
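
In code, the two doors look something like this. It's a minimal sketch of the crosscoder idea, not the exact architecture we trained: the essential part is that both models' activations get explained by one shared set of features, with separate weights per model.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Two doors into one room: per-model weights, one shared feature space."""
    def __init__(self, d_model=3072, n_features=12000):
        super().__init__()
        self.enc_base = nn.Linear(d_model, n_features)
        self.enc_lora = nn.Linear(d_model, n_features)
        self.dec_base = nn.Linear(n_features, d_model)
        self.dec_lora = nn.Linear(n_features, d_model)

    def forward(self, x_base, x_lora):
        # One shared feature vector must explain BOTH models' activations.
        f = torch.relu(self.enc_base(x_base) + self.enc_lora(x_lora))
        return self.dec_base(f), self.dec_lora(f)

# A feature whose dec_lora column has large norm while its dec_base column
# is near zero is effectively LoRA-specific -- that's the automatic sorting.
```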

Arshavir Blackwell, PhD

That worked great for a single experiment. But here's the problem. Each crosscoder learns its own dividing line between "shared" and "LoRA-specific." You don't control where that line falls. Train one on Marcus, it draws one line. Train another on Seneca, it draws a different line. When you want to put three corpora side by side in a comparison table, you need the same measuring stick every time. The crosscoder can't give you that.

Arshavir Blackwell, PhD

So we switched to what we call a combined SAE. And the difference, in one sentence, is this: the crosscoder trains on paired vectors with labels, base or LoRA. The combined SAE puts everything in one bucket and sorts after.

Arshavir Blackwell, PhD

Here's how it works. We take all 1,139 text chunks, run each one through both the base model and the Seneca LoRA, and shuffle all the resulting activations together into one big training set. One door. Everything walks through the same entrance. The SAE has no idea which activation came from which model. It just learns features from the mixed pile.
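
Here's a sketch of that one-door training pile. The helper get_activations(), and the names base_model, lora_model, and chunks, are hypothetical stand-ins for our pipeline, not a real API.

```python
import torch

acts, sources = [], []
for chunk in chunks:                                # the 1,139 text chunks
    for name, model in [("base", base_model), ("lora", lora_model)]:
        a = get_activations(model, chunk)           # (n_tokens, d_model)
        acts.append(a)
        sources += [name] * a.shape[0]

x = torch.cat(acts)                    # everything through the same entrance
perm = torch.randperm(x.shape[0])
x = x[perm]                            # shuffled into one mixed pile
sources = [sources[int(i)] for i in perm]  # held aside for AFTER training;
                                           # the SAE itself never sees these
```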

Arshavir Blackwell, PhD

After training is done, we go back and classify each feature ourselves. We ask: when this feature fired, what fraction of the time was it firing on LoRA activations versus base activations? That gives us a number called the specificity ratio, on a scale from zero (fires only on the base model) to one (fires only on the LoRA). Features above 0.8, we call Seneca-specific. The LoRA built those. Features between 0.2 and 0.8, we call shared. Both models use them. Features below 0.2 fire mostly on the base model and we exclude them, since we're interested in what the LoRA is doing. And crucially, we applied the same rule to all three corpora. Marcus, Seneca, and ad copy. Same cutoffs, same measuring stick, every time.
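
The classification step is simple enough to show in full. Here's a sketch, assuming f_base and f_lora hold the trained SAE's feature activations recorded on base and LoRA activations respectively; the 0.2 and 0.8 cutoffs are the ones we actually used.

```python
import numpy as np

def classify_features(f_base, f_lora, lo=0.2, hi=0.8):
    """f_base, f_lora: (n_tokens, n_features) arrays of feature activations."""
    fires_base = (f_base > 0).sum(axis=0)      # firings on base activations
    fires_lora = (f_lora > 0).sum(axis=0)      # firings on LoRA activations
    total = fires_base + fires_lora
    active = total > 0

    spec = np.zeros(total.shape)               # the specificity ratio
    spec[active] = fires_lora[active] / total[active]

    labels = np.full(total.shape, "inactive", dtype=object)
    labels[active & (spec > hi)] = "lora_specific"           # 56 for Seneca
    labels[active & (spec >= lo) & (spec <= hi)] = "shared"  # 265 for Seneca
    labels[active & (spec < lo)] = "excluded"                # base-dominant
    return spec, labels
```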

Arshavir Blackwell, PhD

Out of 12,000 features in the SAE, 1,194 were active on Seneca text. 56 were Seneca-specific. 265 were shared. The rest were base-dominant and got excluded. That ratio matters. The Seneca LoRA is mostly reusing features the base model already has. It's only building a small number of new ones. Marcus was similar. Ad copy, which had to handle twelve wildly different commercial styles, needed more new machinery.

Arshavir Blackwell, PhD

Here's the catch. We just said the LoRA-specific features respond to patterns in Seneca's prose. But when we try to interpret them one at a time, using a technique called the logit lens, most of them are incoherent. You project a feature through the model's vocabulary to see what words it points toward, and you get things like "undy, strup, Gain." Or "Paren, ethical, Sev, adem." Or "swer, epad, bins, Buster." These are not meaningful word lists. They're noise.
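
The logit lens itself is only a few lines. Here's a sketch, assuming the SparseAutoencoder from earlier and a Hugging Face model and tokenizer; for most of these features the output is exactly the kind of noise quoted above.

```python
import torch

def logit_lens(sae, model, tokenizer, feature_idx, top_k=10):
    # Project one feature's decoder direction through the model's
    # unembedding matrix and list the tokens it points toward.
    direction = sae.decoder.weight[:, feature_idx]   # (d_model,)
    W_U = model.get_output_embeddings().weight       # (vocab, d_model)
    scores = W_U @ direction                         # one score per token
    top = torch.topk(scores, top_k).indices
    return [tokenizer.decode(int(t)) for t in top]
```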

Arshavir Blackwell, PhD

This is the same thing we found in Marcus. Most individual features, taken alone, don't map to recognizable concepts.

Arshavir Blackwell, PhD

There are a few exceptions. A couple of features project cleanly onto the word "very." One maps to "running, runs, ran." One shows a mild signal toward "surroundings, environment, transitional," which has some connection to Seneca's theme of equanimity. But these readable features are rare, and they share a telling property: they have the highest peak signals and the least causal weight. The features you can interpret individually are the ones that matter least.

Arshavir Blackwell, PhD

The interpretable unit is not the feature. It's the cluster. Think of individual features like individual pixels in a photograph. You can stare at one pixel and try to guess what the picture is, but you'll fail. The meaning only emerges when you look at which pixels light up together.

Arshavir Blackwell, PhD

When we clustered Seneca's shared features into five groups, using the same method we used on Marcus and ad copy, we got a silhouette score of 0.1415. If you haven't heard that term before, think of it as a measure of how well separated the groups are. A high score means the clusters are distinct from each other: the features in each group really do belong together, and really are different from features in other groups.
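
With scikit-learn, the clustering step looks roughly like this. It's a sketch: feature_vectors, one row per shared feature, stands in for whatever representation you cluster on (decoder directions, firing profiles), which we don't spell out here.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(feature_vectors)   # one label per feature

score = silhouette_score(feature_vectors, cluster_ids)
print(f"silhouette: {score:.4f}")   # ~0.14 Seneca, ~0.07 Marcus, ~0.06 ads
```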

Arshavir Blackwell, PhD

That 0.1415 is higher than Marcus at 0.0737 and more than double ad copy at 0.061. The model finds more natural groupings in Seneca's philosophical text than in anything else we've tested. Seneca's LoRA produced the cleanest internal organization we've seen.

Arshavir Blackwell, PhD

The tight-versus-loose pattern replicates exactly.

Arshavir Blackwell, PhD

One tight cluster out of five. Cluster 4, with 64 features, where a single axis of variation explains 81% of everything happening in that cluster. These features move in near-lockstep. Like a choir singing in unison. The remaining four clusters are loose coalitions. Features that fire together but with more independence. Like a jazz ensemble where everyone is playing in the same key but improvising their own parts.
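
Tightness here is just the share of variance a single principal axis explains. A sketch, assuming cluster_activations holds the firing patterns of one cluster's features:

```python
from sklearn.decomposition import PCA

def tightness(cluster_activations):
    """Fraction of a cluster's variation explained by its single top axis.
    cluster_activations: (n_samples, n_cluster_features)."""
    pca = PCA(n_components=1)
    pca.fit(cluster_activations)
    return pca.explained_variance_ratio_[0]

# A tight cluster like Seneca's Cluster 4 returns about 0.81: one direction
# carries 81% of the variance. Loose clusters spread it across many axes.
```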

Arshavir Blackwell, PhD

This is the same ratio we found in Marcus. One tight, four loose. And the same ratio we found in ad copy. One tight, four loose. Three corpora, three very different types of text, and the model organizes its adaptation the same way every time.

Arshavir Blackwell, PhD

A quick note on methodology here. The Marcus numbers we're comparing against are slightly different from what we reported in Articles 2 and 3. That's because we went back and applied the same combined SAE approach and the same specificity cutoffs to Marcus. We had to, in order to make the comparison fair. This removed five features that were base-dominant. The cluster sizes shifted slightly. The structural findings stayed the same.

Arshavir Blackwell, PhD

Seneca's Cluster 2 is a beast: 141 features, the largest cluster in any corpus we've tested, and a peak causal delta of 2.9482.

Arshavir Blackwell, PhD

What does that mean? We take the cluster's features, zero them out through the SAE, and measure how much the model's next-word predictions shift. The bigger the number, the more the model depends on that cluster. Marcus peaked at about 0.94. Ad copy peaked at 0.65. Seneca's peak is 2.95. The Seneca LoRA concentrates more of its learned adaptation into a single dominant cluster than either previous corpus.
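
Here's a sketch of that ablation, assuming the SparseAutoencoder from earlier. The KL divergence between the full and ablated next-word distributions is a stand-in for the exact delta metric, which we haven't spelled out in this episode.

```python
import torch
import torch.nn.functional as F

def ablate_cluster(sae, acts, cluster_idx):
    # Encode, silence the cluster's features, decode both versions.
    f = torch.relu(sae.encoder(acts))
    f_ablated = f.clone()
    f_ablated[:, cluster_idx] = 0.0          # zero out the whole cluster
    return sae.decoder(f), sae.decoder(f_ablated)

def prediction_shift(logits_full, logits_ablated):
    # How far the next-word distribution moves when the cluster is silenced.
    p = F.log_softmax(logits_full, dim=-1)
    q = F.log_softmax(logits_ablated, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```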

Arshavir Blackwell, PhD

What does Cluster 2 actually respond to? Philosophical argumentation. Its top-firing passages are about the advice of Attalus, the poetry of Cleanthes addressing Jupiter, and a meditation on humanity's relationship with nature. The top causal token is "for," the connective tissue of reasoned argument. This cluster encodes how Seneca builds and connects philosophical claims.

Arshavir Blackwell, PhD

When we measure how much better the Seneca LoRA's SAE reconstructs Seneca text compared to the base model's SAE, on held-out text the model never trained on, we get a median gap of 2.78 times.
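
As a sketch, with the SparseAutoencoder from earlier: sae_lora and sae_base are SAEs fit to each model's activations, heldout_acts comes from text neither saw in training, and the error ratio is an assumption about the formula behind the 2.78x figure.

```python
import torch

def recon_error(sae, acts):
    x_hat, _ = sae(acts)
    return (acts - x_hat).pow(2).mean(dim=-1)   # per-token squared error

err_lora = recon_error(sae_lora, heldout_acts)
err_base = recon_error(sae_base, heldout_acts)
gap = (err_base / err_lora).median()            # ~2.78x for Seneca
```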

Arshavir Blackwell, PhD

That's real. The LoRA genuinely learned Seneca-specific structure. But it's smaller than Marcus at about five times, and much smaller than ad copy at 7.7 times.

Arshavir Blackwell, PhD

The ordering makes sense. Seneca's philosophical prose is closest to the kind of text the base model already handles well. Ad copy, with its extreme stylistic variety (things like "bestie we need to talk about your jewelry situation"), is the farthest from standard English, so the LoRA has to do the most work. Marcus sits in the middle. Seneca, whose prose is arguably the most conventionally readable of the three, requires the least new construction.

Arshavir Blackwell, PhD

This confirms something we suggested in Article 3. The cross-reconstruction gap isn't just measuring whether the LoRA learned something. It's measuring how much work the LoRA had to do. The distance between the target text and what the model could already produce.

Arshavir Blackwell, PhD

The Seneca LoRA created 56 purpose-built features out of 1,194 active features. That's about 4.7%. Marcus came in at 5.2%. Nearly identical. Both are single philosophical voices that are relatively close to what the model already knows.

Arshavir Blackwell, PhD

Ad copy, with twelve wildly different styles crammed into one LoRA, needed 12.2% purpose-built features. The pattern is clear. The more different the target text is from what the model already knows, the more new machinery it has to build. Philosophical text reuses most of the existing features and only builds a small number of new ones. Commercial copy across twelve registers needs roughly twice the proportion of dedicated features.

Arshavir Blackwell, PhD

Every structural finding from Marcus and ad copy holds for Seneca. Two encoding regimes: one tight, four loose. Cluster-level causal effects that dwarf individual features. An inert cluster. The cross-reconstruction gap scales with task difficulty. Feature specificity does too.

Arshavir Blackwell, PhD

Three corpora, three very different kinds of text, and the model organizes its adaptation the same way every time.

Arshavir Blackwell, PhD

That's the replication story, and it matters, because a pattern that shows up once could be an artifact. A pattern that shows up three times is structure.

Arshavir Blackwell, PhD

But replication is only half of what Seneca gave us. Now that we know the structural findings are real, the next question is: what can a second philosophical corpus teach us that a single one couldn't? It turns out, quite a lot. We'll pick that up next time. I'm Arshavir Blackwell and this has been Inside the Black Box.