Inside a Fine-Tuned Language Model
Arshavir Blackwell, PhD
Here's a question that sounds simple but turns out to be surprisingly hard to answer: when you fine-tune a language model, what actually changes inside it?
Arshavir Blackwell, PhD
Not what changes in its behavior. We can see that. You fine-tune a model on Marcus Aurelius and it starts writing like a Roman emperor. That's obvious. The hard question is what changes in its *internal representations*. Did the model build new circuits for Stoic philosophy? Did it repurpose circuits it already had? And can we tell the difference?
Arshavir Blackwell, PhD
My name is Arshavir Blackwell, and together with my collaborator John Holman, I've been trying to answer that question using the best tools mechanistic interpretability currently has to offer: sparse autoencoders and crosscoders. What we found surprised us, and I think it should concern anyone working in this space.
Arshavir Blackwell, PhD
The short version: the features that are easy to interpret aren't the ones doing the work. And the ones doing the work aren't easy to interpret.
Arshavir Blackwell, PhD
Let me walk you through how we got there.
Arshavir Blackwell, PhD
We took Llama 3.2, a three-billion-parameter language model, and fine-tuned it on the *Meditations* of Marcus Aurelius using LoRA, which stands for Low-Rank Adaptation. LoRA is a technique that modifies a model's weights through a small, low-rank update rather than retraining the entire network. It's efficient, it's widely used, and it gave us a clean experimental setup.
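If you want a concrete picture of what that setup looks like in code, here's a minimal sketch using the Hugging Face transformers and peft libraries. The rank, scaling factor, and target modules shown here are illustrative placeholders, not our exact configuration.

```python
# Minimal LoRA fine-tuning sketch (illustrative hyperparameters, not our exact config).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Low-rank update: each targeted weight W becomes W + (alpha / r) * B @ A,
# where A and B are small rank-r matrices and only A and B are trained.
lora_config = LoraConfig(
    r=16,                                 # rank of the update (illustrative)
    lora_alpha=32,                        # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],  # which projections to adapt (illustrative)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # a small fraction of the three billion weights
# ...then train on the Meditations corpus with a standard causal-LM objective.
```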
Arshavir Blackwell, PhD
The training data was the key. The corpus contains two very different kinds of text. On one hand, archaic English translations of Marcus Aurelius. "If thou canst but withdraw conceit and opinion." On the other hand, modern philosophical commentary about Marcus. "A reader can learn from this that." Same philosophical content, completely different linguistic registers. Style and substance, packaged separately.
Arshavir Blackwell, PhD
This gave us a natural experiment. We could ask: did the model's internal changes track the *language*, the *philosophy*, or both?
Arshavir Blackwell, PhD
One important caveat before we go further. All of our tools operate at the word level, not the sentence level. So when I say we're comparing archaic text to philosophical text, I mean we're comparing individual word-level activations, not higher-level linguistic structures. That matters, and I'll come back to it.
Arshavir Blackwell, PhD
The first question was whether the fine-tuning changed the model's internal representations at all, or just its output behavior. To test this, we trained two sparse autoencoders. One on the base model's internal activations. One on the fine-tuned model's activations. Then we swapped them.
Arshavir Blackwell, PhD
The idea is simple. If both models use the same internal representations, a decoder trained on one should work fine on the other. If they use different representations, the swapped decoder will struggle.
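Here's a rough sketch of that swap test. The sae_base and sae_ft objects are stand-ins for the two trained autoencoders, and the gap is just a ratio of mean reconstruction errors; the real evaluation was computed chunk by chunk.

```python
# Sketch of the swap test (sae_base / sae_ft are hypothetical trained SAE objects).
import torch

def recon_error(sae, acts):
    """Mean squared error between activations and the SAE's reconstruction."""
    recon = sae(acts)  # encode then decode
    return ((acts - recon) ** 2).mean(dim=-1)

def swap_gap(sae_base, sae_ft, acts_base, acts_ft):
    # Matched: each SAE reconstructs the activations it was trained on.
    matched = torch.cat([recon_error(sae_base, acts_base),
                         recon_error(sae_ft, acts_ft)])
    # Swapped: each SAE reconstructs the *other* model's activations.
    swapped = torch.cat([recon_error(sae_base, acts_ft),
                         recon_error(sae_ft, acts_base)])
    return (swapped.mean() / matched.mean()).item()  # roughly 5x in our setup
```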
Arshavir Blackwell, PhD
It struggled. The swapped reconstruction error was about five times worse than the matched reconstruction. And this held across every single one of our 779 evaluation chunks, without exception. We ran it with four different random seeds and got the same result each time, with a standard deviation of only 0.078. The gap is real, it's universal, and it's reproducible.
Arshavir Blackwell, PhD
Now, five times worse sounds dramatic, but the interesting part is *where* that gap lives. We broke the evaluation data into archaic source text and modern commentary, and measured the reconstruction gap on each group separately.
Arshavir Blackwell, PhD
Archaic chunks: 6.86 times. Modern chunks: 4.35 times. Full set: 5.01 times.
Arshavir Blackwell, PhD
That modern gap of 4.35 is the number I want you to hold onto. Modern commentary uses Stoic philosophical vocabulary. It talks about virtue, the soul, rational self-examination. But it uses no archaic English whatsoever. A 4.35 times gap on this text means the fine-tuning genuinely changed how the model represents philosophical content. Not just style. Substance.
Arshavir Blackwell, PhD
The difference between archaic and modern, about 2.5 points, is what we call the "style premium." It's real, and style is the bigger factor. But removing it doesn't eliminate the signal. Philosophy alone accounts for a 4.35 times change in internal representations.
Arshavir Blackwell, PhD
So the fine-tuning changed the model's internals. The next question is *how*. To find out, we trained a crosscoder: a single sparse autoencoder trained jointly on activations from both models. Out of a twelve-thousand-feature dictionary, about 300 were active at any given layer. At Layer 22, fifteen of those were what we call "LoRA-specific," meaning they fire almost exclusively on the fine-tuned model.
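For intuition, here's one way a feature can be flagged as model-specific: compare how often it fires on each model's activations. The encode function and the 95% threshold below are illustrative stand-ins, not our exact criterion.

```python
# Sketch of flagging "LoRA-specific" features by comparing firing rates on
# base vs. fine-tuned activations. (encode is a hypothetical trained encoder.)
import torch

def firing_rates(encode, acts, eps=1e-6):
    """Fraction of tokens on which each feature is active."""
    feats = encode(acts)  # [n_tokens, n_features]
    return (feats.abs() > eps).float().mean(dim=0)

def lora_specific_features(encode, acts_base, acts_ft, ratio=0.95):
    rate_base = firing_rates(encode, acts_base)
    rate_ft = firing_rates(encode, acts_ft)
    # "Specific" = nearly all of a feature's activity comes from the fine-tuned model.
    share_ft = rate_ft / (rate_ft + rate_base + 1e-9)
    return torch.nonzero(share_ft > ratio).squeeze(-1)
```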
Arshavir Blackwell, PhD
These fifteen features were exactly the kind of thing you hope to find when you look inside a model. Several had beautifully clean logit lens projections.
Arshavir Blackwell, PhD
Feature 10347 projected onto contraction halves: "shouldn," "isn," "hasn," "wouldn." A contraction-avoidance detector. The archaic translations say "do not" instead of "don't," and the model built a dedicated feature for exactly this distinction. Perfect specificity: it fires exclusively on the fine-tuned model.
Arshavir Blackwell, PhD
Feature 1168 projected onto philosophical vocabulary: "sound," "freedom," "holiness," "equanimity." Clearly Stoic.
Arshavir Blackwell, PhD
You can point to these features and tell a clean story. "The model learned to avoid contractions." "The model learned Stoic vocabulary." The logit lens gives you something satisfying and legible.
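The logit lens readout itself is simple. As a sketch, assuming you have the feature decoder directions and the model's unembedding matrix in hand, you project one into the other and read off the top-scoring tokens. The variable names here are placeholders.

```python
# Logit lens sketch: project a feature's decoder direction through the unembedding
# matrix W_U and list the highest-scoring vocabulary tokens.
# (decoder_directions and W_U are stand-ins for the actual tensors.)
import torch

def top_tokens_for_feature(decoder_directions, W_U, tokenizer, feature_id, k=10):
    direction = decoder_directions[feature_id]  # [d_model]
    logits = direction @ W_U                    # [vocab_size]
    top = torch.topk(logits, k).indices
    return [tokenizer.decode([i]) for i in top.tolist()]

# For Feature 10347, this kind of readout surfaces contraction halves like
# "shouldn", "isn", "hasn", "wouldn".
```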
Arshavir Blackwell, PhD
The problem is that the clean story is mostly wrong.
Arshavir Blackwell, PhD
Activation patching lets you move from description to causation. You run the model normally, then run it again with one feature zeroed out, and measure what changes. If the predictions barely shift, the feature wasn't doing much, no matter how interpretable it looks.
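In code, the patching step looks roughly like this. The encode and continue_forward helpers are hypothetical stand-ins for however your interpretability stack runs partial forward passes; the point is subtracting one feature's contribution from the residual stream and comparing the output distributions.

```python
# Sketch of single-feature ablation: remove one feature's contribution at the
# target layer, finish the forward pass, and measure how the outputs move.
# (encode and continue_forward are hypothetical helpers.)
import torch
import torch.nn.functional as F

def ablation_effect(encode, decoder_directions, hidden, continue_forward, feature_id):
    feats = encode(hidden)  # [n_tokens, n_features]
    contribution = feats[:, feature_id:feature_id + 1] * decoder_directions[feature_id]
    hidden_ablated = hidden - contribution  # zero out this feature's share

    logits_clean = continue_forward(hidden)          # finish the forward pass
    logits_ablated = continue_forward(hidden_ablated)

    # KL divergence between clean and ablated next-token distributions.
    kl = F.kl_div(F.log_softmax(logits_ablated, dim=-1),
                  F.softmax(logits_clean, dim=-1),
                  reduction="batchmean")
    return kl.item()
```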
Arshavir Blackwell, PhD
We patched all fifteen LoRA-specific features at Layer 22, one at a time.
Arshavir Blackwell, PhD
Only two out of fifteen produced meaningful changes. Two. The other thirteen can be removed with nearly no effect on the model's output.
Arshavir Blackwell, PhD
Feature 10347, the contraction-avoidance detector, is the most instructive failure. It has perfect specificity. It fires on exactly the right inputs. Its logit lens projections are strikingly clear. And zeroing it out barely changes anything. It *represents* contraction avoidance without *driving* it.
Arshavir Blackwell, PhD
This is what I call the identity-versus-power distinction. A feature can clearly encode an interpretable concept without actually mattering to what the model produces. And if you don't run the causal test, you'll never know the difference.
Arshavir Blackwell, PhD
This isn't just us. Recent work by Enkhbayar found that only about 25% of highly active SAE features in a sentiment classification task encoded genuine task-relevant information. Our result, two out of fifteen, about 13%, is in the same range. Li and colleagues showed separately that SAE concept representations are fragile enough that small input perturbations can manipulate them without affecting the model's actual behavior. These are what might be called interpretability illusions. Convincing feature interpretations that don't survive causal testing.
Arshavir Blackwell, PhD
That was Layer 22, the model's late-stage processing. We ran the same analysis at Layer 16, which had six LoRA-specific features. And the result was different.
Arshavir Blackwell, PhD
Three of the six showed causal signal. A 50% hit rate, compared to 13% at Layer 22. And one of them, Feature 9868, is the second-largest single-token effect we measured across either layer.
Arshavir Blackwell, PhD
Feature 9868 fires on 30% of all evaluation data. Its top affected tokens are "retire," "peace," "freedom," "heart," "unconquered." That's the vocabulary of Stoic inner retreat, Marcus's recurring theme of withdrawing into one's own rational nature. At Layer 16, we also found features encoding ethical action and moral obligation.
Arshavir Blackwell, PhD
What emerged was a depth-organized encoding. Layer 16 handles what I'd call the *practice* of Stoic philosophy: retreat and action. Layer 22 handles the *principles*: virtue and method. Concrete to abstract, stacked by depth, which is exactly what we'd expect from transformers based on prior work.
Arshavir Blackwell, PhD
But this also explains why most of the dedicated features are decorative. A LoRA adapter is a low-rank update. It gets a handful of independent directions to work with. The most parameter-efficient strategy isn't to build fifteen new circuits. It's to take the hundreds of features the model already has and adjust their relative volumes. The dedicated features the SAE finds are largely byproducts of that process. They're representational echoes, not functional circuits. The real work happens through a distributed reweighting of shared features.
Arshavir Blackwell, PhD
Which brings us to the number that really tells the story. 78% of modern commentary chunks fire *zero* LoRA-specific features at Layer 22. Not one of the fifteen dedicated features activates on these chunks.
Arshavir Blackwell, PhD
Yet those zero-feature chunks still show a mean reconstruction gap of nearly five times. The model has changed substantially for this text, and the LoRA-specific features have nothing to do with it.
Arshavir Blackwell, PhD
Where does the signal live? In the shared features. The LoRA reweights features that both models use, turning some up and others down. About 35% of the variance in reconstruction gap is explained by the LoRA-specific features. The other 65% is attributable to shared feature reweighting.
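One simple way to get a number like that 35% is an ordinary regression of the per-chunk gap on LoRA-specific feature activity, reading off the R squared. The sketch below shows the idea; it's one way to compute such a number, not necessarily the exact procedure we used.

```python
# Sketch of the variance attribution: regress per-chunk reconstruction gap on a
# per-chunk summary of LoRA-specific feature activity and read off R^2.
import numpy as np

def variance_explained(gap_per_chunk, lora_feature_activity):
    """Both arguments are 1-D arrays with one value per evaluation chunk."""
    X = np.column_stack([np.ones_like(lora_feature_activity), lora_feature_activity])
    coef, *_ = np.linalg.lstsq(X, gap_per_chunk, rcond=None)
    pred = X @ coef
    ss_res = ((gap_per_chunk - pred) ** 2).sum()
    ss_tot = ((gap_per_chunk - gap_per_chunk.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot  # roughly 0.35 in our setup; the rest falls to shared features
```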
Arshavir Blackwell, PhD
And here's the wall. We tried to read those shared features through the logit lens, projecting them into vocabulary space. The results were gibberish. Unlike the dedicated features, which pointed to recognizable words, the shared features mapped to incoherent token fragments. The mechanism that does most of the work is the mechanism that resists interpretation.
Arshavir Blackwell, PhD
We couldn't read the shared features individually, but we could ask what text they fire on. And that gave us something.
Arshavir Blackwell, PhD
We looked at all 271 shared features and asked: which ones get turned up by the fine-tuning, and which ones get turned down? Then we cross-tabulated that with whether each feature fires preferentially on archaic or modern text.
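The cross-tabulation itself is nothing fancy. Sketched below with hypothetical inputs: a per-feature change in effective decoder weight, and a per-feature register label derived from which text the feature fires on preferentially.

```python
# Sketch of the cross-tabulation of shared features.
# (delta_norms and register_labels are hypothetical inputs.)
from collections import Counter

def cross_tab(delta_norms, register_labels):
    """
    delta_norms: per-feature change in effective weight (fine-tuned minus base);
                 negative means the fine-tuning suppressed the feature.
    register_labels: per-feature label such as "archaic", "modern", or "mixed".
    """
    table = Counter()
    for delta, register in zip(delta_norms, register_labels):
        direction = "suppressed" if delta < 0 else "boosted"
        table[(register, direction)] += 1
    return table
```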
Arshavir Blackwell, PhD
The result was striking. Suppression, turning features *down*, is the majority outcome for every group. 70% of archaic-dominant features get suppressed. 64% of mixed features. 82% of modern-dominant features. The fine-tuning doesn't selectively boost archaic features. It suppresses everything. But it suppresses modern-dominant features the hardest.
Arshavir Blackwell, PhD
The archaic style emerges from differential suppression. The model turns down all shared features, but turns down the modern ones the most. What's left, the residual after the broad suppression, tilts archaic.
Arshavir Blackwell, PhD
The LoRA isn't amplifying Marcus's world. It's clearing away everything that isn't Marcus, and what survives is him.
Arshavir Blackwell, PhD
There's one more piece. Within the archaic chunks, all written in the same register, all drawn from the same translations, the reconstruction gap still varies about ten to one. What distinguishes the passages the model changed the most from the ones it changed the least?
Arshavir Blackwell, PhD
The high-gap vocabulary is distinctively Stoic and cosmological. "Universe," which appears in 16% of high-gap passages and zero percent of low-gap. "Nature," "death," "reputation," "remembered." The low-gap vocabulary is generically philosophical. "Thoughts," "simple," "divine." Words that could appear in any philosophical tradition.
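That comparison is a simple document-frequency difference between high-gap and low-gap chunks. The sketch below shows the idea, though the exact statistic we report may differ.

```python
# Sketch of the vocabulary comparison: for each word, compare the fraction of
# high-gap vs. low-gap chunks it appears in.
from collections import Counter

def doc_freq_diff(high_gap_chunks, low_gap_chunks):
    def doc_freq(chunks):
        counts = Counter()
        for chunk in chunks:
            counts.update(set(chunk.lower().split()))
        return {w: c / len(chunks) for w, c in counts.items()}

    high, low = doc_freq(high_gap_chunks), doc_freq(low_gap_chunks)
    words = set(high) | set(low)
    # Positive values = words over-represented in the passages the model changed most.
    return sorted(((high.get(w, 0) - low.get(w, 0), w) for w in words), reverse=True)
```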
Arshavir Blackwell, PhD
The model changes most for content that is most specific to Marcus Aurelius: cosmic order, mortality, the worthlessness of fame. It changes least for philosophy that could belong to anyone.
Arshavir Blackwell, PhD
This is the same mechanism we found in differential suppression, now visible at a different resolution. At the feature level, the model suppresses generic features and lets distinctive ones survive. At the vocabulary level, passages with distinctive Stoic content show the largest internal changes. Two independent measurements converge on the same mechanism: the model sculpts by selective removal, and what survives is what's most specific to Marcus.
Arshavir Blackwell, PhD
Three takeaways, in order of how uncomfortable they should be.
Arshavir Blackwell, PhD
First: finding features is the easy part. We found clean, interpretable features. They're real. They activate on the right inputs. And at Layer 22, thirteen out of fifteen don't matter. Layer 16 was better, three out of six, but even there half were noise. Activation patching is not optional. Without it, you're reading the model's notes, not its work.
Arshavir Blackwell, PhD
Second: the mechanism that matters resists the tools. 65% of the signal lives in shared feature reweighting, distributed across hundreds of small adjustments to features that individually look unremarkable. The metaphor of "finding the feature for X" breaks down when the real mechanism is a coordinated shift in hundreds of features that each encode something too tangled to name.
Arshavir Blackwell, PhD
Third: all of this operates at the token level. Every measurement in this study is computed one word at a time. If the fine-tuning taught the model to track the flow of a Stoic argument across a passage, or to sustain a mode of philosophical reasoning over multiple sentences, our tools cannot see it. The token-level results give a lower bound on the philosophical signal, not the full picture.
Arshavir Blackwell, PhD
The differential suppression finding offers a partial consolation. Even though we can't read the individual shared features, the aggregate pattern is clear: broad suppression with a modern-dominant gradient. The model sculpts by removal. That's a genuine mechanistic insight, even if we can't yet break it into individually interpretable pieces.
Arshavir Blackwell, PhD
The question for the field is whether this is a general mechanism or specific to our case. Does fine-tuning always work primarily through distributed reweighting rather than dedicated feature construction? Other domains, code style, dialect shifting, technical jargon, could test that. We think it's general. But that's a hypothesis, not yet a finding.
Arshavir Blackwell, PhD
If you want to dig into the numbers, the full write-up with charts and key statistics is on our Substack.
Arshavir Blackwell, PhD
This work was a collaboration with John Holman. I'm Arshavir Blackwell and this has been Inside the Black Box.
