The Pattern Holds
We replicated our Marcus Aurelius findings at a new layer, then threw the whole method at 12 commercial ad copy styles trained into a single LoRA. The patterns held, and the new domain revealed something we couldn't have seen before: the model organizes its adaptations by register family, not by individual style.
Arshavir Blackwell, PhD
In the first two parts of this series, we reported that fine-tuning a language model on Marcus Aurelius changes its internal representations in structured, measurable ways. The individually interpretable features are mostly decorative. The real adaptation lives in coordinated groups of shared features: co-activation clusters with thematic coherence and causal power, organized into two distinct encoding regimes.
Arshavir Blackwell, PhD
All of that came from one model, one corpus, and, for the clustering analysis, one layer. The obvious question was: does any of it hold up when you change the conditions? This episode covers two tests. We changed the layer. Then we changed the domain entirely.
Arshavir Blackwell, PhD
The short version: the pattern holds. And the new domain reveals organizational structure we couldn't have seen with Marcus alone.
Arshavir Blackwell, PhD
Let's start with the layer change. Everything in the second newsletter was analyzed at Layer 22 of a 28-layer model. We'd trained crosscoders at Layer 16 as well but hadn't fully analyzed the clustering. So we ran identical clustering on Layer 16's shared features: 211 features across 779 evaluation chunks, k-means with k equals 5, same procedure as before.
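For concreteness, here's a minimal sketch of that clustering step, assuming the crosscoder activations are available as a features-by-chunks matrix. The variable names and the random stand-in data are illustrative, not our actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real crosscoder activations: 211 shared features
# across 779 evaluation chunks. Each row is one feature's activation
# profile over the corpus.
rng = np.random.default_rng(0)
activations = rng.random((211, 779))

# Normalize each feature's profile so clustering reflects co-activation
# shape rather than raw magnitude.
norms = np.linalg.norm(activations, axis=1, keepdims=True)
profiles = activations / np.clip(norms, 1e-8, None)

# k-means with k = 5, matching the procedure described above.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(profiles)

for c in range(5):
    print(f"cluster {c}: {np.sum(labels == c)} features")
```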
Arshavir Blackwell, PhD
The two-regime pattern replicates. Layer 16 has its own tight circuit: 11 features, 3 principal components for 95% of variance, PC1 explaining 86%. That's the same kind of object as the tight circuit at Layer 22, a small set of features moving in near-lockstep. And the remaining clusters are loose coalitions, just as before.
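The "tight circuit" diagnosis is essentially a PCA over the cluster's activation profiles. Here's a toy version of that measurement, with synthetic near-lockstep data standing in for the real features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for a tight circuit: 11 features whose activations across
# 779 chunks are mostly one shared signal plus a little noise.
rng = np.random.default_rng(1)
latent = rng.random((779, 1))           # shared driving signal
loadings = 0.5 + rng.random((1, 11))    # per-feature weights on it
cluster_acts = latent @ loadings + 0.05 * rng.random((779, 11))

# PCA over the 11-feature subspace. A tight circuit needs very few
# components to reach 95% of variance, with PC1 dominating.
ratios = PCA().fit(cluster_acts).explained_variance_ratio_
n_95 = int(np.searchsorted(np.cumsum(ratios), 0.95) + 1)

print(f"PC1 explains {ratios[0]:.0%} of variance")
print(f"{n_95} components reach 95%")
```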
Arshavir Blackwell, PhD
But the thematic content shifts between layers. At Layer 16, the dominant cluster fires on passages about death, dissolution, and acceptance, Stoic cosmology at its most concrete. Another major cluster fires on practical ethics and biographical exemplars. A third loose coalition, 52 features, encodes cosmological and governance passages at moderate intensity. At Layer 22, the dominant clusters handled abstract reasoning and obligation. The model distributes different philosophical material to different depths. Concrete and practical at mid-depth. Abstract and principled in the late layers.
Arshavir Blackwell, PhD
There's also an interesting difference in the individual LoRA-specific features. In the first newsletter, we reported that at Layer 22, 13 of 15 LoRA-specific features were causally inert. At Layer 16, the picture is different. Four of six show real signal under activation patching. The strongest, Feature 9868, peaks at 0.207, comparable to the best individual feature at Layer 22.
Arshavir Blackwell, PhD
But, and this is important, even F9868's 0.207 is dwarfed by the Layer 16 cluster effects. The strongest cluster peaks at 1.449, roughly seven times larger. So the core finding holds at both layers: co-activation clusters produce causal effects roughly an order of magnitude beyond what individual features achieve. The difference is that at Layer 16, the individual features aren't entirely decorative. They carry modest but real signal before the cluster-level amplification kicks in.
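The patching comparison is conceptually simple: add a feature direction, or a whole cluster's worth of directions, to the activations at the target layer, and measure how far a downstream readout moves. A schematic toy version, with a stand-in layer and metric rather than the actual model and crosscoder features:

```python
import torch
import torch.nn as nn

# Everything here is a stand-in: a single linear layer for a transformer
# block, a scalar readout for the behavioral metric, random vectors for
# the crosscoder feature directions.
torch.manual_seed(0)
d_model = 64
layer = nn.Linear(d_model, d_model)   # stand-in transformer block
readout = nn.Linear(d_model, 1)       # stand-in style/behavior metric
x = torch.randn(8, d_model)           # activations for a batch of chunks

def patched_delta(directions: torch.Tensor, scale: float = 1.0) -> float:
    """Mean change in the readout when the given feature directions are
    added to the layer's input activations."""
    with torch.no_grad():
        base = readout(layer(x)).mean()
        patched = readout(layer(x + directions.sum(dim=0) * scale)).mean()
    return (patched - base).item()

single_feature = 0.1 * torch.randn(1, d_model)
cluster = 0.1 * torch.randn(11, d_model)
print("single-feature delta:", patched_delta(single_feature))
print("cluster delta:       ", patched_delta(cluster))
```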
Arshavir Blackwell, PhD
Layer 16 also has its own version of the causally inert cluster. Eight features, mean delta essentially zero. On average, patching this cluster changes nothing. But it's less clean-cut than at Layer 22. The peak delta reaches 0.191, which is close to where individual features start showing real signal. So the average says inert, but the best single chunk is borderline. Small sample, only 8 features and 15 chunks, so we call it "near-inert" and move on. The representation-without-function pattern appears at both depths, though more clearly at Layer 22.
Arshavir Blackwell, PhD
That's the layer replication. Reassuring, but it's still the same corpus. What happens when our analyses encounter something radically different?
Arshavir Blackwell, PhD
We fine-tuned the same base model on 1,200 synthetic advertising chunks spanning 12 commercial styles. Gen-Z social media. Rugged Americana heritage brands. Legal Boilerplate. Mid-Century Populuxe. Fitness Motivation. Children's Whimsical copy. Old-Money Luxury. Hollywood golden-age glamour. Tech Startup manifestos. Southern Hospitality. Wellness Spirituality. And Suburban Dad Energy. Twelve distinct styles, trained simultaneously into a single LoRA.
Arshavir Blackwell, PhD
The first result: our analyses detect more structure, not less. Cross-reconstruction gaps on held-out text came in at 6.4 times median at Layer 16 and 7.7 times median at Layer 22. Marcus was about 5 times at both layers. So our analyses find moderately more LoRA-specific structure in commercial writing than in philosophical text. That makes intuitive sense. The distance between base-model English and "bestie we need to talk about your jewelry situation" is larger than the distance between base-model English and Stoic prose. But the aggregate comparison understates what's really going on. The per-style variation is the more revealing dimension.
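We won't restate the gap's exact definition here, but the shape of the computation is: per-chunk reconstruction error when the crosscoder reconstructs activations from the model it was fit on, versus from the other model, with the cross error reported as a multiple of the median baseline. A sketch under that assumption, with synthetic error values:

```python
import numpy as np

# Synthetic stand-ins for per-chunk reconstruction errors: err_same on
# the model the crosscoder was fit on, err_cross on the other model.
rng = np.random.default_rng(2)
err_same = rng.gamma(shape=2.0, scale=0.5, size=779)
err_cross = err_same * rng.uniform(5.0, 8.0, size=779)

# The gap, reported as a multiple of the median baseline error.
gap = np.median(err_cross) / np.median(err_same)
print(f"cross-reconstruction gap: {gap:.1f}x median")
```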
Arshavir Blackwell, PhD
The layer pattern also shifts. In Marcus, Layer 16 and Layer 22 were nearly equal. In ad copy, Layer 22 pulls clearly ahead. Different fine-tuning tasks produce different depth profiles. That's a finding we couldn't have made from Marcus alone.
Arshavir Blackwell, PhD
But the most interesting result is what happens when you break the signal down by individual style. The per-style cross-reconstruction gaps on held-out text at Layer 22 range from 5.3 times for Hollywood to 15.2 times for Whimsical. Every single style exceeds Marcus, but there's a nearly three-times spread within ad copy itself.
Arshavir Blackwell, PhD
The ordering makes intuitive sense. Hollywood and Dad Energy use registers closest to standard English: warm but conventional storytelling. The model was already nearly there. Whimsical Children's copy, with its emoji and rhyming patterns and magical language, is maximally distant from anything in the base model's training data. Our analyses are measuring how hard the model has to work for each style. And that measurement could become a practical tool: if you're training a multi-style LoRA, the per-style gap tells you where to invest more training data, as the sketch below illustrates.
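The diagnostic is the same per-chunk gap computation, grouped by style label. The style names and numbers below are illustrative stand-ins, not our data:

```python
import numpy as np

# Group per-chunk gaps by style to see where the LoRA works hardest.
rng = np.random.default_rng(3)
styles = np.repeat(["hollywood", "dad_energy", "legal", "whimsical"], 50)
err_same = rng.gamma(shape=2.0, scale=0.5, size=styles.size)
err_cross = err_same * rng.uniform(5.0, 16.0, size=styles.size)

for style in np.unique(styles):
    mask = styles == style
    gap = np.median(err_cross[mask]) / np.median(err_same[mask])
    print(f"{style:>10}: {gap:.1f}x median")
```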
Arshavir Blackwell, PhD
The ad copy LoRA also recruits far more dedicated features. Marcus had 15 LoRA-specific features at Layer 22, about 5% of all active features. Ad copy has 81, about 12%. When the task is more complex, 12 styles instead of one, the model builds more specialized machinery.
Arshavir Blackwell, PhD
Now, the clustering. We ran co-activation clustering on the ad copy features at Layer 22, where the cross-reconstruction gap was strongest, using the same method as the Marcus study. And this is where the new domain reveals something we couldn't have seen before.
Arshavir Blackwell, PhD
The shared features don't organize by individual style. They organize by what I'm calling register families. In linguistics, a register is the variety of language you use in a particular context: formal versus casual, technical versus conversational, warm versus detached. A register family is a group of styles that share an underlying register despite differing in surface content.
Arshavir Blackwell, PhD
The largest cluster, 134 features, groups Southern, Americana, Hollywood, Dad Energy, and Wellness. These are the warm, narrative, voice-driven styles. The second-largest, 107 features, groups Highend, Legal, Populuxe, Fitnessbro, and Hollywood. The formal and structured styles. A third cluster groups Whimsical, Generic Ad Copy, Southern, and Dad Energy. The playful and casual registers.
Arshavir Blackwell, PhD
The key insight: the LoRA doesn't maintain 12 separate toolkits for 12 styles. It maintains a smaller number of register-level toolkits and combines them. Southern copy draws from the warm cluster and the playful cluster. Hollywood draws from the warm cluster and the formal cluster. It's a combinatorial strategy: the model blends register-level resources to produce individual styles rather than building a dedicated circuit for each one.
Arshavir Blackwell, PhD
One pairing was completely unexpected. At both k equals 5 and k equals 12, Tech Startup copy and Whimsical Children's copy consistently cluster together. SaaS manifestos and toy advertisements. They seem unrelated until you look at structure. "We built what we wished existed" and "Maybe you'll find faraway lands" share short declarative sentences, list-like structure, and direct address. The model groups by form, not meaning.
Arshavir Blackwell, PhD
Now for the ad-specific features, the 81 features the LoRA built from scratch. These cluster dramatically better than the shared features. The silhouette score, which measures how cleanly separated the clusters are, is 0.439 for ad-specific versus 0.061 for shared. That's seven times higher. And this comparison is within the ad copy analysis: same corpus, same layer, same method. The only thing that changes is whether the features were borrowed from the base model or created during fine-tuning. Marcus's shared-feature silhouette of 0.119 falls between the two.
Arshavir Blackwell, PhD
Why the gap? Shared features were originally trained to represent general English. The LoRA is repurposing them for ad copy, so their co-activation patterns are noisy. A feature that encodes formal sentence structure might fire on Legal Boilerplate and Highend and Populuxe, because all three use formal syntax. That overlap blurs the cluster boundaries. Ad-specific features were created during fine-tuning specifically to handle what the base model couldn't already do. They were born specialized, so they cluster cleanly.
Arshavir Blackwell, PhD
The poster child: one cluster of 8 ad-specific features maps Whimsical copy at 100 out of 100 chunks. Perfect recall. It's not perfectly precise: Southern at 7 chunks and Tech Startup at 6 also activate the cluster, so precision is about 88%. But for an unsupervised method with no style labels, that's striking. The LoRA built a near-dedicated circuit specifically for children's advertising.
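The precision and recall follow directly from the chunk counts quoted above; a quick arithmetic check:

```python
# Counts from the transcript: the cluster fires on all 100 Whimsical
# chunks, plus 7 Southern and 6 Tech Startup chunks.
whimsical_hits, whimsical_total = 100, 100
off_style_hits = 7 + 6

recall = whimsical_hits / whimsical_total                       # 1.00
precision = whimsical_hits / (whimsical_hits + off_style_hits)  # ~0.885
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
```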
Arshavir Blackwell, PhD
The tight-circuit versus loose-coalition pattern replicates as well, with a new structural finding. In the shared features, it's the same ratio as Marcus: one tight cluster out of five. But in the ad-specific features, the ratio flips. Three of five clusters are tight. When the LoRA creates its own features, it tends to create compact, low-dimensional circuits. When it borrows features from the base model, it assembles them into high-dimensional coalitions. Feature origin predicts encoding strategy.
Arshavir Blackwell, PhD
And yes, the ad-specific features produce their own inert cluster. Eleven features, PC1 of 97%, mean delta of negative 0.007. Tight, confident, causally dead. The shared-feature tight cluster is also near-inert, seven features, mean delta of negative 0.002. So representation without function now appears in Marcus at Layer 22, Marcus at Layer 16, and in both the shared and purpose-built features of the ad copy LoRA. Whatever gradient descent is doing when it creates these clusters, it does it reliably across corpora, layers, and feature types.
Arshavir Blackwell, PhD
Five takeaways.
Arshavir Blackwell, PhD
One: the right unit of analysis is the group, and that finding generalizes. It now holds across two corpora and two layers, on tasks ranging from ancient philosophy to children's advertising.
Arshavir Blackwell, PhD
Two: different tasks produce different depth profiles. Marcus distributes evenly between layers. Ad copy concentrates in the late layer. The depth distribution of LoRA adaptation varies with the task.
Arshavir Blackwell, PhD
Three: the LoRA adapts its strategy to the task's complexity. Marcus is a single philosophical voice, and the LoRA handled it mostly by redirecting features the base model already had: only about 5% of active features were LoRA-specific. Ad copy is 12 styles at once, and the LoRA built far more of its own machinery: over 12% of active features are LoRA-specific, more than double the proportion. Those purpose-built features also cluster seven times more cleanly than the borrowed ones, because they were specialized from the start. The adaptation strategy scales with how much the target task diverges from what the base model already knows.
Arshavir Blackwell, PhD
Four: the model organizes by form, not content. The LoRA builds toolkits per register and combines them. Tech Startup copy and Children's copy share a cluster because they share syntactic structure, not meaning. The internal organization of stylistic adaptation is fundamentally structural, not semantic.
Arshavir Blackwell, PhD
Five: representation without function is universal. Every corpus, at every layer tested, in both shared and purpose-built features, produces at least one cluster that fires confidently and contributes nothing causally. This is too consistent to be accidental.
Arshavir Blackwell, PhD
The most pressing open question: does this structure hold for a completely different text within the same philosophical tradition? We've shown it replicates across layers and across domains. The within-tradition comparison, same type of content, different author, different voice, is the test we're running next.
Arshavir Blackwell, PhD
If you want the full numbers, all the charts, and the per-style breakdowns, the complete write-up is on our Substack.
Arshavir Blackwell, PhD
This work was a collaboration with John Holman. I'm Arshavir Blackwell and this has been Inside the Black Box.
