Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Fine-Tuning LoRA: It's Not What You Think

When you fine-tune an AI model, what changes inside doesn't predict what changes outside. This week on Inside the Black Box, I break down why — and what it means for anyone auditing or regulating these systems.


Chapter 1

Imported Transcript

Arshavir Blackwell, PhD

Here's something most people assume about fine-tuning AI models: if you change what's happening inside the model, you change what it does. And if you change what it does, something must have changed inside. Internal change and output change — they go together. It's intuitive. It's also wrong.

Arshavir Blackwell, PhD

I'm Arshavir Blackwell, PhD, and this is Inside the Black Box. Today I'm going to walk you through a finding from our latest research — work I've been doing with John Holman — that has direct implications for anyone auditing, deploying, or regulating fine-tuned language models.

Arshavir Blackwell, PhD

The short version: what a model learns inside does not predict what it does outside. And at the scales where models are actually deployed, these two things can move in opposite directions.

Arshavir Blackwell, PhD

Let me show you what I mean.

Arshavir Blackwell, PhD

We ran twelve experiments. We took Llama models at three scales — 3 billion, 8 billion, and 70 billion parameters — and fine-tuned each one on four very different text corpora using LoRA adapters. Same configuration every time: rank 16, attention-only, 50 training steps. The only variable was the training data.
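
The design described above can be written down as a small grid. This is only a sketch of the bookkeeping: the corpus identifiers and config key names are illustrative, and the 4096 hidden size used in the parameter count is an assumption, not a figure from the episode.

```python
from itertools import product

# Scales and corpora as stated in the episode; the identifiers are illustrative.
SCALES = ["3B", "8B", "70B"]
CORPORA = ["marcus_meditations", "seneca_letters", "dtc_ad_copy", "alpaca"]

# Shared adapter settings from the episode; the key names are hypothetical.
LORA_CONFIG = {"rank": 16, "target": "attention-only", "train_steps": 50}


def build_experiments():
    """Return one run description per (scale, corpus) pair."""
    return [
        {"scale": scale, "corpus": corpus, **LORA_CONFIG}
        for scale, corpus in product(SCALES, CORPORA)
    ]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters a rank-r adapter adds to one d_out x d_in matrix.

    LoRA learns B (d_out x r) and A (r x d_in), so the count is r * (d_in + d_out).
    """
    return rank * (d_in + d_out)


experiments = build_experiments()
print(len(experiments))  # 12
# 4096 is an illustrative hidden size, not a number from the episode.
print(lora_param_count(4096, 4096, LORA_CONFIG["rank"]))  # 131072
```

Because the adapter settings are identical in every entry, any difference between runs at a given scale can only come from the corpus.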

Arshavir Blackwell, PhD

The four corpora were chosen to be as different from each other as possible. Marcus Aurelius' *Meditations* — dense philosophical prose. Seneca's *Letters* — similar domain, different register, more epistolary. DTC ad copy — commercial product descriptions, the kind of language you see selling sneakers and skincare. And Alpaca — Stanford's instruction-following dataset, which teaches the model to respond in a structured question-and-answer format.

Arshavir Blackwell, PhD

For each of the twelve resulting models, we measured two things completely independently. First: how much did the model's internal representations change? We used crosscoder feature analysis — sparse autoencoders trained on paired activations from the base and fine-tuned models — to decompose MLP-layer changes into individual features. Second: how much did the model's *outputs* change? We measured KL divergence on next-token predictions — basically, how differently does the fine-tuned model predict the next word compared to the base model?
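
The output-side measurement can be sketched in a few lines. The episode does not say which direction of KL divergence was used or how positions were aggregated, so this version assumes KL(base || fine-tuned) in nats, averaged over positions, and it also computes the top-token agreement figure quoted later. The logits are toy values.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def compare_models(base_logits, tuned_logits):
    """Mean KL(base || tuned) and top-token agreement across positions."""
    kls, agree = [], 0
    for b, t in zip(base_logits, tuned_logits):
        p, q = softmax(b), softmax(t)
        kls.append(kl_divergence(p, q))
        agree += int(p.index(max(p)) == q.index(max(q)))
    n = len(kls)
    return sum(kls) / n, agree / n


# Toy next-token logits for two positions over a three-token vocabulary.
base = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
tuned = [[2.0, 0.5, 0.1], [1.8, 0.4, 0.3]]
mean_kl, agreement = compare_models(base, tuned)
print(agreement)  # 0.5: the models agree on the first position only
```

A low mean KL with high top-token agreement is what an output-only evaluation would read as "essentially unchanged".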

Arshavir Blackwell, PhD

Then we plotted one against the other. Internal change on the x-axis. Output change on the y-axis. Twelve points, one per experiment.

Arshavir Blackwell, PhD

If internal change predicted output change, you'd see a diagonal — points climbing from lower-left to upper-right. More change inside, more change outside. A clean, boring, expected correlation.

Arshavir Blackwell, PhD

That is not what we see.

Arshavir Blackwell, PhD

At 3B and 8B, everything clusters in the lower-left. Modest changes on both axes, not much separation between the corpora. Nothing interesting yet.

Arshavir Blackwell, PhD

At 70B, the picture explodes. The four corpora scatter into three completely distinct regimes. And within just those four 70B data points, the Spearman rank correlation between internal change and output change is negative 0.80. The corpus with the *most* internal change has the *least* output change. And vice versa.

Arshavir Blackwell, PhD

Let me walk through each one.

Arshavir Blackwell, PhD

DTC ad copy produces the most extreme internal change of any experiment. 340 times — that's the peak ratio of cross-reconstruction error to self-reconstruction error. The model is building massive new internal machinery for commercial language. Entirely new representational geometry that didn't exist in the base model.
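
To make the "peak ratio" concrete, here is a rough sketch. The episode does not define how the two reconstructions are produced or what the peak is taken over, so the code treats them simply as two candidate reconstructions of the same activation vector and assumes the peak is taken over measurement sites such as layers. All names and numbers are illustrative.

```python
def mse(xs, ys):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def reconstruction_ratio(acts, self_recon, cross_recon):
    """Cross-reconstruction error divided by self-reconstruction error."""
    return mse(acts, cross_recon) / mse(acts, self_recon)


def peak_ratio(sites):
    """Largest ratio across sites, plus the index of the site where it occurs."""
    ratios = [reconstruction_ratio(*site) for site in sites]
    peak = max(ratios)
    return peak, ratios.index(peak)


# Each site: (activations, self-reconstruction, cross-reconstruction). Toy values.
sites = [
    ([1.0, 2.0], [1.0, 2.5], [1.5, 2.5]),
    ([1.0, 2.0], [1.0, 2.5], [2.0, 3.0]),
]
print(peak_ratio(sites))  # (8.0, 1)
```

A ratio near 1 would mean the cross-reconstruction is about as good as the self-reconstruction; a ratio in the hundreds means it is far worse.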

Arshavir Blackwell, PhD

But here's the thing: the outputs barely move. KL divergence of 0.478. Top-token agreement with the base model is still 79 percent. If you ran a standard output evaluation — which is what most auditing frameworks do today — you would classify this model as essentially unchanged.

Arshavir Blackwell, PhD

An output-only audit misses the most internally transformed model in the entire study. That's the danger of latent divergence: the restructuring is real, it's massive, but it doesn't propagate to outputs in a way that surface-level testing can detect.

Arshavir Blackwell, PhD

Alpaca does the exact opposite. It has the *lowest* internal change at 70B — 39.8 times, modest by 70B standards — yet it produces the *highest* output divergence of any experiment. KL of 1.41. Only 41 percent top-token agreement with the base model. The outputs have fundamentally shifted.

Arshavir Blackwell, PhD

So what's happening? The adapter isn't building new features. A 70-billion parameter model already knows how to follow instructions — that capability is already encoded in the weights. What the adapter does is *reweight* existing circuitry. It redirects features the model already has toward instruction-following formatting. It changes what the model *does* without changing what it *knows*.

Arshavir Blackwell, PhD

We confirmed this with CKA analysis — Centered Kernel Alignment, a completely different measurement tool that doesn't use sparse autoencoders at all. CKA shows Alpaca has the highest late-layer activation displacement, concentrated in the final 30 percent of layers. That's exactly the output-generation zone. The adapter is reaching in at the end and steering existing representations toward a new behavioral target.
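
For reference, linear CKA between two activation matrices (rows are inputs, columns are units) can be computed as below. This is a minimal pure-Python sketch, not the analysis code behind the episode. Treating "displacement" as 1 minus CKA, and averaging it over the final 30 percent of layers, are assumptions made here for illustration.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of X^T Y for matrices with the same row count."""
    total = 0.0
    for xc in zip(*x):
        for yc in zip(*y):
            total += sum(a * b for a, b in zip(xc, yc)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(xc, yc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den


def late_layer_displacement(base_layers, tuned_layers, late_fraction=0.3):
    """Mean (1 - CKA) over the final fraction of layers (assumed definition)."""
    n = len(base_layers)
    start = n - max(1, round(n * late_fraction))
    pairs = zip(base_layers[start:], tuned_layers[start:])
    scores = [1.0 - linear_cka(b, t) for b, t in pairs]
    return sum(scores) / len(scores)


# Toy activations: 4 inputs, 2 units.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
rotated = [[-b, a] for a, b in x]  # 90-degree rotation of the same geometry
shifted = [[1.0, 0.0], [-1.0, 2.0], [1.0, -1.0], [-1.0, 0.5]]
print(linear_cka(x, rotated))  # close to 1.0: CKA ignores rotations
print(linear_cka(x, shifted))  # below 1.0: the geometry has changed
```

Because linear CKA compares whole activation geometries rather than individual sparse features, it can register a global shift that a feature-level decomposition scores as small.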

Arshavir Blackwell, PhD

An internal-only audit would classify Alpaca as the least modified model. The outputs tell a completely different story.

Arshavir Blackwell, PhD

Marcus and Seneca both land in the upper right — high internal change, high output change — but they get there differently.

Arshavir Blackwell, PhD

Marcus builds internal change steadily across scales. 12.5 times at 3B. 29.3 times at 8B. 77.1 times at 70B. Meanwhile, its outputs barely move through 3B and 8B. The model is quietly accumulating internal restructuring — building representational geometry for Stoic philosophy — and then at 70B, the outputs finally catch up. KL jumps to 0.72. The internal change was there all along; the behavioral manifestation was delayed.

Arshavir Blackwell, PhD

Seneca couples earlier. By 8B it's already showing substantial output divergence, with a KL of 0.49. At 70B, both metrics are high: 99.5 times internal change, KL 1.24 on outputs. The Marcus and Seneca adapters genuinely *construct* new representational geometry (novel features, new internal circuitry), and that construction eventually shows up in what the model generates.

Arshavir Blackwell, PhD

The decorrelation isn't noise. It reflects two fundamentally different adaptation mechanisms.

Arshavir Blackwell, PhD

When a model encounters a domain it lacks the internal machinery for — the stylistic texture of Stoic philosophy, the register of commercial copywriting — the LoRA adapter has to *construct* new features. Novel representational geometry that didn't exist in the base model. This registers as high internal change. Whether it shows up in outputs depends on whether those new features get used in generation, or remain latent internal structure.

Arshavir Blackwell, PhD

When a model already has the right machinery — and a 70-billion parameter model has encountered plenty of instruction-following data in pretraining — the adapter just *reweights* existing circuitry. It doesn't need to build anything new. It redirects what's already there. This produces modest internal change but can produce large output shifts.

Arshavir Blackwell, PhD

Construction versus reweighting. Two different mechanisms, and each one is invisible to the wrong measurement axis.

Arshavir Blackwell, PhD

We confirmed this causally, not just observationally. When we clamped the crosscoder-identified features back toward base-model values in Alpaca 70B, it partially reversed the output divergence. When we ran matched controls — random same-cardinality features, magnitude-matched Gaussian noise — nothing happened. The features the crosscoder identifies are carrying real behavioral load.
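
The logic of that intervention and its controls can be sketched on plain feature vectors. This is an illustration of the idea only: the function names, the interpolation `strength` parameter, and the choice to draw control features disjoint from the targets are all assumptions, and a real experiment would apply the edit inside the model's forward pass and then re-measure output divergence.

```python
import random


def clamp_features(tuned, base, indices, strength=1.0):
    """Move the chosen feature activations from tuned values toward base values.

    strength=1.0 restores the base value; 0.0 leaves the tuned value unchanged.
    """
    out = list(tuned)
    for i in indices:
        out[i] = tuned[i] + strength * (base[i] - tuned[i])
    return out


def random_control_indices(n_features, target_indices, seed=0):
    """Random feature set of the same size, drawn from non-target features."""
    rng = random.Random(seed)
    targets = set(target_indices)
    pool = [i for i in range(n_features) if i not in targets]
    return sorted(rng.sample(pool, len(target_indices)))


def noise_control(tuned, base, indices, seed=0):
    """Gaussian perturbation whose scale matches the clamp's RMS shift."""
    rng = random.Random(seed)
    rms = (sum((base[i] - tuned[i]) ** 2 for i in indices) / len(indices)) ** 0.5
    out = list(tuned)
    for i in indices:
        out[i] = tuned[i] + rng.gauss(0.0, rms)
    return out


# Toy feature activations; indices 1 and 3 play the role of identified features.
base = [0.0, 1.0, 0.5, 2.0, 0.0, 0.3]
tuned = [0.0, 3.0, 0.5, 0.0, 0.0, 0.3]
targets = [1, 3]
print(clamp_features(tuned, base, targets))  # [0.0, 1.0, 0.5, 2.0, 0.0, 0.3]
print(clamp_features(tuned, base, targets, strength=0.5))  # [0.0, 2.0, 0.5, 1.0, 0.0, 0.3]
```

The comparison of interest is whether clamping the identified features reverses more of the output shift than either control of the same size and magnitude.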

Arshavir Blackwell, PhD

Here's a prediction that sounds right but is wrong: models should build the most new features for text they find most surprising. High base-model perplexity — bad token predictions — should mean the model needs more new machinery.

Arshavir Blackwell, PhD

At 70B, it's inverted. Marcus has the *lowest* base-model perplexity — 4.4, the model predicts Stoic philosophy tokens quite well — yet triggers the *highest* novel-feature surge. DTC has the highest perplexity — 14.5 — but doesn't produce a proportionate surge. The correlation between perplexity and novel features is negative 0.80.

Arshavir Blackwell, PhD

This tells you something important: perplexity measures whether the model can predict the next token. Feature analysis measures whether the model has the right internal *structure* to represent the domain. A model can be good at predicting tokens while still lacking the specialized representational geometry that fine-tuning carves. These are genuinely different questions.
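
As a reminder of what the perplexity numbers mean: perplexity is the exponential of the average negative log-likelihood per token, so its reciprocal is the geometric-mean probability the model assigns to each correct token. A small sketch, assuming natural-log token probabilities as input:

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every correct token probability 0.25 has perplexity 4.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))  # 4.0

# Reading the episode's figures as geometric-mean token probabilities:
print(round(1 / 4.4, 3))   # 0.227 for Marcus
print(round(1 / 14.5, 3))  # 0.069 for DTC
```

Nothing in this calculation looks at internal representations, which is why it can disagree with the feature analysis.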

Arshavir Blackwell, PhD

And I want to emphasize: this decorrelation only shows up at scale. At 3B and 8B, everything looks correlated. If you compute the Spearman correlation across all twelve experiments, you get positive 0.71 — which looks like internal change and output change go together. But that's a scale confound. Larger models produce larger values on both axes simply because they have more capacity.

Arshavir Blackwell, PhD

Hold scale constant at 70B, and the correlation flips to negative 0.80. The regimes separate. The mechanisms diverge. This is not a smooth extrapolation from smaller models — it's a qualitative phase transition in how fine-tuning operates.
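
The negative 0.80 can be checked by hand from the four 70B figures quoted in this episode. The sketch below uses the standard no-ties Spearman formula, 1 - 6 * sum(d^2) / (n * (n^2 - 1)); the dictionary layout is just a convenient way to hold the quoted numbers.

```python
def ranks(values):
    """1-based ranks; assumes no ties, which holds for the figures below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# 70B figures quoted in the episode: (internal change ratio, output KL).
points_70b = {
    "DTC": (340.0, 0.478),
    "Alpaca": (39.8, 1.41),
    "Marcus": (77.1, 0.72),
    "Seneca": (99.5, 1.24),
}
internal = [v[0] for v in points_70b.values()]
output_kl = [v[1] for v in points_70b.values()]
print(round(spearman(internal, output_kl), 2))  # -0.8
```

Marcus and Seneca hold the same rank on both axes, so the whole negative correlation comes from DTC and Alpaca swapping the top and bottom positions. With only four points, it is a rank pattern rather than a precise estimate.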

Arshavir Blackwell, PhD

Any auditing methodology validated only at small scale will miss it.

Arshavir Blackwell, PhD

The practical consequence is direct: single-axis auditing fails.

Arshavir Blackwell, PhD

An output-only evaluation — which is standard practice today — correctly characterizes Alpaca and Seneca, but misclassifies DTC as unchanged. An internal-only evaluation correctly identifies DTC's massive restructuring, but misclassifies Alpaca as minimally modified. Neither axis alone captures the full picture.
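
One way to picture a two-axis audit is a simple quadrant rule. The cut-offs below (50 for the internal ratio, 0.6 for KL) are hypothetical values chosen only so that the four 70B figures from this episode fall into the groups described above; they are not thresholds proposed in the research, and the label "coupled construction" is shorthand introduced here.

```python
def classify(internal_ratio, output_kl, internal_cut=50.0, kl_cut=0.6):
    """Assign a regime label from both axes; the cut-offs are illustrative only."""
    high_internal = internal_ratio >= internal_cut
    high_output = output_kl >= kl_cut
    if high_internal and high_output:
        return "coupled construction"
    if high_internal:
        return "latent divergence"
    if high_output:
        return "reweighting"
    return "minimal change"


# 70B figures quoted in the episode: (internal change ratio, output KL).
points_70b = {
    "DTC": (340.0, 0.478),
    "Alpaca": (39.8, 1.41),
    "Marcus": (77.1, 0.72),
    "Seneca": (99.5, 1.24),
}
for name, (internal, kl) in points_70b.items():
    print(name, classify(internal, kl))
# DTC latent divergence
# Alpaca reweighting
# Marcus coupled construction
# Seneca coupled construction
```

Dropping either argument from the rule collapses two of these labels into one, which is the single-axis failure in miniature.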

Arshavir Blackwell, PhD

This matters for regulatory frameworks. The EU AI Act, Article 25, shifts governance obligations when a deployer "substantially modifies" a vendor model. What constitutes substantial modification? Our evidence says you can't answer that question with outputs alone, or with internals alone. A model that rewrites its internal feature geometry while maintaining stable outputs — that's substantial modification, invisible to output testing. A model that shifts behavior dramatically through minimal internal changes — that's also substantial modification, invisible to internal analysis.

Arshavir Blackwell, PhD

Our recommendation: auditing at deployment scale requires both measurements. That means internal feature analysis, taken at a minimum of two network depths, *and* output divergence evaluation. The evidence package must include both axes.

Arshavir Blackwell, PhD

One more thing. We verified that the decorrelation isn't an artifact of the crosscoder methodology by running a completely independent measurement — linear CKA, Centered Kernel Alignment — which compares activations without any sparse autoencoder involved. At mid-network layers, CKA agrees with the crosscoder metric. At late layers, they diverge — and the divergence is informative. CKA sees Alpaca as having the highest late-layer displacement. The crosscoder sees it as having the lowest sparse feature change. Both are right. Alpaca moves activations globally without restructuring the sparse feature geometry. Two different tools, same mechanistic story.

Arshavir Blackwell, PhD

So here's the takeaway. If you're auditing fine-tuned models, evaluating them for deployment, or building governance frameworks around them — the internal–output decorrelation means you need to look at both dimensions. What a model learns inside does not predict what it does outside. The mechanism of adaptation varies by corpus. And each mechanism is invisible to the wrong measurement axis.

Arshavir Blackwell, PhD

Responsible deployment requires looking at both.

Arshavir Blackwell, PhD

This is Arshavir Blackwell, PhD, for Inside the Black Box. Thanks for listening.