Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education


Can We Fix It?

Arshavir Blackwell takes you on a journey inside the black box of large language models, showing how cutting-edge methods help researchers identify, understand, and even fix the inner quirks of AI. Through concrete case studies, he demonstrates how interpretability is evolving from an arcane art to a collaborative science—while revealing the daunting puzzles that remain. This episode unpacks the step-by-step workflow and surprising realities of mechanistically mapping model cognition.



Chapter 1

Imported Transcript

Arshavir Blackwell, PhD

Welcome to Inside the Black Box, where we peer into the hidden machinery of modern AI and try to understand how meaning and reasoning emerge from math. I’m Arshavir Blackwell. Today, we’re taking things in a slightly different direction. Up to now, we’ve been looking inside large language models— mapping their circuits, tracing attention heads, and asking what really goes on inside. But there’s a tougher question underneath all that curiosity: Once you can see what’s happening under the hood, can you actually fix it? Or are we just standing by, watching strange behaviors march past? Today’s episode is about that next step— debugging a model from the inside out.

Arshavir Blackwell, PhD

This is where interpretability turns into something closer to engineering. It’s not just about observing what the model thinks. It’s about intervening when it stumbles—and learning how to set it right.

Arshavir Blackwell, PhD

Take a few quick examples. A model that takes everything too literally: You say, “Lend me a hand,” and it starts imagining limb-sharing. Say “kick the bucket,” and it confidently explains a leg exercise routine. It’s funny—until it happens in medicine or law. These aren’t random slips. They reveal how the model organizes meaning. With idioms, internal detectors light up for kick and bucket separately, but nothing binds them into “to die.” That’s weak multi-token binding. Sparse autoencoders or induction head analysis can pinpoint where that connection fails.
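To make that concrete, here is a minimal sketch of the kind of check a researcher might run: encode residual-stream activations through a sparse autoencoder and look for features that fire on the idiomatic use of “kick the bucket” but not the literal one. The SAE weights and the activation-fetching helper below are placeholders standing in for a real trained autoencoder and a real model run, not an established implementation.

```python
import torch

D_MODEL, D_SAE = 768, 16384  # assumed dimensions for a GPT-2-scale model

# A trained sparse autoencoder would supply these; random values stand in here.
torch.manual_seed(0)
W_enc = torch.randn(D_MODEL, D_SAE) * 0.02
b_enc = torch.zeros(D_SAE)

def sae_features(resid: torch.Tensor) -> torch.Tensor:
    """Encode a residual-stream vector into sparse, non-negative feature activations."""
    return torch.relu(resid @ W_enc + b_enc)

def get_residual_activation(prompt: str) -> torch.Tensor:
    """Placeholder: in practice, run the model and read the residual stream
    at the last token of the idiom span; random values stand in here."""
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(D_MODEL)

literal = get_residual_activation("He kicked the bucket of water across the yard.")
idiom   = get_residual_activation("Sadly, the old farmer kicked the bucket last night.")

lit_feats, idiom_feats = sae_features(literal), sae_features(idiom)

# Features active on the idiomatic use but silent on the literal one are
# candidates for a "figurative / to die" feature; if the two activation
# patterns are nearly identical, nothing is binding the idiom as a unit.
idiom_only = ((idiom_feats > 1.0) & (lit_feats < 0.1)).nonzero().squeeze(-1)
print("Candidate idiom-binding features:", idiom_only.tolist())
```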

Arshavir Blackwell, PhD

Metaphors show the same weakness. “The stock market caught a cold.” Instead of mapping that to “the market’s weak,” some models report, “Influenza detected among traders.” The bridge between health and economics—gone. And the classics: “Can you pass the salt?” “Yes, I can pass the salt.” Syntax understood; social intent missed. Or: “The trophy doesn’t fit into the suitcase because it’s too small.” Flip small to big, and many models still pick the wrong referent. Once you start looking, these misfires are everywhere. But they’re not noise—they’re clues. They show us how the internal machinery is wired, and where it breaks down.
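Eliciting that last failure is easy to automate. The sketch below builds the minimal pair and compares the model’s evidence for “trophy” versus “suitcase” after a simple probe; the choice of GPT-2, the probe wording, and the first-subtoken scoring are illustrative assumptions rather than a fixed benchmark.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def referent_scores(sentence: str) -> dict:
    """Score 'trophy' vs. 'suitcase' as the continuation of a referent probe."""
    prompt = sentence + " Here, 'it' refers to the"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # First-subtoken approximation: score each candidate by its first BPE piece.
    return {w: logits[tok.encode(" " + w)[0]].item() for w in ("trophy", "suitcase")}

minimal_pair = [
    "The trophy doesn't fit into the suitcase because it is too small.",  # -> suitcase
    "The trophy doesn't fit into the suitcase because it is too big.",    # -> trophy
]
for sentence in minimal_pair:
    print(sentence)
    print("  ", referent_scores(sentence))
# If the preferred referent does not flip between the two sentences,
# the model is resolving the pronoun without using the adjective at all.
```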

Arshavir Blackwell, PhD

So how do you debug a model? That’s where mechanistic interpretability shows its power. Step one: elicit the failure. Collect examples. Find the pattern. Step two: locate the relevant features. Sparse autoencoders act like microscopes for activations, revealing clusters that light up on idioms or metaphors. Then comes activation patching: transplant activations from successful runs into failed ones and see what changes. Tools like ACDC, short for Automatic Circuit DisCovery, automate that search across the network. If swapping in activations from layer 8, head 3 makes the model interpret “kick the bucket” correctly, you’ve likely found a circuit for figurative language.
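Here is what the activation-patching step can look like in code, using the TransformerLens library: run a prompt where the model behaves well and one where it fails, copy a single head’s output from the good run into the failing run at the final position, and see whether the prediction moves. The layer and head indices, the prompts, and the “figurative reading” metric are all illustrative assumptions for the sketch.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean  = model.to_tokens("The old sailor passed away, which means that he")
broken = model.to_tokens("The old sailor kicked the bucket, which means that he")

_, clean_cache = model.run_with_cache(clean)

LAYER, HEAD = 8, 3  # illustrative circuit location, not an established result

def patch_head(z, hook):
    # z: [batch, position, head, d_head]. Copy one head's output from the
    # clean run, at the final position only, so prompt lengths need not match.
    z[:, -1, HEAD, :] = clean_cache[hook.name][:, -1, HEAD, :]
    return z

died = model.to_single_token(" died")  # crude proxy for the figurative reading

base_logits    = model(broken)[0, -1]
patched_logits = model.run_with_hooks(
    broken, fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head)]
)[0, -1]

print("logit(' died') before patch:", round(base_logits[died].item(), 3))
print("logit(' died') after patch: ", round(patched_logits[died].item(), 3))
# A clear increase after patching suggests this head carries information the
# failing run was missing; a null result sends you back to the search.
```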

Arshavir Blackwell, PhD

But correlation isn’t causation. That’s where Redwood Research’s Causal Scrubbing comes in. You state a hypothesis about what the circuit is doing, then replace its activations with ones the hypothesis says shouldn’t matter and check that the behavior survives; disturb the parts it says are essential and check that the behavior breaks, while everything else stays stable. If both hold, you’ve found a cause, not a coincidence. Once you know the culprit, you can edit directly: targeted model-editing techniques adjust specific weights, repairing a bad association without retraining the entire model.
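A rough sketch of that causal check, again in TransformerLens: silence the candidate head and confirm the figurative reading collapses while an unrelated control stays put. Causal scrubbing proper resamples activations under an explicit hypothesis rather than zeroing them, so treat this as a simplified stand-in; the layer and head indices are again illustrative.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 8, 3  # the same illustrative circuit location as before

def ablate_head(z, hook):
    # z: [batch, position, head, d_head]; silence one head's output entirely.
    z[:, :, HEAD, :] = 0.0
    return z

def answer_logit(prompt: str, answer: str, ablate: bool) -> float:
    hooks = [(utils.get_act_name("z", LAYER), ablate_head)] if ablate else []
    logits = model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=hooks)
    return logits[0, -1, model.to_single_token(answer)].item()

cases = [
    ("The old sailor kicked the bucket, which means that he", " died"),  # target behavior
    ("The capital of France is", " Paris"),                              # unrelated control
]
for prompt, answer in cases:
    before = answer_logit(prompt, answer, ablate=False)
    after  = answer_logit(prompt, answer, ablate=True)
    print(f"{answer!r}: {before:.2f} -> {after:.2f}")
# If the head really carries the figurative reading, the first score should
# drop sharply under ablation while the control barely moves.
```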

Arshavir Blackwell, PhD

Of course, real debugging isn’t tidy. Circuits overlap. Causal effects blur. Fixes that work in one domain collapse in another. Interpretability takes rigor, but it also takes intuition. And not every odd output is an error. Sometimes it’s a sign of creativity. When a model rephrases a garden-path sentence to make it clearer, that’s not failure—it’s linguistic insight.

Arshavir Blackwell, PhD

Here’s where it gets fascinating. A transformer isn’t one big mind. It’s a network of subnetworks. Each subnetwork contributes its own computation— syntax parsing, factual recall, metaphor resolution, pronoun tracking. They’re distinct but interconnected, like specialized teams in a vast organization. Each can be studied, visualized, even conceptualized on its own terms. Seen this way, a transformer isn’t a single black box— it’s a federation of smaller intelligences working together.

Arshavir Blackwell, PhD

Anthropic’s Transformer Circuits project made this vividly clear. Grammar heads, recall heads, reasoning heads— they form coherent internal communities. The model becomes less of a monolith and more of an ecosystem of cooperating modules.

Arshavir Blackwell, PhD

And the ecosystem studying those modules is expanding fast. Sparse autoencoders now extract stable, interpretable features. Neuronpedia lets researchers annotate and share discoveries, a public atlas of model internals, while SAEBench offers standard ways to evaluate the autoencoders themselves. DeepMind’s Gemma Scope puts billion-parameter interpretability on a laptop. Debugging is no longer the privilege of a few elite labs. It’s becoming a discipline open to anyone with curiosity and persistence.

Arshavir Blackwell, PhD

Still, the big puzzles remain. Superposition and drift—features overlap and shift during training. Translation—each model speaks its own internal dialect. And large-scale editing—how to tweak trillion-parameter systems without setting off unpredictable ripples elsewhere. These issues persist because neural networks weren’t designed to be legible. We’re reverse-engineering emergent intelligence— a structure that evolved its own logic. Until architectures are transparent by design, that sense of digital archaeology isn’t going away.

Arshavir Blackwell, PhD

But that’s exactly what makes the work exciting. Mechanistic interpretability is where curiosity becomes control— where we move from “let’s see what happens” to “let’s make it happen.” And beyond the lab, this matters for everyone. The more transparent these systems become, the more we can trust them— and the more they’ll teach us about intelligence itself. Artificial or human, the patterns are starting to rhyme.

Arshavir Blackwell, PhD

If you’re experimenting with any of these techniques, or others, please share your findings. This field advances through collaboration. Every circuit mapped, every subnetwork clarified, brings us one step closer to understanding— and shaping—the minds we’ve built.

Arshavir Blackwell, PhD

Thanks for listening to Inside the Black Box. If you enjoyed this episode, consider subscribing on Substack or your favorite podcast app. You’ll find show notes, research links, and visual diagrams at arshavirblackwell.substack.com. I’m Arshavir Blackwell. See you next time— when we’ll open another corner of the machine mind and see what’s thinking inside.