Episodes (27)
This episode explores fluency-as-validity: the way polished AI responses can make us feel like the work of judgment is already done. It also looks at why large language models are so effective at creating the sensation of clarity, and why mechanistic interpretability may be a way to push back against that enchantment.
716 features fire on both Seneca and Marcus Aurelius but stay dark for ad copy. The model learned Stoic philosophy, not just an author's style. Plus: why 'inert' features aren't all the same thing.
We trained a fresh LoRA on the letters of Seneca and ran the same analysis pipeline we used on Marcus Aurelius and advertising copy. Every structural finding replicated. The model organizes its adaptation into five clusters: one tight, with features moving in lockstep, and four loose, with features co-activating more independently. Seneca produced the cleanest clustering we've measured and the strongest workhorse cluster: a group of 141 features encoding philosophical argumentation with a causal effect more than three times stronger than anything we measured in the Marcus Aurelius adapter. Done in collaboration with John Holman.
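For readers curious what "clustering" means operationally here, the generic recipe is to correlate feature activations across samples and cluster the correlation structure. The sketch below is a minimal, hypothetical version of that idea; the function name, shapes, and the average-linkage choice are ours, not the episode's pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coactivation_clusters(feats: np.ndarray, n_clusters: int = 5):
    """Group features by how strongly they co-activate across samples.
    feats: (n_samples, n_features) matrix of feature activations."""
    corr = np.corrcoef(feats.T)                        # feature-by-feature correlation
    dist = np.clip(1.0 - corr, 0.0, 2.0)               # correlated features -> small distance
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed form that linkage expects
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # one cluster label per feature
```

A "tight" cluster in this picture is one whose members sit at small pairwise distances, i.e. their activations rise and fall together.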
We replicated our Marcus Aurelius findings at a new layer, then threw the whole method at 12 commercial ad copy styles trained into a single LoRA. The patterns held, and the new domain revealed something we couldn't have seen before: the model organizes its adaptations by register family, not by individual style.
We opened the 65%. The features that resisted interpretation one at a time turned out to organize into five co-activation clusters with clear thematic identities and causal effects nearly ten times stronger than any individual feature. Second in a series with John Holman.
This episode of Inside the Black Box: Cracking AI and Deep Learning tells the story of an unexpected convergence in the history of language and AI. In 1995, Peter Bensch noticed that Zelig Harris, a mid‑century structural linguist, and Jeff Elman, a pioneer of simple recurrent networks, had independently uncovered the same deep insight about language: structure lives in patterns of use.
Arshavir Blackwell, PhD, guides listeners through Harris’s world of distributional linguistics and operator grammar—where you infer structure from where words can substitute for one another—and contrasts it with Elman’s tiny recurrent neural networks that learn to predict the next word. Along the way, we see how these very different traditions arrive at the same place: hidden geometric structure in how language is used.
From there, the episode bridges to today’s large language models and mechanistic interpretability, asking a deceptively simple question: what counts as "structure" inside a model? We explore how patterns, clusters, and features relate to genuine internal organization, and why Harris and Elman’s convergence still shapes how we think about circuits, features, and the geometry of meaning in modern AI.
In this episode of Inside the Black Box: Cracking AI and Deep Learning, Arshavir Blackwell, PhD, takes engineers and researchers inside the practical mechanics of LoRA (low-rank adaptation), a fine-tuning method that makes it possible to adapt multi-billion-parameter language models on a single GPU.
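The core trick is compact enough to sketch: freeze the pretrained weight matrix W and learn a low-rank update BA alongside it, so the layer computes Wx + (alpha/r)·BAx. Below is a minimal PyTorch rendering under common LoRA conventions (rank r, alpha scaling, zero-initialized B); a generic illustration, not code from the episode.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    h = W x + (alpha / r) * B A x, with A of shape (r, d_in), B of shape (d_out, r)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Because B starts at zero, the wrapped layer initially computes exactly what the frozen base layer did, and only the r·(d_in + d_out) adapter parameters receive gradients, which is where the single-GPU memory savings come from.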
This episode dives into why judging AI by behavior alone falls short of proving true intelligence. We explore how insights from mechanistic interpretability and cognitive science reveal what’s really happening inside AI models. Join us as we challenge the limits of behavioral tests and rethink what intelligence means for future AI.
Explore how BERT’s attention heads reveal an emergent understanding of language structure without explicit supervision. Discover the role of attention as a form of memory and what it means for the future of AI language models.
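If you want to look at those heads yourself, the Hugging Face transformers library exposes per-head attention maps directly; the model name and example sentence below are just placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, seq, seq)
layer5_head3 = outputs.attentions[5][0, 3]  # attention pattern of a single head
print(layer5_head3.shape)                   # (seq_len, seq_len); each row sums to 1
```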
Dive into how we naturally explain neural networks with folk interpretability and why these simple stories fall short. Discover the journey toward mechanistic interpretability in AI and what that means for how we talk about and trust large language models.
Explore how sparse autoencoders and transcoders unveil the inner workings of GPT-2 by revealing functional features and computational circuits. Discover breakthrough methods that shift from observing raw network activations to mapping the model's actual computation, making AI behavior more interpretable than ever.
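Conceptually, a sparse autoencoder of this kind is small: encode activations into an overcomplete feature basis with a ReLU, decode back, and add an L1 penalty so most features stay silent. A minimal sketch, with hypothetical dimensions and loss weighting:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode model activations into an overcomplete, sparse feature basis."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)  # n_features >> d_model
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))         # sparse feature activations
        recon = self.dec(feats)
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    mse = (recon - acts).pow(2).mean()             # reconstruct the activations
    sparsity = feats.abs().mean()                  # L1 term keeps most features at zero
    return mse + l1_coeff * sparsity
```

A transcoder has the same shape but is trained to predict the *output* of a model component from its input, which is what lets the method map computation rather than just activation.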
Explore how attention heads uncover patterns through learned queries and keys, revealing emergent behaviors shaped by optimization. Dive into parallels with natural selection and psycholinguistics to understand how meaning arises not by design but through experience in both machines and brains.
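The query-and-key mechanics behind those patterns fit in a few lines; this generic scaled dot-product sketch (names and shapes are illustrative) shows how learned queries scored against keys yield the attention pattern:

```python
import torch

def attention_pattern(q, k, v):
    """Scaled dot-product attention: each position's learned query is scored
    against every key, and softmax turns the scores into a mixing pattern
    over the value vectors."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5   # (seq, seq) query-key match scores
    pattern = torch.softmax(scores, dim=-1)     # each row sums to 1
    return pattern @ v, pattern
```

Nothing in this computation dictates *what* a head should match on; the behaviors the episode discusses emerge from how optimization shapes the learned query and key projections.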
Explore how GPT-2 balances fleeting factual recall with generic responses through internal competition among candidate answers. Discover parallels with human cognition, and how larger models use indirect recall to surface knowledge hidden beneath suppression.
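You can watch a crude version of this competition at the output by comparing next-token logits for rival candidates; the prompt and candidate answers below are illustrative, not the episode's experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]   # scores over the whole vocabulary for the next token

# Compare two competing continuations: a specific factual answer vs. a generic one.
for cand in [" Paris", " Europe"]:
    tid = tok.encode(cand)[0]
    print(f"{cand!r}: {logits[tid].item():.2f}")
```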