Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education


Episodes (27)

This episode explores fluency-as-validity: the way polished AI responses can make us feel like the work of judgment is already done. It also looks at why large language models are so effective at creating the sensation of clarity, and why mechanistic interpretability may be a way to push back against that enchantment.

We trained a fresh LoRA on the letters of Seneca and ran the same analysis pipeline we used on Marcus Aurelius and advertising copy. Every structural finding replicated. The model organizes its adaptation into five clusters: one tight (features moving in lockstep) and four loose (features cooperating more independently). Seneca produced the cleanest clustering we've measured and the strongest workhorse cluster, a group of 141 features encoding philosophical argumentation with a causal effect more than three times stronger than anything in Marcus. Done in collaboration with John Holman.


The Pattern Holds

We replicated our Marcus Aurelius findings at a new layer, then threw the whole method at 12 commercial ad copy styles trained into a single LoRA. The patterns held, and the new domain revealed something we couldn't have seen before: the model organizes its adaptations by register family, not by individual style.

We opened the 65%. The features that resisted interpretation one at a time turned out to organize into five co-activation clusters with clear thematic identities and causal effects nearly ten times stronger than any individual feature. Second in a series with John Holman.
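
Co-activation clustering of the kind described here can be sketched with toy data. Everything below (the two latent "drivers," the feature counts, the 0.8 correlation threshold) is an illustrative assumption, not the series' actual pipeline:

```python
import numpy as np

# Toy sketch of co-activation clustering: group features whose activations
# correlate above a threshold. Illustrative data, not the actual analysis.
rng = np.random.default_rng(0)
n_samples = 500

# Two hypothetical latent drivers, each shared by a group of features
driver_a, driver_b = rng.standard_normal((2, n_samples))
feats = np.stack([
    driver_a + 0.1 * rng.standard_normal(n_samples),  # features 0-2 track A
    driver_a + 0.1 * rng.standard_normal(n_samples),
    driver_a + 0.1 * rng.standard_normal(n_samples),
    driver_b + 0.1 * rng.standard_normal(n_samples),  # features 3-4 track B
    driver_b + 0.1 * rng.standard_normal(n_samples),
])

corr = np.corrcoef(feats)  # pairwise feature-feature correlations

# Union-find: merge any two features whose correlation exceeds the threshold
parent = list(range(len(feats)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i in range(len(feats)):
    for j in range(i + 1, len(feats)):
        if corr[i, j] > 0.8:
            parent[find(i)] = find(j)

clusters = {}
for i in range(len(feats)):
    clusters.setdefault(find(i), []).append(i)
print(sorted(map(sorted, clusters.values())))  # → [[0, 1, 2], [3, 4]]
```

The toy recovers the two groups because each group shares a common driver; the real analysis works the same way in spirit, just over learned features instead of planted ones.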

A concise, single-segment episode of Inside the Black Box: Cracking AI and Deep Learning in which Arshavir Blackwell explains, in one continuous narrative, what neural networks are, how their simple units combine into powerful systems, and how learning by backpropagation sculpts their behavior. This short monologue introduces listeners to neural nets without equations or jargon.

This episode of Inside the Black Box: Cracking AI and Deep Learning tells the story of an unexpected convergence in the history of language and AI. In 1995, Peter Bensch noticed that Zelig Harris, a mid‑century structural linguist, and Jeff Elman, a pioneer of simple recurrent networks, had independently uncovered the same deep insight about language: structure lives in patterns of use.

Arshavir Blackwell, PhD, guides listeners through Harris’s world of distributional linguistics and operator grammar—where you infer structure from where words can substitute for one another—and contrasts it with Elman’s tiny recurrent neural networks that learn to predict the next word. Along the way, we see how these very different traditions arrive at the same place: hidden geometric structure in how language is used.

From there, the episode bridges to today’s large language models and mechanistic interpretability, asking a deceptively simple question: what counts as "structure" inside a model? We explore how patterns, clusters, and features relate to genuine internal organization, and why Harris and Elman’s convergence still shapes how we think about circuits, features, and the geometry of meaning in modern AI.

This episode of Inside the Black Box: Cracking AI and Deep Learning explores a new theoretical framework that unifies sparse autoencoders (SAEs), transcoders, and crosscoders — and what it tells us about when mechanistic interpretability actually works. We start by demystifying these tools and how they use sparse features to uncover internal concepts and computations in large language models, from DNA detectors to deception circuits in Claude 3 Sonnet. Then we introduce the linear representation hypothesis and the geometry of concepts as directions in activation space, along with the challenge of superposition when thousands of concepts must fit into limited dimensions. Finally, we dive into Tang et al.’s recovery theorems, the compressed sensing roots of their approach, and why these results matter for using SAEs as a reliable “microscope” on model internals, especially in the context of fine-tuning and LoRA experiments. Along the way, we confront the uncomfortable possibility that the linear picture may break down at frontier scales — and what that would mean for the future of interpretability as a safety strategy.
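
The linear representation hypothesis the episode discusses can be illustrated with a toy model. Everything below is an assumption for illustration (the dimensions, and an idealized matched-filter decoder standing in for a trained SAE encoder); it is not code or data from Tang et al.:

```python
import numpy as np

# Toy sketch of the linear representation hypothesis: concepts live as
# directions in activation space, more concepts than dimensions (superposition).
rng = np.random.default_rng(1)
d, n_concepts = 256, 1000            # 1000 concepts crammed into 256 dims

D = rng.standard_normal((n_concepts, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit "concept directions"

# An activation vector is a sparse combination of a few active concepts
active = rng.choice(n_concepts, size=3, replace=False)
coeffs = rng.uniform(1.0, 2.0, size=3)
activation = coeffs @ D[active]

# Idealized decoder: score every direction, keep the top-k. This stands in
# for the SAE's learned encoder; with d large relative to the number of
# active concepts, the top-scoring directions typically match the true ones.
scores = D @ activation
recovered = np.argsort(scores)[-3:]
print(sorted(active), sorted(recovered))
```

The key point mirrored here is the one the recovery theorems formalize: sparse combinations of near-orthogonal directions remain recoverable even when concepts far outnumber dimensions, which is why SAEs can work at all.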

In this episode of Inside the Black Box: Cracking AI and Deep Learning, Arshavir Blackwell, PhD, takes engineers and researchers inside the practical mechanics of LoRA, a low‑rank adaptation method that makes it possible to fine‑tune multi‑billion‑parameter language models on a single GPU.
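
The low-rank trick at the heart of LoRA fits in a few lines. This is a minimal sketch with made-up dimensions (`d_in`, `d_out`, `r`, `alpha` are all hypothetical values chosen for illustration), not code from the episode:

```python
import numpy as np

# Sketch of LoRA's low-rank update: freeze W, train only the small A and B.
d_out, d_in, r = 64, 64, 4          # frozen weight is d_out x d_in; r << d_in

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init 0
alpha = 8.0                                # scaling hyperparameter

def lora_forward(x):
    # Base path plus the low-rank correction (alpha / r) * B @ A @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out
print(r * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```

The parameter count is the whole point: training scales with `r * (d_in + d_out)` rather than `d_in * d_out`, which is what lets a single GPU hold the optimizer state for a multi-billion-parameter base model.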

This episode dives into why judging AI by behavior alone falls short of proving true intelligence. We explore how insights from mechanistic interpretability and cognitive science reveal what’s really happening inside AI models. Join us as we challenge the limits of behavioral tests and rethink what intelligence means for future AI.

Explore how BERT’s attention heads reveal an emergent understanding of language structure without explicit supervision. Discover the role of attention as a form of memory and what it means for the future of AI language models.

Dive into how we naturally explain neural networks with folk interpretability and why these simple stories fall short. Discover the journey toward mechanistic understandability in AI and what that means for how we talk about and trust large language models.

Explore how sparse autoencoders and transcoders unveil the inner workings of GPT-2 by revealing functional features and computational circuits. Discover breakthrough methods that shift from observing raw network activations to mapping the model's actual computation, making AI behavior more interpretable than ever.

Explore how attention heads uncover patterns through learned queries and keys, revealing emergent behaviors shaped by optimization. Dive into parallels with natural selection and psycholinguistics to understand how meaning arises not by design but through experience in both machines and brains.

Explore how GPT-2 balances fleeting factual recall with generic responses through internal competition among candidate answers. Discover parallels with human cognition and how larger models navigate indirect recall to reveal hidden knowledge beneath suppression.

Dive into the world of neural circuits within large language models. In this episode, Arshavir Blackwell unpacks how transformer circuits, attention mechanisms, and high-dimensional geometry combine to create the magic—and limits—of modern AI language systems.

This episode dives into why advanced language models still generate hallucinations, how interpretability tools help us uncover their hidden workings, and what the seahorse emoji teaches us about model and human reasoning. Arshavir connects groundbreaking research, practical business importance, and the statistical quirks that shape AI's version of 'truth.'

Explore how large language models build up meaning in ways strikingly similar to the layered grammar of Finnish. Arshavir Blackwell reveals why understanding Finnish morphology offers a powerful analogy for interpreting the compositional logic inside modern AI systems.

Dive into how and why large language models like ChatGPT mirror the human Mandela Effect, reproducing our collective false memories and misquotations. Arshavir Blackwell examines the science behind errors in models and minds, and explores how new techniques can counteract these uncanny AI confabulations.

How do millions of computations inside large language models add up to something like understanding? This episode explores the latest breakthroughs in mechanistic interpretability, showing how tools like representational geometry, circuit decomposition, and compression theory illuminate the missing middle between circuits and meaning. Join Arshavir Blackwell as he opens the black box and challenges what we really mean by 'understanding' in machines.

Embark on a step-by-step journey through the inner workings of transformer models like those powering ChatGPT. Arshavir Blackwell breaks down how context, attention, and high-dimensional geometry turn isolated tokens into fluent, meaningful language—revealing the mathematics of understanding inside the black box.

Today we explore whether mechanistic interpretability could hold the key to building leaner, more transparent—and perhaps even smarter—large language models. From knowledge distillation and pruning to low-rank adaptation, we examine cutting-edge strategies to make AI models both smaller and more explainable. Join Arshavir as he breaks down the surprising challenges of making models efficient without sacrificing understanding.

Explore how large language models use high-dimensional geometry to produce intelligent behavior. We peer into the mathematical wilderness inside transformers, revealing how intuition fails, and meaning emerges.

Can We Fix It?

Arshavir Blackwell takes you on a journey inside the black box of large language models, showing how cutting-edge methods help researchers identify, understand, and even fix the inner quirks of AI. Through concrete case studies, he demonstrates how interpretability is evolving from an arcane art to a collaborative science—while revealing the daunting puzzles that remain. This episode unpacks the step-by-step workflow and surprising realities of mechanistically mapping model cognition.

Delve into the mysterious world of neural circuits within large language models. We’ll dismantle the jargon, connect these abstract ideas to real examples, and discuss how circuits help bridge the gap between machine learning and human cognition.

Mechanistic interpretability and artificial psycholinguistics are transforming our understanding of large language models. In this episode, Arshavir Blackwell explores how probing neural circuits, behavioral tests, and new tools are unraveling the mysteries of AI reasoning.