Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education


The Weird Geometry That Makes AI Think

Explore how large language models use high-dimensional geometry to produce intelligent behavior. We peer into the mathematical wilderness inside transformers, revealing how intuition fails and meaning emerges.



Chapter 1

Imported Transcript

Arshavir Blackwell, PhD

Welcome to Inside the Black Box. I’m Arshavir Blackwell, and today we continue our exploration of that persistent question: how do large language models actually think? Last time, we examined debugging and circuit-mapping. This episode goes one step further—into the strange geometry that underlies these models.

Arshavir Blackwell, PhD

Inside a transformer—systems like ChatGPT or Claude—information doesn’t live in the familiar three-dimensional world of our senses. It unfolds in vector spaces with thousands, even tens of thousands, of dimensions. Ordinary intuition fails here. It’s an alien landscape—one where human reasoning loses its footing, and yet where these models find theirs.

Arshavir Blackwell, PhD

In such spaces, two random directions are almost perfectly orthogonal—nearly at right angles. Each random vector has its own direction, with almost no overlap. Imagine a crowded stadium in which every person somehow has complete personal space.
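
A quick numerical check makes this concrete. The sketch below (NumPy, with dimensions and pair counts chosen arbitrarily for illustration) samples pairs of random vectors and shows their average cosine similarity collapsing toward zero as the dimension grows.

```python
# A quick check that random directions decorrelate as dimension grows.
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=1000):
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for dim in (3, 100, 10_000):
    print(dim, round(mean_abs_cosine(dim), 4))
# Typical trend: roughly 0.5 in 3-D, ~0.08 in 100-D, ~0.008 in 10,000-D.
# Random directions become nearly orthogonal as the dimension grows.
```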

Arshavir Blackwell, PhD

But the vectors in a trained neural network aren’t random. They’re shaped by learning, allowing the model to represent similarity and meaning. Related concepts overlap just enough for the model to capture relationships. That overlap is how the model works.

Arshavir Blackwell, PhD

Almost all of the volume in these high-dimensional spaces lies near the surface, not the center. In five thousand dimensions, more than 99.999% of the volume of a hypersphere is concentrated in a thin shell at the edge. The computations in language models happen out there—where mathematical intuition breaks down.
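
The shell claim follows from a one-line formula: in d dimensions, the fraction of a ball's volume lying inside radius (1 − ε)R is (1 − ε)^d, so the outer shell of relative thickness ε holds the rest. A minimal check for the figures quoted above, taking d = 5,000 and a shell covering the outer 1% of the radius:

```python
# Fraction of a d-dimensional ball's volume inside radius (1 - eps) * R is
# (1 - eps) ** d; the outer shell holds everything else.
d, eps = 5000, 0.01                      # 5,000 dimensions, outer 1% of the radius
inner_fraction = (1 - eps) ** d          # about 1.5e-22: essentially nothing is inside
print(f"volume in the outer shell: {1 - inner_fraction:.10f}")   # effectively 1.0
```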

Arshavir Blackwell, PhD

Modern large language models have residual streams—the superhighways information travels on. These are thousands of dimensions wide—roughly ten to twelve thousand. You might think that means one feature per dimension, neatly separated. But that’s far from true.

Arshavir Blackwell, PhD

Through superposition, models pack far more features than dimensions allow. Features overlap, share space, and sometimes interfere—like an overbooked hotel where multiple guests share rooms. The model constantly disentangles these signals as it thinks.
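
Here is a toy sketch of the superposition idea, under assumed toy numbers (100 random feature directions squeezed into 40 dimensions), not the mechanics of any real model. A sparse set of features is written into the shared space and read back with dot products; the small errors in the readout are the interference.

```python
# Toy superposition sketch: 100 "features" share a 40-dimensional space as random,
# nearly orthogonal directions. Activate three of them, then read every feature
# back with a dot product. Recovery is approximate because the directions overlap.
import numpy as np

rng = np.random.default_rng(1)
n_features, dim = 100, 40

F = rng.standard_normal((n_features, dim))
F /= np.linalg.norm(F, axis=1, keepdims=True)    # one unit vector per feature

activations = np.zeros(n_features)
active = [3, 17, 42]
activations[active] = 1.0                        # a sparse set of active features

x = F.T @ activations                            # superposed 40-d representation
readout = F @ x                                  # dot product against every feature direction

print("active readouts:", readout[active].round(2))                 # each roughly 1, plus interference
print("largest inactive readout:", np.delete(readout, active).max().round(2))
# The active features typically stand out above the interference floor, even
# though far more features than dimensions share the space.
```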

Arshavir Blackwell, PhD

And this is where geometry gets tricky. Both direction and magnitude matter. Cosine similarity tells us whether two vectors point the same way—often a clue to semantic similarity—while the length of a vector can encode confidence or importance. In high dimensions, distances themselves start to blur: random points all lie roughly the same distance apart. The model has to reason through this concentrated geometry.
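
The distance blur is easy to verify numerically. This sketch (NumPy plus SciPy's pdist, with arbitrary sample sizes) draws random points and reports the relative spread of their pairwise distances, which shrinks as the dimension grows.

```python
# Distance concentration: as dimensionality grows, pairwise distances between
# random points bunch up around a single typical value, so "near" and "far"
# become much less informative.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)

for dim in (2, 100, 10_000):
    pts = rng.standard_normal((200, dim))
    dists = pdist(pts)                              # all pairwise Euclidean distances
    print(dim, round(dists.std() / dists.mean(), 3))  # relative spread shrinks toward zero
```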

Arshavir Blackwell, PhD

So the model is juggling overlapping features, interference, direction, magnitude, and the odd behavior of distance in high-dimensional space. Understanding this geometry is key to reverse-engineering what’s happening inside.

Arshavir Blackwell, PhD

This isn’t just academic. It matters for AI safety and control. If we can’t interpret what happens inside these networks, we can’t reliably steer them. Adversarial attacks prove the point: tiny changes—imperceptible to humans—can send the model in completely different directions. Those vulnerabilities live in dimensions we can’t see.
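
As a hedged illustration of why such directions exist, here is a toy gradient-sign perturbation (in the spirit of FGSM, applied to a stand-in linear classifier rather than any real language model). A nudge aligned with the loss gradient moves the output far more than a random nudge of identical size.

```python
# Hedged toy sketch, not an attack on a real system: a tiny perturbation aligned
# with the loss gradient shifts the output far more than a random perturbation
# of the same element-wise size. The vulnerable directions are ones we never see.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4096, 2)                     # stand-in classifier head
x = torch.randn(1, 4096, requires_grad=True)         # stand-in input embedding

logits = model(x)
loss = torch.nn.functional.cross_entropy(logits, logits.argmax(dim=-1))
loss.backward()

eps = 0.01
adv = eps * x.grad.sign()                            # FGSM-style gradient-sign direction
rand = eps * torch.sign(torch.randn_like(x))         # random direction, same magnitude

with torch.no_grad():
    print("random nudge:  ", (model(x + rand) - logits).abs().max().item())
    print("gradient nudge:", (model(x + adv) - logits).abs().max().item())
# The gradient-aligned nudge shifts the logits by an order of magnitude more,
# even though both perturbations are equally tiny element-wise.
```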

Arshavir Blackwell, PhD

Alignment has geometric roots too. Recent research on steering vectors shows that we can influence model behavior by nudging activation space itself—making models more truthful or consistent. But to steer well, we first have to understand the terrain.
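
A generic sketch of the idea in PyTorch, not the procedure from any specific paper: register a forward hook on one layer and add a fixed steering vector to its output at inference time. The steering vector and model here are random stand-ins; in published work the vector is typically derived from contrasting prompts.

```python
# Hedged sketch of activation steering on a toy network: add a fixed vector to
# one layer's output during the forward pass and watch the downstream output shift.
import torch

torch.manual_seed(0)
d = 64
model = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),      # layer whose output we steer
    torch.nn.Linear(d, d), torch.nn.ReLU(),
    torch.nn.Linear(d, 10),
)

steering_vector = torch.randn(d)                 # stand-in; normally built from contrasting prompts

def steer(module, inputs, output):
    return output + 4.0 * steering_vector        # nudge the activation space

x = torch.randn(1, d)
baseline = model(x)
handle = model[1].register_forward_hook(steer)   # hook the first ReLU's output
steered = model(x)
handle.remove()

print((steered - baseline).norm().item())        # downstream behavior shifts
```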

Arshavir Blackwell, PhD

Let’s make this concrete. Imagine the model processing “The cat sat on the…” When cat enters, it’s already a vector with thousands of coordinates. Its meaning isn’t stored in discrete slots like “animal = dimension 47.” The features—animal, pet, furry, domesticated—are spread across many dimensions in overlapping patterns. As that vector moves through the network, each layer transforms it.

Arshavir Blackwell, PhD

Attention mechanisms look at context—other words nearby—and mix information accordingly. Feed-forward layers reshape the vector itself, adjusting the weight of each feature. Every transformation rotates, stretches, or repositions patterns in this high-dimensional space. The model learned these transformations from training, but at inference time, it’s just geometry in motion.
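
To make "geometry in motion" literal, here is a minimal single-head attention plus feed-forward update in NumPy, with random weights standing in for everything a real model learns and the causal mask omitted for brevity.

```python
# Minimal single-head attention + feed-forward sketch (illustrative toy only).
import numpy as np

rng = np.random.default_rng(3)
d = 16                                           # toy embedding width
X = rng.standard_normal((5, d))                  # 5 token vectors, e.g. "The cat sat on the"

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

# Attention: each token mixes in information from the others, weighted by alignment.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))                # how much each token attends to each other token
X = X + A @ V                                    # residual update: context mixed into each vector

# Feed-forward: reshape each vector on its own, stretching some directions, damping others.
X = X + np.maximum(X @ W1, 0) @ W2

print(X.shape)                                   # still (5, 16): same space, new positions
```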

Arshavir Blackwell, PhD

Now take the word sat. The model needs to know what earlier tokens matter. The representation of sat forms a query vector, asking a question. Each previous token carries a key vector, a possible answer. The model compares them by measuring how much they align. The key from cat scores high—not because of grammar rules, but because the model learned that subjects and verbs share this geometric relationship. It’s pattern matching in vector space.
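
With purely hypothetical three-dimensional vectors (real models use thousands of dimensions), the scoring step looks like this: the query for "sat" aligns with the key for "cat", so "cat" takes most of the attention weight after the softmax.

```python
# Toy numbers only, not taken from any real model: the query for "sat" aligns
# more with the key for "cat" than with the key for "The".
import numpy as np

q_sat = np.array([0.9, 0.2, -0.1])
keys = {"The": np.array([0.1, -0.8, 0.3]),
        "cat": np.array([0.8, 0.3, 0.0])}

scores = np.array([q_sat @ k for k in keys.values()])
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the alignment scores
print(dict(zip(keys, weights.round(2))))          # "cat" gets most of the weight
```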

Arshavir Blackwell, PhD

Different layers specialize in different things. Early layers tend to capture syntax—who’s the subject, what’s the verb. Middle layers encode relationships: who’s doing what to whom. Later layers handle meaning and prediction. By the time we reach the final the, the representation has entered an attractor region—a zone where likely completions like “mat,” “couch,” or “rug” cluster together. The geometry does the work.

Arshavir Blackwell, PhD

This is where mechanistic interpretability—MI—comes in. MI is about reverse-engineering these circuits. Researchers trace which components activate for which features, mapping specific computations to specific mechanisms. Take induction heads—discovered by Anthropic’s interpretability team. These are small circuits that detect repeated patterns. If the model sees “A B … A,” the induction head learns to predict “B.” It’s a clear, mechanical behavior.
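
The rule itself is simple enough to state in a few lines. This toy function implements the induction pattern as a plain lookup over the context; it illustrates the behavior, not the attention-head circuitry that learns it.

```python
# Toy statement of the induction rule: if the current token appeared earlier,
# predict whatever followed it the last time it appeared.
def induction_guess(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, 0, -1):       # scan the earlier context backwards
        if tokens[i - 1] == current:
            return tokens[i]                      # the token that followed it last time
    return None

print(induction_guess(["A", "B", "C", "A"]))                # -> "B"
print(induction_guess(["the", "cat", "sat", "on", "the"]))  # -> "cat"
```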

Arshavir Blackwell, PhD

More recently, sparse autoencoders have helped us unpack these entangled representations. Instead of each neuron doing ten things at once, SAEs reveal directions that correspond to individual concepts. It’s like putting on glasses that let us see structure that was always there, just hidden.
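
Here is a minimal sparse autoencoder sketch in PyTorch, using the common formulation (an overcomplete dictionary, a ReLU encoder, and an L1 sparsity penalty); the sizes and coefficients are illustrative assumptions, not settings from any particular paper.

```python
# Minimal sparse autoencoder: expand activations into many candidate features,
# keep most of them near zero, and reconstruct the original activation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activation -> sparse features
        self.decoder = nn.Linear(d_features, d_model)   # sparse features -> reconstruction
        self.relu = nn.ReLU()

    def forward(self, activations):
        features = self.relu(self.encoder(activations)) # most entries end up near zero
        return self.decoder(features), features

sae = SparseAutoencoder()
acts = torch.randn(32, 768)                             # stand-in residual-stream activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
print(loss.item())
```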

Arshavir Blackwell, PhD

And this connects to what we call LLM-ology—studying models empirically, almost like cognitive psychology. MI shows us the wiring; LLM-ology shows the behavior. Together, they reveal how meaning moves through the network.

Arshavir Blackwell, PhD

When a model processes a sentence, its internal representation literally travels through high-dimensional space. Sentences with numbers, pronouns, or causal words often trace similar paths. The model has computational tendencies—habits of motion we’re only beginning to chart.

Arshavir Blackwell, PhD

We can locate some structure with sparse autoencoders, but much remains hidden—subtle signals like irony, humor, or moral tone may live in remote corners of this space. This is where cognitive science meets geometry, and where interpretability meets mystery.

Arshavir Blackwell, PhD

Taken together, MI reveals the static wiring, while LLM-ology follows the dynamics—the motion of meaning through space. Transformers don’t store rules; they sculpt and navigate geometry.

Arshavir Blackwell, PhD

And that geometry, as powerful as it is, brings fragility. Most points in these vast spaces are meaningless noise. The meaningful regions—the semantic manifolds—occupy only a thin sliver of the hypersphere. That’s the curse of dimensionality: the expressiveness that makes these models so capable also makes them precarious.

Arshavir Blackwell, PhD

Even visualization struggles here. When we project thousands of dimensions down to two or three—using PCA or t-SNE—the true relationships blur. The result is an aesthetic map, not a faithful one. Even with modern tools, our view of these models is still partial—a sketch of a landscape we can never fully see.
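
For a sense of how lossy this is, the sketch below runs the standard scikit-learn projections on stand-in random "activations" and reports how little of the total variance a two-dimensional PCA actually keeps.

```python
# Projection sketch with scikit-learn: squeeze synthetic 1,000-dimensional
# "activations" down to 2-D. Useful for browsing, but most geometry is lost.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
acts = rng.standard_normal((500, 1000))          # stand-in high-dimensional activations

pca = PCA(n_components=2).fit(acts)
pca_2d = pca.transform(acts)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(acts)

print(pca_2d.shape, tsne_2d.shape)               # both (500, 2)
print(f"variance kept by 2-D PCA: {pca.explained_variance_ratio_.sum():.1%}")  # a tiny fraction
```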

Arshavir Blackwell, PhD

And yet, we keep mapping. Every new method—mechanistic interpretability, sparse autoencoders, the emerging science of LLM-ology—takes us a step closer to understanding how these systems think. I’m Arshavir Blackwell, and this has been Inside the Black Box.