Arshavir Blackwell, PhD

Inside the Black Box: Cracking AI and Deep Learning

Technology · Education


What Counts as Structure? From Harris and Elman to Today’s Neural Nets

This episode of Inside the Black Box: Cracking AI and Deep Learning tells the story of an unexpected convergence in the history of language and AI. In 1995, Peter Bensch noticed that Zellig Harris, a mid‑century structural linguist, and Jeff Elman, a pioneer of simple recurrent networks, had independently uncovered the same deep insight about language: structure lives in patterns of use.

Arshavir Blackwell, PhD, guides listeners through Harris’s world of distributional linguistics and operator grammar—where you infer structure from where words can substitute for one another—and contrasts it with Elman’s tiny recurrent neural networks that learn to predict the next word. Along the way, we see how these very different traditions arrive at the same place: hidden geometric structure in how language is used.

From there, the episode bridges to today’s large language models and mechanistic interpretability, asking a deceptively simple question: what counts as "structure" inside a model? We explore how patterns, clusters, and features relate to genuine internal organization, and why Harris and Elman’s convergence still shapes how we think about circuits, features, and the geometry of meaning in modern AI.



Chapter 1

Transcript

Arshavir Blackwell, PhD

Today I want to talk about a question that I think is quietly sitting at the center of mechanistic interpretability — and that almost nobody is asking explicitly. The question is: what counts as structure? I'm going to get there through an unusual route. I want to start with a paper from 1995 — a paper most people in MI have never heard of — and show why it's more relevant to the current state of the field than almost anything published this year.

Arshavir Blackwell, PhD

The paper is by Peter Bensch. It's called "Neo-structuralism," and it appeared in the CRL Newsletter at UC San Diego. What Bensch noticed was a striking convergence between two thinkers who came from completely different intellectual worlds. The first is Zellig Harris — a structural linguist working from the 1940s through the 1960s. Harris argued that you could derive the structure of language purely from distributional evidence.

Arshavir Blackwell, PhD

You don't need to know what words mean. You don't need native speaker intuitions. You just look at what appears where — what substitutes for what, what co-occurs with what — and the categories emerge from the patterns. The second is Jeff Elman — a linguist and cognitive scientist who, in 1990, introduced the Simple Recurrent Network. Elman trained this network to predict the next word in a sequence. No grammar rules. No labeled categories.

Arshavir Blackwell, PhD

Just raw sequential input. And when he opened the network up and looked at the hidden unit activations, he found that the internal representations had organized into clusters that corresponded to grammatical and semantic distinctions. Nouns separated from verbs. Animate entities grouped together.

Arshavir Blackwell, PhD

Transitive and intransitive verbs formed distinct regions of the network's state space. Nobody told the network what a noun was. It figured it out because nouns appear in similar contexts. Bensch's insight was that despite the vast differences in methodology — algebraic substitution patterns on one hand, vector dynamics in a neural network on the other — Harris and Elman were doing the same thing.
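
If you want to see the Elman side of this concretely, here is a small sketch of that kind of experiment. It is not his actual setup: the toy grammar, vocabulary, and network sizes are illustrative choices, but the recipe is his, train a tiny recurrent network to predict the next word and then cluster its hidden states.

```python
# Sketch of an Elman-style experiment: train a small recurrent network on
# next-word prediction over a toy corpus, then cluster the hidden states it
# settles into after each word. Grammar, vocabulary, and sizes are illustrative.
import random
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import linkage, dendrogram

random.seed(0)
torch.manual_seed(0)

nouns_anim = ["boy", "girl", "dog", "cat"]
nouns_inan = ["rock", "book", "cookie", "ball"]
verbs_trans = ["chases", "sees", "breaks"]
verbs_intrans = ["sleeps", "exists"]
vocab = nouns_anim + nouns_inan + verbs_trans + verbs_intrans
idx = {w: i for i, w in enumerate(vocab)}

def sentence():
    # Animate subjects; transitive verbs take an object, intransitive verbs don't.
    subj = random.choice(nouns_anim)
    if random.random() < 0.5:
        return [subj, random.choice(verbs_trans), random.choice(nouns_anim + nouns_inan)]
    return [subj, random.choice(verbs_intrans)]

corpus = [w for _ in range(800) for w in sentence()]
data = torch.tensor([idx[w] for w in corpus])

class SRN(nn.Module):
    def __init__(self, v, d=16, h=32):
        super().__init__()
        self.emb = nn.Embedding(v, d)
        self.rnn = nn.RNN(d, h, batch_first=True)   # Elman-style simple recurrence
        self.out = nn.Linear(h, v)
    def forward(self, x):
        hidden, _ = self.rnn(self.emb(x))
        return self.out(hidden), hidden

model = SRN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)
for _ in range(200):   # next-word prediction is the only training signal
    logits, _ = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), y.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Average the hidden state reached right after reading each word, then cluster.
with torch.no_grad():
    _, hidden = model(x)
means = torch.stack([hidden[0][data[:-1] == idx[w]].mean(0) for w in vocab])
# In a toy run like this, the leaf ordering tends to separate nouns from verbs
# and animates from inanimates, with nothing labeled in advance.
print(dendrogram(linkage(means.numpy(), "ward"), labels=vocab, no_plot=True)["ivl"])
```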

Arshavir Blackwell, PhD

They were both uncovering linguistic structure through distributional regularities. Now, you might be thinking — that's interesting history, but what does a 1995 paper about a linguist and a simple recurrent network have to do with mechanistic interpretability in 2026? Think about what modern MI tools do.

Arshavir Blackwell, PhD

Sparse autoencoders recover directions in activation space that correspond to interpretable patterns. Months of the year. Indirect objects. Programming languages. Stylistic registers. At minimum, what an SAE gives you is an extensional grouping — a set of contexts that co-vary in high-dimensional space.
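
For listeners who want the mechanics, here is a minimal sketch of a sparse autoencoder in that spirit. It is not any particular lab's architecture; the dictionary size, the L1 coefficient, and the stand-in activation matrix are all placeholder assumptions.

```python
# Minimal sparse autoencoder sketch: reconstruct activations through an
# overcomplete ReLU bottleneck, with an L1 penalty that keeps features sparse.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)
    def forward(self, x):
        f = torch.relu(self.enc(x))        # nonnegative feature activations
        return self.dec(f), f

def train_sae(acts, d_features=1024, l1_coef=1e-3, steps=1000, lr=1e-3):
    sae = SparseAutoencoder(acts.shape[1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (256,))]
        recon, f = sae(batch)
        # Reconstruction keeps features faithful; the L1 term keeps them sparse.
        loss = ((recon - batch) ** 2).mean() + l1_coef * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Stand-in for real residual-stream activations; in practice these would come
# from running a model over a corpus and caching one layer's activations.
acts = torch.randn(10_000, 768)
sae = train_sae(acts)
# Each decoder column is a direction in activation space; the set of contexts
# where its feature fires is the extensional grouping described above.
```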

Arshavir Blackwell, PhD

That is exactly what Harris was doing. Two elements belong to the same category if they share distributional environments. The method has scaled up enormously — from corpora of English sentences to the internal activations of billion-parameter models — but the underlying logic is the same.
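
Here is that shared logic in miniature, on the Harris side. The toy corpus and the similarity threshold are purely illustrative: characterize each word by the environments it occurs in, then treat two words as the same category when their environment vectors line up.

```python
# Harris's procedure in miniature: describe each word by its left and right
# neighbors, then group words whose environment profiles are similar.
# The toy corpus and the 0.5 similarity threshold are illustrative choices.
from collections import defaultdict
import math

corpus = ("the boy sees the dog . the girl sees the cat . "
          "the dog sleeps . the cat sleeps . the boy breaks the ball .").split()

envs = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    envs[w][("L", corpus[i - 1] if i > 0 else "<s>")] += 1
    envs[w][("R", corpus[i + 1] if i + 1 < len(corpus) else "</s>")] += 1

def cosine(a, b):
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

words = sorted(envs)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if cosine(envs[w1], envs[w2]) > 0.5:
            # e.g. boy ~ girl, dog ~ cat: same category by distribution alone
            print(w1, "~", w2)
```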

Arshavir Blackwell, PhD

You're finding things that pattern together. And when Elman analyzed the hidden states of his SRN, he was doing something we'd now call mechanistic interpretability. He didn't just evaluate the network's predictions. He opened it up. He asked what internal organization had formed. He was doing MI before MI had a name. Here's where it gets interesting — and where I think the field has a problem it hasn't fully reckoned with.

Arshavir Blackwell, PhD

The word "structure" gets used constantly in MI. We talk about uncovering structure in activations, identifying structural features, mapping a model's internal ontology. But the word is doing triple duty, and the three senses are not the same. First sense: extensional structure. A sparse feature clusters tokens by co-activation. Things that show up in similar contexts end up nearby in activation space.

Arshavir Blackwell, PhD

That's a pattern of similarity in distribution. Second sense: causal structure. A feature that, when you intervene on it, predictably changes the model's output. That tells you the feature is playing a role in computation. That's a stronger claim. Third sense: computational primitive. A feature that the system can't easily reorganize away without degrading performance.

Arshavir Blackwell, PhD

That's the strongest claim — you've found something the model actually needs. These three things are related, but they're not the same. And here's the problem: in practice, MI discourse often jumps straight from the first to the third. Someone finds a sparse direction that activates for a coherent set of tokens — say, deceptive-sounding text — and the framing becomes "the model has a feature for deception," as if we've identified a fundamental building block of the model's internal organization. But all we've really shown is that certain tokens cluster together in activation space. That's extensional structure.
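
The gap between that first sense and the second one is something you can actually test. Here is a minimal sketch of the causal check, assuming a stock GPT-2 loaded through Hugging Face; the layer index, prompt, and candidate direction are placeholders, and a real experiment would use a direction recovered by an SAE rather than a random vector.

```python
# Sketch of a causal check: project a candidate feature direction out of the
# residual stream at one layer and see whether the next-token distribution
# actually moves. Model, layer, prompt, and direction are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

layer = 6
direction = torch.randn(model.config.n_embd)   # placeholder; use an SAE direction in practice
direction = direction / direction.norm()

def ablate(module, inputs, output):
    # Remove the direction's component from this block's hidden states.
    hidden = output[0]
    hidden = hidden - (hidden @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:]

prompt = tok("The treaty was signed in", return_tensors="pt")
with torch.no_grad():
    base = model(**prompt).logits[0, -1].softmax(-1)
    handle = model.transformer.h[layer].register_forward_hook(ablate)
    ablated = model(**prompt).logits[0, -1].softmax(-1)
    handle.remove()

# A tiny shift suggests sense one without sense two: the direction clusters
# tokens but plays little causal role in this particular computation.
print("total variation shift:", 0.5 * (base - ablated).abs().sum().item())
```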

Arshavir Blackwell, PhD

Going from there to an ontological claim — a claim that something actually exists inside the model as a real building block of its computation — that's a leap, and it's a leap the field makes routinely without flagging it. This is where Harris actually complicates things, but in a way I find productive. By Harris's own criteria, MI has already found structure. Distributional clustering is structure in the Harris sense. If tokens pattern together, they form a category.

Arshavir Blackwell, PhD

That's all there is to say. Harris wasn't asking whether speakers had mental boxes corresponding to his categories. He was describing distributional regularities and treating them as sufficient. But MI doesn't actually operate within Harris's framework — even though it inherited his methods. The field implicitly wants something stronger. It wants to know not just that tokens cluster, but that the model uses those clusters as units of its own computation.

Arshavir Blackwell, PhD

That they have an effect on output. That computation routes through them. That they're genuine components of the model's processing, not statistical artifacts that happen to be visible from the outside. That's a fundamentally different kind of claim. It's the difference between describing a pattern in the data and asserting something about the architecture of the system that uses it. The field's tools find patterns. Its claims are about what the model is. That mismatch — and the fact that nobody talks about it — is where a lot of the confusion comes from. The tools come from one tradition. The claims are aimed at something that tradition never attempted.

Arshavir Blackwell, PhD

Until the field makes that distinction explicit, it's going to keep oscillating between two standards of evidence without recognizing that they answer different questions. So what would count as structure? Here's my position: an individual feature, no matter how cleanly recovered, is not yet structure. Structure begins when features stand in systematic, compositional relationships with one another — when the directions form a relational algebra, not just a dictionary. Harris understood this. His categories were never defined in isolation.

Arshavir Blackwell, PhD

A noun wasn't a noun because of what it meant — it was a noun because of the entire lattice of environments it shared with other nouns and the operators that selected for it. The category only existed as part of a relational system. Elman's network showed the same thing geometrically.

Arshavir Blackwell, PhD

The interesting finding wasn't that nouns ended up near other nouns in state space. It was that the noun region bore a systematic geometric relationship to the verb region — and that relationship mirrored how nouns and verbs actually interact in language. The structure was in how the neighborhoods related to each other, not in any single neighborhood. This is what MI hasn't fully grappled with. The field has built powerful tools for recovering individual features.

Arshavir Blackwell, PhD

But a catalogue of features, however precise, is a parts list. It's not a grammar. Structure requires an account of how the parts compose — how features combine, constrain, and select for one another across layers and contexts. Let me put this differently, because I think there's a clean way to frame what's missing.

Arshavir Blackwell, PhD

The field's current tools are lexical. Sparse autoencoders and transcoders recover individual features — the vocabulary of the model's internal representations. Even cross-layer transcoders, which are a genuine advance because they decompose computation rather than static activations, still operate at the level of individual units.

Arshavir Blackwell, PhD

What's missing are tools that operate at the sentence level. And I don't mean that literally — I don't mean tools that process sentences as input. I mean tools that capture compositional relationships between features the way a sentence captures compositional relationships between words. A sentence isn't a list of words. It's a system of relationships that determines which combinations are grammatical and which aren't.

Arshavir Blackwell, PhD

The same distinction applies inside a model. Knowing the vocabulary is necessary. But it's the compositional grammar — how features bind into coherent, productive patterns of computation — that would constitute an actual structural theory. Harris worked at this level. His distributional analysis was never just about which words appeared in which contexts. It was about the substitutional and combinatorial relations that organized those contexts into a system. Elman's network learned at this level — it internalized not just word categories but the selectional relationships between them.
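
Nobody has that sentence-level toolkit yet, so take this as a speculative sketch rather than an existing method: one crude first step might be to study relations between features, asking which ones license or exclude one another across contexts, instead of cataloguing them one at a time. The feature matrix and the co-firing statistic here are assumptions for the sake of illustration.

```python
# Speculative sketch of a "relational" rather than "lexical" probe: given SAE
# feature activations over many contexts (rows = contexts, columns = features),
# ask which features co-fire more or less than chance. Pointwise mutual
# information is one simple choice, not an established interpretability method.
import numpy as np

def firing_pmi(feature_acts, threshold=0.0, eps=1e-9):
    fires = (feature_acts > threshold).astype(float)   # binarized firings
    p = fires.mean(axis=0)                             # marginal firing rates
    joint = fires.T @ fires / fires.shape[0]           # pairwise co-firing rates
    # Positive PMI: candidate selectional relation; negative: candidate contrast.
    return np.log((joint + eps) / (np.outer(p, p) + eps))

# Stand-in data; real feature activations would come from an SAE run over a corpus.
acts = np.abs(np.random.randn(10_000, 512))
pmi = firing_pmi(acts, threshold=1.0)
i, j = np.unravel_index(np.argmax(pmi - np.eye(512) * 1e9), pmi.shape)
print(f"most strongly co-firing pair: features {i} and {j} (PMI {pmi[i, j]:.2f})")
```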

Arshavir Blackwell, PhD

The question for MI is whether its tools can be extended to operate at this level as well. I started with a 1995 paper about a linguist and a connectionist. The reason I keep coming back to it is that the lineage from Harris through Elman to modern MI is not a story of simple inheritance. It's a story of recurring problems. The distributional tradition asked: how does structure arise from patterns of use? Connectionism asked: what internal organization does a learner develop? Mechanistic interpretability asks both of these questions at once, at a scale neither tradition imagined.

Arshavir Blackwell, PhD

But the most important question in that lineage — when do the relationships among elements constitute a compositional system, and when are they just statistical regularities that happen to cluster? — remains open. If MI is going to mature into a theory of learned computation, rather than a collection of increasingly refined tools, that's the question it will have to answer. I'm Arshavir Blackwell and this has been Inside the Black Box.