Unlocking BERT's Hidden Grammar
Explore how BERT’s attention heads reveal an emergent understanding of language structure without explicit supervision. Discover the role of attention as a form of memory and what it means for the future of AI language models.
Chapter 1
Introduction
Arshavir Blackwell, PhD
I'm Arshavir Blackwell, and you're listening to Inside the Black Box. Today, we're opening up the BERT transformer, one of the first language models where the question shifted from "how does it perform?" to "what does it process, and how does it process it?" Our last episode discussed folk interpretability, using a model's self-reports to generate hypotheses about its internal states. In this episode, we look deeper at mechanism: do attention heads form a kind of memory for language, and if so, how? The more you look at these models, the weirder they get!
Chapter 2
Early Discoveries of Attention Heads
Arshavir Blackwell, PhD
It's 2019. This is ancient history by AI time. Clark, Khandelwal, Levy, and Manning publish "What Does BERT Look At?" There were no sparse autoencoders and no causal tracing yet, tools that are now standard in model analysis. There was just looking at raw attention maps. They discovered that not all heads are created equal. Some heads have a narrow focus and point at very specific positions in the input, such as always looking one word backwards or two words forwards. Others spread their attention across the entire sentence, kind of like a "bag-of-words" average. Some heads focus on punctuation. This is obvious to us today in 2026, but at the time these were novel findings.
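If you want to poke at this yourself, here's a minimal sketch of that kind of raw attention-map inspection, using the HuggingFace transformers library with PyTorch. The model name bert-base-uncased, the example sentence, and the 0.8 "mostly one offset" cutoff are my own illustrative choices, not anything from the paper.

```python
# A minimal sketch: pull BERT's raw attention maps and flag heads that mostly
# attend at a fixed positional offset (e.g. "always one token back").
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat chased the mouse.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq, seq)
seq_len = inputs["input_ids"].shape[1]
positions = torch.arange(seq_len)

for layer_idx, layer_attn in enumerate(outputs.attentions):
    attn = layer_attn[0]                       # drop batch dim: (heads, seq, seq)
    for head_idx in range(attn.shape[0]):
        # For each query token, find the key position it attends to most.
        argmax = attn[head_idx].argmax(dim=-1)
        offsets = argmax - positions
        mode_offset = torch.mode(offsets).values
        frac = (offsets == mode_offset).float().mean().item()
        # Crude heuristic for a "positional" head: most queries point at the
        # same fixed offset (the 0.8 cutoff is arbitrary).
        if frac > 0.8:
            print(f"layer {layer_idx} head {head_idx}: "
                  f"offset {mode_offset.item()} for {frac:.0%} of tokens")
```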
Arshavir Blackwell, PhD
Let's look at what happens with the [SEP] token. This is the sentence boundary marker, essentially a period. It absorbs a lot of attention from certain heads, especially in the middle layers. If you picture the attention matrix, you see this concentrated spot pulling in stray attention, even though [SEP] means almost nothing, just "sentence boundary." They called it an "attention sink." The softmax that normalizes attention forces each row of weights to add up to one, so if a head has nowhere meaningful to point its attention, it dumps the leftovers onto [SEP]. It's neural garbage collection, enforced by math.
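Continuing the setup from the sketch above, here's a rough way to quantify the sink: for each head, what fraction of its attention mass lands on [SEP]? The 0.5 cutoff is arbitrary; it just flags heads that dump most of their weight onto the boundary token.

```python
# Reuses tokenizer, inputs, and outputs from the previous sketch.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]

for layer_idx, layer_attn in enumerate(outputs.attentions):
    attn = layer_attn[0]                               # (heads, seq, seq)
    # Each attention row sums to one, so this is the average share of
    # attention mass each head sends to [SEP], per query position.
    sep_share = attn[:, :, sep_positions].sum(dim=-1).mean(dim=-1)
    for head_idx, share in enumerate(sep_share.tolist()):
        if share > 0.5:                                # arbitrary threshold
            print(f"layer {layer_idx} head {head_idx}: "
                  f"{share:.0%} of attention lands on [SEP]")
```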
Arshavir Blackwell, PhD
This attention sink isn't just a BERT phenomenon. Later, in autoregressive models, papers like Xiao et al. found similar beginning-of-sequence tokens pulling in attention not because of their meaning, but because the math makes them act as structural no-ops. Attention, in that sense, is not entirely about meaning. It also has these mechanical side effects, these artifacts that models exploit.
Chapter 3
Grammar Emergence through Attention Heads
Arshavir Blackwell, PhD
How does grammar emerge in these systems without explicit coaching? Clark and team asked this: if you look at certain attention heads, are they tracking actual grammatical relationships? Take head 8-10. When it's sitting on a verb, say "chased" in "the cat chased the mouse," does it point its attention at the direct object, "mouse"? Turns out, almost 87% of the time, yes. No one told BERT what a direct object is. This isn't in its training labels; it's just learning whatever helps it predict missing words, and yet it invents its own parser on the fly.
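Here's what that check looks like in miniature, again reusing the setup from the earlier sketches. The real evaluation in the paper runs over a dependency-annotated corpus; this single sentence, and the assumption that the paper's 1-indexed "head 8-10" maps to layer index 7, head index 9 in code, are just for illustration.

```python
# Where does head "8-10" send its attention from the verb "chased"?
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
verb_pos = tokens.index("chased")        # assumes "chased" is one wordpiece

# Paper numbers layers and heads from 1, so head 8-10 -> indices [7][9].
attn = outputs.attentions[7][0, 9]       # (seq, seq) attention matrix
obj_pos = attn[verb_pos].argmax().item()
print(f"'chased' attends most to '{tokens[obj_pos]}' "
      f"(weight {attn[verb_pos, obj_pos].item():.2f})")
```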
Arshavir Blackwell, PhD
There are many instances of this. Head 8-11 tracks determiners with close to 94% accuracy, so "the" points at "cat" or "mouse," or whatever noun it's attached to. Heads pick up prepositions, possessives, and even pull off some coreference. Given that classic systems struggled with pronouns, this is impressive. Head 9-6 was getting prepositions right more than three-quarters of the time, and head 4-10 tracked passives with 82.5% accuracy. Even "her" tracking "book" or "hand" through head 7-6 comes in around 80%. Distributional learning alone reproduces centuries of syntactic research.
Chapter 4
Multiple Heads Working Together
Arshavir Blackwell, PhD
Let's now turn to an examination of multiple heads working in concert. Clark et al. also looked at pairs and clusters of heads. Many heads perform the same job, especially within the same layer. That redundancy contributes to robustness: prune a head, and the system's performance often barely changes. This ensemble memory may contribute to transformer reliability.
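Head pruning is easy to try, since the transformers library exposes a prune_heads method on its BERT models. A minimal sketch, with the masked sentence and the choice of which head to remove (layer index 7, head index 9) as arbitrary examples:

```python
# "Prune a head and see what happens" with a masked-LM head.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def top_prediction(text):
    """Return the model's top guess for the single [MASK] token in text."""
    enc = tokenizer(text, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**enc).logits
    return tokenizer.decode([logits[0, mask_pos].argmax().item()])

sentence = "The cat chased the [MASK]."
before = top_prediction(sentence)

# Physically remove one attention head, then ask again and compare.
mlm.prune_heads({7: [9]})
after = top_prediction(sentence)
print(f"before pruning: {before!r}   after pruning: {after!r}")
```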
Arshavir Blackwell, PhD
This structural organization occurs by layer: early heads care about surface stuff, like position and punctuation; middle layers focus tightly on syntax; and higher layers shift to meaning, to semantics. This should sound familiar to anyone who remembers the "layers of processing" metaphor from old cognitive science, though here with more redundancy and fewer hard rules. But remember, just because the structure is there in the attention heads doesn't mean the model uses it every time it makes a prediction. Clark et al. didn't claim attention explains BERT's choices, simply that the patterns echo grammar. That's enough to show something grammar-like is being internalized, even if other mechanisms are in play.
Chapter 5
Attention as Memory and Causality
Arshavir Blackwell, PhD
Does attention actually function as memory in a causal sense, or are these heads just echoing grammatical relations? Early skeptics like Jain and Wallace showed you can scramble attention and sometimes get the same answer, suggesting attention might be a trace, not a driver. But (spoilers from after this 2019 paper) we now know that techniques like activation patching and head ablation let researchers intervene directly, and the evidence does support a causal role. The picture is nuanced, as attention patterns alone don't fully explain behavior, nor should we expect them to. But they're not merely decorative, either.
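Activation patching is conceptually simple: run the model on a clean input, cache one head's output, then run a corrupted input and splice the cached activation back in to see whether the prediction moves toward the clean answer. Here's a hedged sketch using PyTorch forward hooks; the module path, the per-head slicing, the chosen head, and the clean/corrupted sentence pair follow my reading of the HuggingFace BERT implementation and are illustrative, not anything from Clark et al.

```python
# A sketch of activation patching on a single BERT attention head via hooks.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

LAYER, HEAD, HEAD_DIM = 7, 9, 64        # paper's head 8-10; 768/12 = 64 dims per head
self_attn = model.bert.encoder.layer[LAYER].attention.self
cache = {}

def save_hook(module, args, output):
    # output[0] is the concatenated per-head context: (batch, seq, hidden)
    cache["clean"] = output[0].detach().clone()

def patch_hook(module, args, output):
    patched = output[0].clone()
    lo, hi = HEAD * HEAD_DIM, (HEAD + 1) * HEAD_DIM
    patched[..., lo:hi] = cache["clean"][..., lo:hi]   # overwrite this head's slice
    return (patched,) + tuple(output[1:])

def mask_logits(text):
    enc = tokenizer(text, return_tensors="pt")
    pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        return model(**enc).logits[0, pos]

# 1) Clean run: cache what the head produced.
handle = self_attn.register_forward_hook(save_hook)
clean = mask_logits("The cat chased the [MASK].")
handle.remove()

# 2) Corrupted run (same token length) with the clean head patched back in.
#    A full experiment would also compare against an unpatched corrupted run.
handle = self_attn.register_forward_hook(patch_hook)
patched = mask_logits("The dog chased the [MASK].")
handle.remove()

print("clean top guess:  ", tokenizer.decode([clean.argmax().item()]))
print("patched top guess:", tokenizer.decode([patched.argmax().item()]))
```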
Arshavir Blackwell, PhD
At the time, those "positional offset heads" Clark et al. found, the ones mechanically attending to the previous token or next token… their utility was unclear. But skipping ahead to Olsson and colleagues' work in 2022, it turns out those heads have significant value. They form the backbone of in-context learning. Induction circuits give models the ability to generalize from a few examples. One head finds where in the context a piece of text appeared before, another copies what followed. It's a sort of neural pattern search, "last time I saw 'cat', 'sat' came next, so maybe do it again." These mechanical heads are supporting some of the model's most humanlike learning.
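Just to make that pattern search concrete, here's a toy, model-free sketch of the induction step in plain Python. It has nothing to do with actual transformer weights; it only mimics the behavior: find the previous occurrence of the current token and copy whatever followed it.

```python
# Toy illustration of the induction pattern: "last time I saw this token,
# that one came next, so guess it again."
def induction_guess(tokens):
    current = tokens[-1]
    # Scan the earlier context right-to-left for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]        # copy what followed it last time
    return None                         # no earlier occurrence: no guess

context = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "cat"]
print(induction_guess(context))         # -> "sat"
```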
Chapter 6
Historical Context and Summary
Arshavir Blackwell, PhD
Early work on simple recurrent networks showed echoes of this same pattern: grammatical structure emerging without being explicitly taught to the network. Prediction pressure makes structure the most efficient way to compress the input. This is statistics stacking up until regularities fall out, not magic. Three decades later, it's the same story, except that the models now have far more complexity and many more free parameters.
Arshavir Blackwell, PhD
Zooming out… attention heads aren't the only part of the story, of course, but they are a solid window into how models extract structure. Their traces echo grammar, their redundancies buffer mistakes, and, together with simple positional mechanisms, they lay the groundwork for more sophisticated learning. I'm Arshavir Blackwell. Thanks for joining me inside the black box.
