MAY 14, 2026

Attention is just weighted memory, and memory is everything

NLP · Explainer · 4 min read

I wrote this to understand it myself. If it helps you understand it too, that's a bonus.

Here's a sentence: "The trophy didn't fit in the suitcase because it was too big."

Read it fast. You know immediately that "it" refers to the trophy, not the suitcase. You didn't consciously parse the grammar. Something in your reading process figured it out, pulled the right referent from memory, weighted it correctly, resolved the ambiguity without effort.

That something is roughly what self-attention is doing. And the "weighted" part is doing all the work.

What is self-attention?

The formal definition is this: self-attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between query vectors and key vectors.

That sentence is accurate but almost useless for intuition. Let me try to explain it another way.

Imagine you're reading a sentence one word at a time, but you have perfect memory of every word you've already seen, and you can selectively attend to different words with different intensities based on what you're currently trying to figure out.

When you reach the word "it" in that trophy sentence, you need to figure out what "it" refers to. Your attention mechanism asks: which earlier words are most relevant to resolving this ambiguity? It computes a kind of compatibility score between "it" (the query) and every previous word (the keys). "Trophy" scores high: it's a noun, and it's the subject of the clause that "because" is explaining. "Suitcase" is a noun too, so it scores somewhat high as well. "The", "didn't", and "fit" score low.

You then blend the meanings of all the previous words together, weighted by those scores. "Trophy" contributes a lot to your understanding of "it". "The" contributes almost nothing. The output is a new representation of "it" that's been enriched by the context of the whole sentence, selectively, not uniformly.

This is self-attention. Query, key, value. Ask a question (query), match it against all available memories (keys), retrieve a weighted blend (values).
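
Here's how I'd sketch that in NumPy. It's a minimal single-head version under my own assumptions: random matrices stand in for the learned projections, the input is five made-up token vectors, and the names (self_attention, w_q, w_k, w_v) are mine rather than any library's. The division by the square root of the key dimension is the scaling used in the original paper's scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:             (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    Returns the new representations and the attention weights.
    """
    q = x @ w_q  # what each token is asking
    k = x @ w_k  # what each token offers to be matched against
    v = x @ w_v  # what each token contributes if attended to

    # Compatibility of every query with every key, scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row becomes a set of non-negative weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Weighted blend of value vectors, one blend per token.
    output = weights @ v
    return output, weights

# Assumed toy setup: 5 tokens, random inputs and random (untrained) projections.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # (5, 8)
print(weights.shape)  # (5, 5)
```

Row i of weights is how much token i attends to each token in the sequence. With untrained projections the pattern is meaningless; the point is only the shape of the computation.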

The "self" in self-attention just means the sequence is attending to itself: the queries, keys, and values all come from the same sequence. This is distinct from the attention in older encoder-decoder models, where the decoder attended to encoder outputs (attending to a different sequence). In a transformer encoder, every token at every layer computes attention over every token in the same input, itself included. Decoder-style language models add a mask so that each token only attends to itself and the tokens before it.

This is also why transformers handle long-range dependencies so well compared to RNNs. An RNN processes sequences step-by-step and has to carry information forward through a fixed-size hidden state. The farther back something is, the harder it is to remember. Self-attention has no such constraint. Every token can directly attend to every other token regardless of distance. The trophy can be connected to the "it" forty tokens later in a single step, through one learned attention weight, instead of surviving forty hidden-state updates along the way.

Weighted memory is everything

The "weighted memory" framing also explains why multi-head attention matters.

A single attention head computes one set of query/key/value projections, one way of determining what's relevant. But language has multiple dimensions of relevance at once. The word "bank" might be syntactically relevant to the verb nearby, semantically relevant to "river" three positions back, and coreferentially relevant to a pronoun two sentences ahead. A single head produces only one set of weights per token, so it has to trade these kinds of relevance off against each other.

Multi-head attention runs several attention operations in parallel (eight heads in the base model of the original "Attention Is All You Need" paper, though this varies by model), each with its own learned projections. Because each head has its own projections, each is free to attend to a different type of relationship. One head might end up specializing in syntactic dependencies, another in semantic similarity, another in positional relationships, though nothing forces such a clean division. The outputs get concatenated and passed through a final learned linear projection, which is where the model learns how to combine what the heads found.
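
To make the splitting and concatenating concrete, here's a rough NumPy sketch. The sizes (model width 512, eight heads, so 64 dimensions per head) match the base configuration in the original paper; everything else is my assumption, including the function name, the random untrained weights, and the six-token input.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over one sequence (illustrative sketch).

    x:                  (seq_len, d_model)
    w_q, w_k, w_v, w_o: (d_model, d_model)
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own (seq_len, seq_len) weight matrix.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads side by side, then apply the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Assumed toy setup: 6 tokens, random untrained weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 512, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, head_weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)           # (6, 512)
print(head_weights.shape)  # (8, 6, 6)
```

The second shape is the interesting one: eight separate 6-by-6 attention patterns over the same sentence, one per head.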

Attention is learned

The part that took me the longest to really sit with: attention is learned, not programmed.

You don't tell the model "resolve coreference using these rules." The query, key, and value projections are weight matrices trained via backpropagation. The model learns, from data, which attention patterns are useful for minimizing the training loss. The trophy-"it" coreference resolution emerges not from an explicit rule but from weights that picked up a statistical regularity (nouns before pronouns are often their antecedents), because that regularity was useful for predicting the next token across billions of sentences.

This is what I find genuinely remarkable about self-attention: it's a completely general mechanism for contextual blending. The structure (query/key/value, softmax weighting, weighted sum) is fixed. The content of what gets attended to, what the model considers relevant, is entirely learned.

Memory, in humans, is associative. You don't retrieve memories by address; you retrieve them by similarity to a cue. Something in the present moment activates something in the past based on how compatible they are. Attention is weighted memory in exactly this sense: the query is the cue, the keys are the stored representations, the softmax over dot products is the associative retrieval, and the weighted sum of values is the blended memory you get back.
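
One way I convinced myself of the retrieval framing is a toy "soft lookup". This is a sketch with made-up data: three stored memories with hand-picked keys and values, a cue that mostly resembles the first key, and a hypothetical sharpness parameter (my name, not a standard one) that multiplies the scores before the softmax. At low sharpness you get a blend of memories; at high sharpness the same code behaves almost like an exact lookup.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_lookup(query, keys, values, sharpness=1.0):
    """Retrieve a blend of values, weighted by how well each key matches the query."""
    scores = (keys @ query) * sharpness  # dot-product similarity to each stored key
    weights = softmax(scores)            # associative retrieval: similarity -> weights
    return weights @ values, weights     # blended memory, plus the weights used

# Three stored memories (made-up numbers). Keys are orthogonal for clarity.
keys = np.eye(3)
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])

# A cue that mostly resembles the first key.
cue = np.array([0.9, 0.1, 0.0])

blended, w_soft = soft_lookup(cue, keys, values, sharpness=1.0)
sharp, w_sharp = soft_lookup(cue, keys, values, sharpness=50.0)

print(w_soft)   # the first memory gets the largest weight, but the others still contribute
print(w_sharp)  # almost all of the weight lands on the first memory
print(sharp)    # very close to the first stored value, [10, 0]
```

Nothing here is looked up by address. The cue never names "memory 0"; it just happens to be most similar to that key.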

The math is different from how neurons actually work. The intuition, that context is built by selectively blending the past weighted by relevance, is as old as cognition itself.

I've read the Attention paper three times now. It keeps getting clearer and stranger simultaneously.