The Attention Mechanism

From dot products to hybrid architectures — 8 chapters

A complete deep dive into attention: why it exists, how it works mathematically, and every modern variant from RoPE and GQA to MLA, sparse attention, and hybrid linear architectures.

The problem attention solves

Why Attention?

Before transformers, the dominant sequence models were recurrent and processed tokens one at a time. RNNs accumulated context into a fixed-size hidden state, a bottleneck that cannot hold everything in a long sequence. Attention removes this bottleneck: within a single layer, every token looks directly at every other token.

An RNN needs O(N) sequential operations to process N tokens, which prevents parallelism across the sequence. Gradients must also flow back through every one of those steps, so they tend to vanish over long distances. The 'information bottleneck' means that a model processing 512 tokens must compress all of that context into one fixed-size vector before generating. Attention has an O(1) maximum path length between any two positions, which makes long-range dependencies far easier to learn.

RNN path length: O(N) (long-range)

Attention path length: O(1) (any distance)

Attention compute: O(N²) (per layer)

RNN — sequential, one token at a time

The → cat → sat → on → the → mat

hidden state h_t (fixed-size bottleneck)

Token 0 ("The") must survive 5 compression steps to reach the output.
Vanishing gradients. Information lost at distance.
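To make the sequential bottleneck concrete, here is a minimal sketch of a toy tanh RNN in plain Python. The function name `rnn_encode`, the random untrained weights, and the one-hot vectors standing in for embeddings are all assumptions made for illustration; the point is only that each step depends on the previous hidden state and that everything is squeezed into one fixed-size vector.

```python
import math
import random

def rnn_encode(token_vectors, hidden_size, seed=0):
    """Run a toy tanh RNN over a sequence; return the final hidden state and step count."""
    rng = random.Random(seed)
    input_size = len(token_vectors[0])
    # Hypothetical, randomly initialised weights (untrained, illustration only).
    w_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]

    h = [0.0] * hidden_size  # fixed-size hidden state, regardless of sequence length
    steps = 0
    for x in token_vectors:  # strictly sequential: step t needs h from step t-1
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_xh[i], x))
                + sum(w * hj for w, hj in zip(w_hh[i], h))
            )
            for i in range(hidden_size)
        ]
        steps += 1
    return h, steps

tokens = ["The", "cat", "sat", "on", "the", "mat"]
# One-hot vectors stand in for learned embeddings.
embeddings = [[1.0 if i == j else 0.0 for j in range(len(tokens))] for i in range(len(tokens))]
final_h, steps = rnn_encode(embeddings, hidden_size=4)
print(steps, len(final_h))
```

The loop runs once per token and cannot be parallelised across positions. After "The" is read in the first step, its contribution is rewritten by each of the five later updates, and the result always has `hidden_size` entries no matter how long the input is.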

Attention — all tokens in parallel, O(1) path length

[Figure: 6 × 6 attention grid with the tokens The, cat, sat, on, the, mat as both rows and columns]

Every token directly attends to every other in one step.
O(1) path length. Full parallelism. No bottleneck.
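The grid described above can be sketched as a small self-attention computation in plain Python. This is a simplified illustration, not a full transformer layer: it assumes queries, keys, and values are all the raw input vectors (no learned projections), uses one-hot vectors in place of embeddings, and applies the usual scaling by the square root of the vector dimension. The helper names `softmax` and `self_attention` are chosen for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    # N x N score matrix: every token is compared with every other token.
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(weight_row, x)) for j in range(d)]
        for weight_row in weights
    ]
    return weights, outputs

tokens = ["The", "cat", "sat", "on", "the", "mat"]
# One-hot vectors stand in for learned embeddings.
x = [[1.0 if i == j else 0.0 for j in range(len(tokens))] for i in range(len(tokens))]
weights, outputs = self_attention(x)
print(len(weights), len(weights[0]))
```

Every row of `weights` is a probability distribution over all six tokens, so each output mixes information from every position in a single step. None of the rows depends on another row's result, which is why the whole matrix can be computed in parallel.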

The fundamental tradeoff

✗ RNNs: O(N) sequential steps, vanishing gradients, information bottleneck at long range

✓ Attention: O(1) path length between any two tokens, fully parallel, direct gradient flow

△ Tradeoff: attention costs O(N²) compute and memory in the sequence length; this is mitigated, not eliminated, by FlashAttention (which avoids materializing the full score matrix), sparse variants, and KV cache optimizations
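A short back-of-the-envelope calculation shows how quickly the quadratic term in the tradeoff above grows. The sketch assumes a naive implementation that stores the full N × N score matrix, for a single head in a single layer, at 2 bytes per entry (16-bit floats). The function names and the chosen sequence lengths are illustrative assumptions.

```python
def attention_score_entries(n_tokens):
    """Number of entries in the N x N attention score matrix."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Memory for one full score matrix (assumed: one head, one layer, 16-bit floats)."""
    return attention_score_entries(n_tokens) * bytes_per_entry

for n in (512, 4096, 32768):
    entries = attention_score_entries(n)
    mib = score_matrix_bytes(n) / (1024 ** 2)
    print(f"N={n}: {entries} entries, {mib:.1f} MiB")
```

Doubling the sequence length quadruples the number of entries. Under these assumptions a 32768-token sequence needs 2 GiB for a single score matrix, which is why long-context systems avoid materializing it.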

