Embeddings

From lookup tables to semantic search — 7 chapters

How token IDs become vectors, why similar words cluster in space, how TF-IDF, Word2Vec, and transformer models learn representations, and how sentence embeddings power semantic search and RAG.

From token ID to dense vector

The Lookup Table

Every token in the vocabulary maps to one row of the embedding matrix. When the model sees token ID 5432, it reads row 5432: a dense vector of d_model floats, typically 768 to 8192. The lookup is a plain index operation, with no multiplication and no activation.
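To make the "index operation" point concrete, here is a minimal pure-Python sketch. The table values, the 4 × 3 size, and the helper names lookup and one_hot_matmul are invented for illustration; a real table is far larger and holds learned weights. It compares a direct row read with the equivalent one-hot multiplication:

```python
# Toy embedding table: 4 tokens, 3 dimensions (values are made up).
embedding_table = [
    [0.1, 0.2, 0.3],   # token ID 0
    [0.4, 0.5, 0.6],   # token ID 1
    [0.7, 0.8, 0.9],   # token ID 2
    [1.0, 1.1, 1.2],   # token ID 3
]

def lookup(table, token_id):
    """Return the row for token_id: a plain index, no arithmetic."""
    return table[token_id]

def one_hot_matmul(table, token_id):
    """Same result the slow way: a one-hot vector times the whole matrix."""
    vocab_size, d_model = len(table), len(table[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * table[i][j] for i in range(vocab_size))
        for j in range(d_model)
    ]

vector = lookup(embedding_table, 2)
same_vector = one_hot_matmul(embedding_table, 2)
print(vector)                 # [0.7, 0.8, 0.9]
print(vector == same_vector)  # True
```

The one-hot version touches every row of the table to produce one vector, which is why frameworks implement the embedding step as an index instead.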

The embedding matrix has shape [vocab_size × d_model]. For GPT-2 small: 50,257 × 768 ≈ 38.6M parameters in this one table. In PyTorch this is torch.nn.Embedding(vocab_size, d_model). Because it acts on integer indices rather than one-hot vectors, it is memory efficient: mathematically a bias-free linear layer applied to a one-hot input, but without the matmul. d_model ranges from 768 (GPT-2 small) to 8192 (LLaMA 3 70B). The weights start random and are learned end-to-end.
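A short PyTorch sketch of the same idea, assuming PyTorch is installed. The toy sizes (10 × 4) and the token IDs are arbitrary choices for illustration; only the GPT-2 dimensions come from the text above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random initial weights

vocab_size, d_model = 10, 4                 # toy sizes, not a real model
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 5, 5, 9]])    # shape [batch=1, seq_len=4]
vectors = embedding(token_ids)              # shape [1, 4, d_model]

# The layer's output is exactly the indexed rows of its weight matrix.
assert torch.equal(vectors[0, 0], embedding.weight[1])
assert torch.equal(vectors[0, 1], vectors[0, 2])  # same ID, same vector

num_params = embedding.weight.numel()       # vocab_size * d_model = 40
gpt2_params = 50_257 * 768                  # 38,597,376, about 38.6M
print(vectors.shape, num_params, gpt2_params)
```

Note that the input is a tensor of integer IDs, not floats, and that a repeated token ID always yields the same vector at this stage.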

Embedding matrix size

vocab size: 50,257

d_model: 768

parameters: 38.6M

The embedding table is one of the largest single weight matrices in the model, and its only job is a lookup.

Sample vocabulary entries (ID to token):

ID     token
0      [PAD]
1      [BOS]
2      [EOS]
5432   cat
5433   dog
5434   bank
5435   river
5436   money

The embedding layer is just a weight matrix used as a lookup table. The weights are learned through backpropagation — the model discovers which directions in this space encode useful features.
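As a rough illustration of how backpropagation reaches the table, the PyTorch sketch below runs one backward pass. The sizes, the token IDs, the learning rate of 0.1, and the summed-output "loss" are placeholders chosen for simplicity, not a real training objective. It shows that only the rows that were looked up receive a gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(8, 4)        # toy table: 8 tokens, 4 dimensions
token_ids = torch.tensor([2, 5])      # only these two rows are read

# Placeholder loss: the sum of the looked-up vectors.
loss = embedding(token_ids).sum()
loss.backward()

grad = embedding.weight.grad          # same shape as the table: [8, 4]
touched_rows = grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(touched_rows)                   # [2, 5]

# One manual gradient-descent step changes only those rows.
before = embedding.weight.detach().clone()
with torch.no_grad():
    embedding.weight -= 0.1 * grad
changed_rows = (embedding.weight != before).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)                   # [2, 5]
```

In real training the loss comes from the model's prediction error, so the direction each row moves depends on how that token was used in context.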

