Embeddings
From lookup tables to semantic search — 7 chapters
How token IDs become vectors, why similar words cluster in space, how TF-IDF, Word2Vec, and transformer models learn representations, and how sentence embeddings power semantic search and RAG.
The Lookup Table
Every token in the vocabulary maps to a row in the embedding matrix. When the model sees token ID 5432, it reads that row — a dense vector of 768 to 8192 floats. This lookup is just an index operation: no multiplication, no activation.
The embedding matrix has shape [vocab_size × d_model]. For GPT-2: 50,257 × 768 = 38.6M parameters in this one table. In PyTorch: torch.nn.Embedding(vocab_size, d_model). Acting on indices rather than one-hot vectors makes it memory efficient — effectively a linear layer without the matmul. d_model ranges from 768 (GPT-2) to 8192 (LLaMA 3 70B). Weights start random and are learned end-to-end.
The embedding table is one of the largest weight matrices in the model — just for one lookup.
The embedding layer is just a weight matrix used as a lookup table. The weights are learned through backpropagation — the model discovers which directions in this space encode useful features.