Training & Fine-tuning

How models learn - 6 chapters

From pre-training on trillions of tokens to RLHF, GRPO, DPO, and evaluation - how raw neural networks become helpful assistants.

Learning from the internet

Pre-training

Pre-training is where a model learns language itself. Fed trillions of tokens from books, websites, and code, the model learns to predict the next token - and in doing so, acquires grammar, facts, reasoning patterns, and world knowledge.
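To make "predict the next token" concrete, here is a minimal sketch of the loss that pre-training minimizes: the average cross-entropy of the correct next token. The function name `next_token_loss` and the toy 4-token vocabulary are illustrative assumptions. Real training computes the same quantity over batches of tensors inside a deep-learning framework.

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy (in nats) of the correct next token.

    logits: list of per-position score lists, one score per vocabulary entry.
    targets: list of correct next-token ids, one per position.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the true next token.
        total += log_z - scores[target]
    return total / len(targets)

# Toy vocabulary of 4 tokens (hypothetical data).
# An untrained model that scores every token equally:
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
# A model that strongly prefers the correct token at each position:
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
targets = [0, 1, 2]

loss_untrained = next_token_loss(uniform, targets)  # equals ln(4) for uniform scores
loss_trained = next_token_loss(confident, targets)  # much closer to 0
```

A model that guesses uniformly pays a loss of ln(vocabulary size). Training pushes the loss down by shifting probability toward the tokens that actually follow.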

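Training a frontier model costs tens of millions of dollars or more in compute. The data pipeline is just as critical: raw text goes through deduplication, quality filtering, toxicity removal, and domain mixing before the model sees it. Scaling laws such as Chinchilla estimate the compute-optimal ratio of training tokens to parameters, roughly 20 tokens per parameter. In practice, models are often trained far past that ratio because a smaller, longer-trained model is cheaper to serve: a 70B model might train on 15 trillion tokens across thousands of GPUs for months.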
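The pipeline stages named above can be sketched in a few lines. This is a deliberately simplified illustration, not a production recipe: `clean_corpus`, the `min_words` threshold, and the word blocklist are hypothetical stand-ins. Real pipelines typically add near-duplicate detection and learned quality or toxicity classifiers rather than exact hashing and keyword matching.

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def clean_corpus(docs, min_words=5, blocklist=frozenset()):
    """Keep documents that pass a length filter, a blocklist, and exact dedup."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        words = norm.split()
        # Quality filter (assumed heuristic): drop very short documents.
        if len(words) < min_words:
            continue
        # Toxicity filter (assumed heuristic): drop documents with blocked words.
        if any(w in blocklist for w in words):
            continue
        # Deduplication: skip documents whose normalized text was already seen.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",   # duplicate after normalization
    "Click here",                        # too short
    "This sentence contains a badword somewhere inside.",
    "Gradient descent updates parameters to reduce the loss.",
]
cleaned = clean_corpus(docs, blocklist=frozenset({"badword"}))
```

Here `cleaned` keeps only the first and last documents. The order of the checks is a design choice: cheap filters run before hashing so that discarded text never enters the dedup set.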

Training data streams

[Figure: web, books, code, and academic text stream into the model. The data-mix panel shows shares of 60%, 15%, and 10% across these sources.]
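Domain mixing comes down to deciding how often each source is sampled during training. The sketch below draws a stream of source labels in proportion to a mix. The weights are illustrative assumptions, not the values from the figure, and `sample_sources` is a hypothetical helper name.

```python
import random
from collections import Counter

def sample_sources(mix, n, seed=0):
    """Draw n source labels with probability proportional to each mix weight."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# Hypothetical mix; real ratios are tuned empirically per model.
mix = {"web": 0.5, "books": 0.2, "code": 0.2, "academic": 0.1}

draws = sample_sources(mix, n=10_000)
counts = Counter(draws)
web_fraction = counts["web"] / len(draws)  # close to 0.5 for a large sample
```

In a real data loader each drawn label would select which dataset the next training document comes from, so changing the weights changes what the model sees most often.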

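[Figure: training run from step 0 to step 100, tracking train loss (5.300 at step 0), learning-rate schedule (0.000 at step 0), and gradient norm (0.730 at step 0). Milestones along the run: random init, grammar emerges, facts retained, reasoning improves, converged.]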
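A learning rate that reads 0.000 at step 0 is consistent with a warmup schedule, where the rate ramps up from zero before decaying. The source does not say which schedule the figure uses, so the following is only one common choice, linear warmup followed by cosine decay, with hypothetical values for `peak_lr`, `warmup_steps`, and `min_lr`.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=10, min_lr=3e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly so early, noisy gradients take small steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Learning rate at every step of a 100-step run.
schedule = [lr_at_step(s, 100) for s in range(101)]
```

The schedule starts at zero, peaks at the end of warmup, and ends at `min_lr`.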

Compute scaling

[Figure: training cost by scale. Points shown: 20B tokens, $5K; 1T tokens, $100K; 70B parameters, 2T tokens, $3M; 400B parameters, 15T tokens, $100M+.]
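A rough way to reason about these costs is the widely used approximation that training takes about 6 FLOPs per parameter per token, combined with the Chinchilla rule of thumb of roughly 20 tokens per parameter. The sketch below applies both. The hardware numbers (peak throughput, utilization, price per GPU-hour) are assumptions chosen for illustration, not figures from the source.

```python
def chinchilla_tokens(params, tokens_per_param=20.0):
    """Approximate compute-optimal token count (rule of thumb)."""
    return params * tokens_per_param

def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

def training_cost_usd(params, tokens,
                      peak_flops_per_gpu=4e14,  # assumed peak throughput
                      utilization=0.4,          # assumed fraction of peak achieved
                      usd_per_gpu_hour=2.0):    # assumed rental price
    """Back-of-envelope cost estimate under the stated hardware assumptions."""
    gpu_seconds = training_flops(params, tokens) / (peak_flops_per_gpu * utilization)
    return gpu_seconds / 3600.0 * usd_per_gpu_hour

optimal_70b = chinchilla_tokens(70e9)        # about 1.4 trillion tokens
cost_70b_2t = training_cost_usd(70e9, 2e12)  # 70B parameters, 2T tokens
cost_70b_15t = training_cost_usd(70e9, 15e12)  # same model, 15T tokens
```

With these assumed values the estimate for 70B parameters and 2T tokens lands near $3M, and 15T tokens pushes the same model into the tens of millions. The figure's own assumptions are not stated, so treat the match as indicative only.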

Continue learning

Inside the Transformer: the attention mechanism and generation pipeline that training optimizes.

Modern Techniques: chain-of-thought and reasoning, extensions of RLHF at inference time.