Training & Fine-tuning

How models learn - 6 chapters

From pre-training on trillions of tokens to RLHF, GRPO, DPO, and evaluation - how raw neural networks become helpful assistants.

01
Learning from the internet

Pre-training

Pre-training is where a model learns language itself. Fed trillions of tokens from books, websites, and code, the model learns to predict the next token - and in doing so, acquires grammar, facts, reasoning patterns, and world knowledge.
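
As a rough illustration of the next-token objective described above, the sketch below computes the average cross-entropy loss over a short sequence. The four-token vocabulary, the token IDs, and the hand-written logits are hypothetical stand-ins for a real model's output, not values from any actual training run.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] stands in for the model's scores over the
    vocabulary after reading token_ids[: t + 1]; the target at that
    position is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical sequence over a 4-token vocabulary.
token_ids = [0, 2, 1]

# An untrained model with no preference assigns equal scores everywhere.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
loss_uniform = next_token_loss(uniform_logits, token_ids)

# A better model puts most of its score on the token that actually follows.
confident_logits = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss_confident = next_token_loss(confident_logits, token_ids)
```

With uniform scores the loss equals the natural log of the vocabulary size; training pushes it down by shifting probability toward the tokens that actually occur next.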

Training a frontier model costs tens of millions of dollars or more in compute. The data pipeline is critical: deduplication, quality filtering, toxicity removal, and domain mixing. Scaling laws (Chinchilla) estimate the compute-optimal ratio of training tokens to parameters, roughly 20 tokens per parameter. In practice, models are often trained well beyond that ratio: a 70B model might train on 15 trillion tokens across thousands of GPUs for months.
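
To make the parameter-to-token ratio concrete, the sketch below assumes the commonly quoted approximation of about 20 training tokens per parameter for a compute-optimal run. The constant and the helper names are illustrative assumptions, not exact values from the Chinchilla paper.

```python
# Assumption: roughly 20 tokens per parameter is compute-optimal.
# This is a rule-of-thumb reading of the Chinchilla result, not an exact law.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal token budget for a given model size."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params, n_tokens):
    """Actual ratio of training tokens to parameters for a run."""
    return n_tokens / n_params

n_params = 70e9  # the 70B example from the text
optimal_tokens = chinchilla_optimal_tokens(n_params)
actual_ratio = tokens_per_param(n_params, 15e12)  # 15 trillion tokens

print(f"compute-optimal budget: {optimal_tokens:.2e} tokens")
print(f"actual ratio at 15T tokens: {actual_ratio:.0f} tokens per parameter")
```

Under this assumption, the compute-optimal budget for a 70B model is about 1.4 trillion tokens, so a 15-trillion-token run is roughly ten times that. Training past the compute-optimal point costs more up front but yields a stronger model at a fixed size.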

Training data streams: Web, Books, Code, Academic → Model

Data mix
  • Web: 60%
  • Books: 15%
  • Code: 15%
  • Academic: 10%
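
One simple way to realize a data mix like the one above is to pick the source of each training document at random, weighted by its share. The sketch below is a minimal illustration; the dictionary, the function name, and the fixed seed are assumptions, and real pipelines typically mix at the token or shard level.

```python
import random

# Mixture weights taken from the data mix shown above.
DATA_MIX = {"web": 0.60, "books": 0.15, "code": 0.15, "academic": 0.10}

def sample_sources(n, seed=0):
    """Draw n source labels with probability proportional to DATA_MIX."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    names = list(DATA_MIX)
    weights = list(DATA_MIX.values())
    return rng.choices(names, weights=weights, k=n)

batch = sample_sources(10_000)

# Empirical share of each source in the sampled batch.
observed = {name: batch.count(name) / len(batch) for name in DATA_MIX}
print(observed)
```

Over a large enough sample, the observed shares approach the configured weights.
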
Training monitor (step 0 / 100): train loss 5.300, LR scheduler 0.000, grad norm 0.730
Training stages: Random init → Grammar emerges → Facts retained → Reasoning improves → Converged
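
The LR scheduler reading of 0.000 at step 0 is consistent with a warmup schedule, and the grad norm readout is the quantity that gradient clipping acts on. The sketch below shows one common combination, linear warmup followed by cosine decay plus clipping by global norm. The page does not say which schedule it visualizes, so the schedule shape, `warmup_steps`, `peak_lr`, `min_lr`, and `max_norm` are all hypothetical.

```python
import math

def lr_at_step(step, total_steps=100, warmup_steps=10,
               peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr.

    All hyperparameter defaults are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

schedule = [lr_at_step(s) for s in range(101)]
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Warmup keeps early updates small while the randomly initialized weights settle, and clipping limits the damage from occasional gradient spikes.
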
Compute scaling (model size: training tokens, approximate cost)
  • 1B: 20B tokens, $5K
  • 7B: 1T tokens, $100K
  • 70B: 2T tokens, $3M
  • 400B: 15T tokens, $100M+
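
To compare the compute scaling rows on a common footing, the sketch below uses the widely used approximation that training a dense transformer takes about 6 × N × D floating-point operations, for N parameters and D training tokens. The formula is an outside rule of thumb rather than something stated on this page, and the sketch makes no attempt to reproduce the dollar figures, which depend on hardware prices and utilization.

```python
# (label, parameters N, training tokens D) from the compute scaling rows above.
ROWS = [
    ("1B", 1e9, 20e9),
    ("7B", 7e9, 1e12),
    ("70B", 70e9, 2e12),
    ("400B", 400e9, 15e12),
]

def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    """How many training tokens each parameter sees."""
    return n_tokens / n_params

for label, n, d in ROWS:
    flops = training_flops(n, d)
    ratio = tokens_per_param(n, d)
    print(f"{label}: {flops:.2e} FLOPs, {ratio:.1f} tokens per parameter")
```

Because compute grows with the product of model size and token count, each step down the list is far more than a proportional increase in cost.
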

Now try it yourself

forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.

Built with

  • Next.js + TypeScript
  • Framer Motion
  • Tailwind CSS
  • js-tiktoken
Everything runs in your browser - no data is sent to any server.