Tokenizer Training

Building the vocabulary - 6 chapters

How BPE builds a vocabulary from raw bytes. Merge rules, byte-level encoding, vocabulary design, and why tokenizer choices shape everything downstream.

01 The vocabulary problem

Why Build a Tokenizer?

A language model can't process raw text - it needs a fixed vocabulary of discrete tokens. The tokenizer is trained BEFORE the model, and its choices permanently shape everything downstream: what the model can represent efficiently, how long sequences become, and which languages work well.

A bad tokenizer can make your model waste capacity. If a concept is one token in English but takes 6 tokens in another language, the model needs 6x as many positions for the same meaning, and because self-attention compares every position with every other, the attention computation grows faster than the token count. GPT-2's tokenizer was trained mostly on English web text, which is why it tokenizes non-English text and code inefficiently. LLaMA 3 and Qwen 2.5 were built with multilingual data and much larger vocabularies (roughly 128K and 152K tokens, respectively), so they handle many languages well.
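The 6x arithmetic can be made concrete with a rough sketch. It assumes standard full self-attention, where the number of pairwise attention scores grows with the square of the sequence length, and it ignores constant factors and the parts of the model whose cost is linear in length. `relative_cost` is a hypothetical helper written for this illustration, not part of any library.

```python
def relative_cost(tokens_a: int, tokens_b: int) -> dict:
    """Compare sequence b against sequence a.

    Assumption: full self-attention, so pairwise scores scale with n^2.
    """
    position_ratio = tokens_b / tokens_a
    return {
        "positions": position_ratio,             # linear in token count
        "attention_pairs": position_ratio ** 2,  # quadratic in token count
    }


# One token in English vs. six tokens for the same concept elsewhere.
print(relative_cost(1, 6))
# {'positions': 6.0, 'attention_pairs': 36.0}
```

Under that assumption, six times the positions means about 36 times the pairwise attention scores for the same concept.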

The Vocabulary Problem

Same meaning ("The cat sat on the mat") across languages, with token counts:

  • English: The cat sat on the mat (6 tokens)
  • Spanish: El gato se sentó en la alfombra (10 tokens)
  • Japanese: 猫がマットの上に座った (18 tokens)
  • Arabic: جلست القطة على الحصيرة (22 tokens)
  • Korean: 고양이가 매트 위에 앉았다 (16 tokens)
  • Chinese: 猫坐在垫子上 (12 tokens)

Token Count Comparison: the English sentence takes 6 tokens, while the Japanese sentence with the same meaning takes 18, three times as many.

More tokens = more compute = slower inference. GPT-2's English-biased vocabulary fragments non-Latin scripts into many small tokens.
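One way to see where the fragmentation comes from: a byte-level BPE vocabulary starts from the 256 possible byte values and only shortens a sequence where it has learned merges. The sketch below uses only the Python standard library and counts UTF-8 bytes per sentence, which is the token count before any merges apply. It does not run a real tokenizer, so it will not reproduce the token counts above. `SENTENCES` and `utf8_profile` are names made up for this illustration.

```python
# Sentences copied from the comparison above.
SENTENCES = {
    "English": "The cat sat on the mat",
    "Spanish": "El gato se sentó en la alfombra",
    "Japanese": "猫がマットの上に座った",
    "Arabic": "جلست القطة على الحصيرة",
    "Korean": "고양이가 매트 위에 앉았다",
    "Chinese": "猫坐在垫子上",
}


def utf8_profile(text: str) -> tuple[int, int, float]:
    """Return (characters, UTF-8 bytes, bytes per character) for text."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars


for language, sentence in SENTENCES.items():
    n_chars, n_bytes, ratio = utf8_profile(sentence)
    print(f"{language:<9} chars={n_chars:>3} bytes={n_bytes:>3} bytes/char={ratio:.2f}")
```

The English sentence is 1 byte per character, while the Chinese characters here take 3 bytes each. A vocabulary that learned few merges for those byte sequences therefore starts from a much longer sequence and shortens it less.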

forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.
