The Inference Engine

Serving at scale - 11 chapters

Explore the infrastructure layer that makes LLMs fast and efficient: batching, KV caching, PagedAttention, FlashAttention, speculative decoding, and production serving.

01
Two very different phases

Prefill vs Decode

Inference has two distinct phases. Prefill processes the entire prompt in parallel: it is compute-bound and fast per token. Decode generates tokens one at a time: it is memory-bandwidth-bound and slow per token. Everything else in this journey is an attempt to work around this fundamental constraint.

During prefill, all prompt tokens are processed simultaneously through the attention and FFN layers, so the GPU's compute cores are heavily utilized. During decode, each new token requires reading the model weights and the entire KV cache from memory but does only a tiny amount of computation, so the GPU spends most of its time waiting for memory reads. This is why decode throughput, measured in tokens/second, is limited by memory bandwidth (GB/s), not FLOPS. Some systems (like Splitwise and DistServe) run prefill and decode on separate hardware optimized for each phase.
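The compute-bound versus memory-bound split can be sketched with a rough roofline-style estimate. This is a minimal sketch, not a profiler: it assumes about 2 FLOPs per parameter per token, assumes the weights are read from memory once per forward pass, and ignores attention FLOPs and KV cache traffic. The function names and the accelerator figures (300 TFLOP/s, 2 TB/s) are hypothetical, chosen only for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per pass.
    Attention FLOPs and KV cache reads are ignored in this sketch.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
             bytes_per_param: float = 2.0) -> str:
    """Classify a forward pass by comparing its intensity to the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can read
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory-bandwidth"


# Hypothetical accelerator, not a specific product.
PEAK_FLOPS = 300e12   # 300 TFLOP/s
MEM_BW = 2e12         # 2 TB/s

print("prefill (512 tokens):", bound_by(512, PEAK_FLOPS, MEM_BW))
print("decode (1 token):   ", bound_by(1, PEAK_FLOPS, MEM_BW))
```

Under these assumptions the ridge point is 150 FLOPs per byte. A 512-token prefill sits well above it, while a single-token decode step with 16-bit weights does about 1 FLOP per byte and sits far below it.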

Example (prompt length: 512 tokens)

Prefill: all 512 tokens at once. GPU utilization 85%, memory bandwidth utilization 30%, time 10 ms. Compute-bound.
Decode: one by one, 1 token per forward pass. GPU utilization 15%, memory bandwidth utilization 90%, 25 ms per token. Memory-bandwidth-bound.

The GPU is a supercomputer that spends most of decode waiting for memory reads. This is why memory bandwidth (GB/s) matters more than FLOPS for token generation, and why batching multiple requests together helps - each read of the model weights is amortized across every request in the batch.
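The amortization effect of batching can be illustrated with a simple cost model. This is a sketch under stated assumptions: one decode step takes the larger of the weight-read time and the compute time, compute costs about 2 FLOPs per parameter per sequence, and per-request KV cache reads are ignored (they grow with batch size and would lower the real gains). The 7B-parameter model and the hardware figures are hypothetical.

```python
def decode_step_seconds(batch_size: int, n_params: float, peak_flops: float,
                        mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Estimated time for one decode step that emits one token per sequence."""
    weight_read = n_params * bytes_per_param / mem_bandwidth  # same for any batch size
    compute = 2.0 * n_params * batch_size / peak_flops        # grows with batch size
    return max(weight_read, compute)


def decode_tokens_per_second(batch_size: int, n_params: float, peak_flops: float,
                             mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Aggregate tokens/second across all sequences in the batch."""
    step = decode_step_seconds(batch_size, n_params, peak_flops,
                               mem_bandwidth, bytes_per_param)
    return batch_size / step


# Hypothetical setup: 7B parameters in 16-bit, 300 TFLOP/s, 2 TB/s.
N_PARAMS, PEAK_FLOPS, MEM_BW = 7e9, 300e12, 2e12

for batch in (1, 8, 150, 300):
    tps = decode_tokens_per_second(batch, N_PARAMS, PEAK_FLOPS, MEM_BW)
    print(f"batch={batch:>3}  ~{tps:,.0f} tokens/s")
```

In this model the step time stays flat while the weight read dominates, so throughput grows almost linearly with batch size. Once compute time overtakes the weight read (around batch 150 with these numbers), extra batching stops helping.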

forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.