Architecture Explorer

Watch data flow through real architectures. Click any layer to pause the flow and see internal operations animate step by step.

GPT-2

124M parameters - 12 layers - MHA - GELU - Learned absolute positions

[Diagram: transformer block with residual connections, repeated x12]

-Pre-norm (LayerNorm applied before each sub-layer, unlike the original transformer's post-norm)
-Learned position embeddings (not RoPE)
-Weight tying between input embeddings and output head
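
To make the 124M figure and the effect of weight tying concrete, here is a rough parameter count in Python. It is a sketch, not the explorer's own code: the vocabulary size (50257), context length (1024), model width (768), and 4x FFN width are assumptions taken from the published GPT-2 small configuration, and the helper names are hypothetical.

```python
# Rough parameter count for GPT-2 small (a sketch under stated assumptions).
VOCAB = 50257          # assumption: GPT-2 BPE vocabulary size
N_CTX = 1024           # assumption: number of learned position embeddings
D_MODEL = 768          # assumption: hidden width of GPT-2 small
N_LAYER = 12           # from the text: 12 layers
D_FF = 4 * D_MODEL     # assumption: FFN inner width is 4x the model width


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * d


def block_params(d, d_ff):
    """One transformer block: MHA, FFN, and two LayerNorms."""
    attn = linear(d, 3 * d) + linear(d, d)      # fused Q/K/V + output projection
    ffn = linear(d, d_ff) + linear(d_ff, d)     # expand, GELU, project back
    return attn + ffn + 2 * layer_norm(d)


def gpt2_params(tied=True):
    """Total parameters, with or without a separate output head."""
    token_emb = VOCAB * D_MODEL
    pos_emb = N_CTX * D_MODEL                   # learned absolute positions
    blocks = N_LAYER * block_params(D_MODEL, D_FF)
    final_ln = layer_norm(D_MODEL)
    head = 0 if tied else VOCAB * D_MODEL       # tied head reuses token_emb
    return token_emb + pos_emb + blocks + final_ln + head


if __name__ == "__main__":
    print(f"tied:   {gpt2_params(tied=True):,}")
    print(f"untied: {gpt2_params(tied=False):,}")
```

Under these assumptions the tied count is 124,439,808, which matches the 124M label. A separate, untied output head would add another 38,597,376 parameters, so tying saves roughly a quarter of the model's size at this scale.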