Architecture Explorer
Watch data flow through real architectures. Click any layer to pause the flow and see internal operations animate step by step.
GPT-2
124M parameters - 12 layers - multi-head attention (MHA) - GELU activation - learned absolute position embeddings
Transformer block ×12
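- Pre-norm: LayerNorm is applied before each attention and FFN sub-layer, with one extra LayerNorm after the last block (GPT-1 and the original transformer use post-norm)
- Learned position embeddings (not RoPE)
- Weight tying between input embeddings and output head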
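The 124M figure can be checked with a rough parameter count. The sketch below is not taken from this page. It assumes the dimensions of the released GPT-2 small checkpoint (width 768, vocabulary 50257, context length 1024, FFN width 4x the model width, biases on all linear layers, two LayerNorms per block plus a final one). The function name `gpt2_param_count` and its arguments are hypothetical. It shows how the learned position table adds parameters of its own, and why the tied output head adds none.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab=50257, n_ctx=1024,
                     tie_weights=True):
    """Rough parameter count for a GPT-2 style decoder.

    Default sizes are assumptions based on the released GPT-2 small
    configuration; they are not stated on the page itself.
    """
    wte = vocab * d_model   # token embedding table
    wpe = n_ctx * d_model   # learned absolute position table

    # Attention: fused Q/K/V projection plus output projection, with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # FFN: expand to 4 * d_model, apply GELU, project back, with biases.
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    block = attn + ffn + norms

    final_norm = 2 * d_model
    # With weight tying the output head reuses wte, so it adds nothing.
    head = 0 if tie_weights else vocab * d_model

    return wte + wpe + n_layer * block + final_norm + head


tied = gpt2_param_count()
untied = gpt2_param_count(tie_weights=False)
print(f"tied:   {tied:,}")    # tied:   124,439,808
print(f"untied: {untied:,}")  # untied: 163,037,184
```

Under these assumptions, the token embedding table alone accounts for about 38.6M of the 124M parameters, so an untied output head would increase the model size by roughly 31%.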