Where every FLOP, HBM byte, and VRAM byte lives.
Five sections, ~17 lesson cards. Each card answers "what does this operator cost, and why?" against the scenario above.
- Section 1
Vocabulary
The model's text↔vector edge. Token IDs, embedding lookups, and the LM head — every operator that interacts with the vocab dimension.
- Section 2
Norm
RMSNorm and LayerNorm: tiny per-token rescalings that fire twice per layer, N layers deep. Cheap on their own, cumulative across the stack.
- Section 3
Attention
Q/K/V projection, the Q@Kᵀ quadratic, softmax, attn@V (the KV-cache reader), and output projection. Where context length pays its bill.
- Section 4
FFN
Where the parameters live. Gate / up / down projections own ~70% of weight bytes in a dense block (checked in the sketch after this list); MoE makes it conditional.
- Section 5
Stack × autoregress
Why decode reads everything every step. The KV cache as cumulative state; the autoregressive loop as the cost engine.
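
A quick check on Section 4's weight-byte split. The dimensions below are Llama-2-7B-like, an illustrative assumption rather than the lessons' fixed scenario; GQA would tilt the split further toward the FFN:

```python
# Where the weight bytes of one dense transformer block go.
# Assumed dims: Llama-2-7B-like (d_model=4096, d_ff=11008,
# square Q/K/V/O projections).
d_model, d_ff = 4096, 11008

attn_params = 4 * d_model * d_model   # Q, K, V, O projections
ffn_params = 3 * d_model * d_ff       # gate, up, down projections

share = ffn_params / (attn_params + ffn_params)
print(f"attention: {attn_params/1e6:.0f}M params/block")  # ~67M
print(f"FFN:       {ffn_params/1e6:.0f}M params/block")   # ~135M
print(f"FFN share: {share:.0%}")                          # ~67%, i.e. roughly 70%
```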
Five questions, in order.
Each stage starts from one hook question and ends with a synthesis you can articulate. Walked in order, the five stages take you from "I think LLM costs are arbitrary" to "I can derive the price card from a hardware spec."
- Stage 1 · 12 min skim
Why does GPT-4 cost $10 per million output tokens but $2.50 per million input?
Output tokens cost more than input tokens at every LLM provider. The asymmetry isn't a markup — it's forced by how transformers run on a GPU. Derive the price from first principles (first sketch after this list).
- Stage 2 · 14 min skim
Why is a 70B model more than 8× slower per token than an 8B model?
Decode runs once per token and reads every weight from HBM each step. Compute hardly enters. Derive the per-token speed from a spec sheet (second sketch below).
- Stage 3 · 14 min skim
Why does an 80 GB GPU only serve 30 concurrent users?
Weights fit in a third of the VRAM. The rest fills with KV cache faster than anyone expects. Derive the capacity cliff that limits every long-context deployment (third sketch below).
- Stage 4 · 8 min skim
Which lever do you reach for?
An organised catalogue of inference levers — quantisation, attention variants, MoE, parallelism, batching, engines — grouped by what they actually move in the decode equation (fourth sketch below).
- Stage 5 · 20 min skim
Given my SLO and budget, what should I run?
Take a real product spec and walk it to a defensible deployment. The decode and capacity equations from earlier stages now do the work — every choice is one line of arithmetic, not a guess (fifth sketch below).
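
Sketch 1 · Stage 1's price asymmetry in rough numbers. Everything here (an H100-like spec, 40% sustained utilisation, a 70B fp16 model, batch size 1) is an illustrative assumption, not the course's exact scenario:

```python
# Why output tokens cost more than input tokens, to first order.
# Assumed spec, for illustration only: ~1e15 FLOP/s dense BF16 and
# ~3.35e12 B/s of HBM bandwidth (H100-like).
P = 70e9                   # parameters
flops = 1.0e15 * 0.4       # sustained FLOP/s at an assumed 40% utilisation
bw = 3.35e12               # HBM bytes/s
weight_bytes = P * 2       # fp16 weights

# Prefill: the whole prompt in one pass, compute-bound, ~2P FLOPs per token.
prefill_tps = flops / (2 * P)

# Decode: one token per pass, every weight byte re-read from HBM (batch 1).
decode_tps = bw / weight_bytes

print(f"prefill: ~{prefill_tps:,.0f} tok/s")              # ~2,857
print(f"decode:  ~{decode_tps:,.0f} tok/s")               # ~24
print(f"raw cost ratio: ~{prefill_tps/decode_tps:.0f}x")  # ~119x
# Batching amortises decode's weight reads across users, which is why the
# posted price gap lands nearer 4x than 100x.
```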
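Sketch 2 · Stage 2's per-token decode speed from a spec sheet, assuming an H100-like 3.35 TB/s of HBM bandwidth, fp16 weights, batch size 1, and ignoring KV-cache reads for the floor estimate:

```python
# Bandwidth-bound decode floor: one full weight read per generated token.
BW = 3.35e12  # HBM bytes/s (H100-like, assumed)

def decode_tps(params, bytes_per_param=2):
    return BW / (params * bytes_per_param)

tps_8b, tps_70b = decode_tps(8e9), decode_tps(70e9)
print(f" 8B: ~{tps_8b:.0f} tok/s")         # ~209
print(f"70B: ~{tps_70b:.0f} tok/s")        # ~24
print(f"ratio: {tps_8b/tps_70b:.2f}x")     # 8.75x from weight bytes alone;
# the 70B model's bigger KV cache pushes the real gap wider still.
```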
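Sketch 3 · Stage 3's capacity cliff. The model is a Llama-2-13B-like config in fp16, chosen only because its weights fill roughly a third of 80 GB; the context length and the decision to ignore activation memory are further assumptions:

```python
# How fast the KV cache eats an 80 GB GPU.
vram = 80e9
weights = 13e9 * 2                 # ~26 GB fp16 weights, ~a third of VRAM

n_layers, n_kv_heads, head_dim = 40, 40, 128              # Llama-2-13B-like
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16

context = 2048                     # tokens held per user (assumed)
kv_per_user = kv_per_token * context
free = vram - weights              # activations and overheads ignored

print(f"KV per token: {kv_per_token/1e6:.2f} MB")        # ~0.82 MB
print(f"KV per user:  {kv_per_user/1e9:.2f} GB")         # ~1.68 GB
print(f"concurrent users: {free/kv_per_user:.0f}")       # ~32
```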
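Sketch 4 · Stage 4's lever catalogue as data. The decode floor per token is roughly (weight bytes read + KV bytes read) / effective bandwidth, and each lever moves one term; the grouping below is this overview's rough framing, not the lesson's exact taxonomy:

```python
# Each inference lever, filed under the decode-equation term it moves.
levers = {
    "weight bytes read": [
        "weight quantisation (int8 / int4)",
        "MoE (only the active experts' weights are read)",
    ],
    "KV bytes read": [
        "attention variants (GQA / MQA / MLA)",
        "KV-cache quantisation",
        "sliding-window attention",
    ],
    "effective bandwidth": [
        "tensor parallelism (aggregate HBM across GPUs)",
    ],
    "amortisation of reads": [
        "continuous batching (one weight read serves the whole batch)",
        "speculative decoding (one read verifies several drafted tokens)",
        "serving engines (vLLM, TensorRT-LLM) that keep batches full",
    ],
}

for term, group in levers.items():
    print(f"{term}:")
    for lever in group:
        print(f"  - {lever}")
```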
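Sketch 5 · Stage 5's one-line-of-arithmetic style as a single function. Every number in the example call is hypothetical, and the model is deliberately crude: fp16 weights, batch-1 decode with no batching credit, one weight copy, activations ignored:

```python
import math

def gpus_needed(users, tps_per_user, context_tokens, params, kv_bytes_per_token,
                gpu_bw=3.35e12, gpu_vram=80e9):
    """Lower-bound GPU count from the two binding constraints."""
    weight_bytes = params * 2  # fp16
    # Bandwidth bound: at batch 1, every generated token re-reads the weights.
    bw_gpus = users * tps_per_user * weight_bytes / gpu_bw
    # Capacity bound: one weight copy plus every user's KV cache must fit.
    cap_gpus = (weight_bytes + users * context_tokens * kv_bytes_per_token) / gpu_vram
    return math.ceil(max(bw_gpus, cap_gpus))

# Hypothetical spec: 100 users at 20 tok/s, 4k context, 70B model, 0.8 MB KV/token.
print(gpus_needed(100, 20, 4096, 70e9, 0.8e6))  # 84, bandwidth-bound;
# continuous batching (Stage 4's amortisation lever) is what collapses that number.
```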