Where every FLOP, HBM byte, and VRAM byte lives.
Five sections, ~17 lesson cards. Each card answers "what does this operator cost, and why?" against the scenario above.
- Section 1
Vocabulary
The model's text↔vector edge. Token IDs, embedding lookups, and the LM head — every operator that interacts with the vocab dimension.
- Section 2
Norm
RMSNorm and LayerNorm: tiny per-token rescalings that fire twice per layer, N layers deep. Cheap on their own, cumulative across the stack.
- Section 3
Attention
Q/K/V projection, the Q@Kᵀ quadratic, softmax, attn@V (the KV-cache reader), and output projection. Where context length pays its bill.
- Section 4
FFN
Where the parameters live. Gate / up / down projections own ~70% of weight bytes in a dense block (checked in the sketch after this list); MoE makes it conditional.
- Section 5
Stack × autoregress
Why decode reads everything every step. The KV cache as cumulative state; the autoregressive loop as the cost engine.
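
A quick check on Section 4's weight-byte split. The dimensions below are Llama-2-7B-like, an illustrative assumption rather than the lessons' fixed scenario; GQA would tilt the split further toward the FFN:

```python
# Where the weight bytes of one dense transformer block go.
# Assumed dims: Llama-2-7B-like (d_model=4096, d_ff=11008,
# square Q/K/V/O projections).
d_model, d_ff = 4096, 11008

attn_params = 4 * d_model * d_model   # Q, K, V, O projections
ffn_params = 3 * d_model * d_ff       # gate, up, down projections

share = ffn_params / (attn_params + ffn_params)
print(f"attention: {attn_params/1e6:.0f}M params/block")  # ~67M
print(f"FFN:       {ffn_params/1e6:.0f}M params/block")   # ~135M
print(f"FFN share: {share:.0%}")                          # ~67%, i.e. roughly 70%
```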
Five questions, in order.
Each stage starts from one hook question and ends with a synthesis you can articulate. Walked in order, the five stages take you from "I think LLM costs are arbitrary" to "I can derive the price card from a hardware spec."
- Stage 1 · 12 min skim
Why does GPT-4 cost $10 per million output tokens but $2.50 per million input?
Output tokens cost more than input tokens at every LLM provider. The asymmetry isn't a markup — it's forced by how transformers run on a GPU. Derive the price from first principles (first sketch after this list).
- Stage 2 · 14 min skim
Why is a 70B model more than 8× slower per token than an 8B model?
Decode runs once per token and reads every weight from HBM each step. Compute hardly enters. Derive the per-token speed from a spec sheet (second sketch below).
- Stage 3 · 14 min skim
Why does an 80 GB GPU only serve 30 concurrent users?
Weights fit in a third of the VRAM. The rest fills with KV cache faster than anyone expects. Derive the capacity cliff that limits every long-context deployment (third sketch below).
- Stage 4 · 8 min skim
Which lever do you reach for?
An organised catalogue of inference levers — quantisation, attention variants, MoE, parallelism, batching, engines — grouped by what they actually move in the decode equation (fourth sketch below).
- Stage 5 · 20 min skim
Given my SLO and budget, what should I run?
Take a real product spec and walk it to a defensible deployment. The decode and capacity equations from earlier stages now do the work — every choice is one line of arithmetic, not a guess (fifth sketch below).
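
Sketch 1 · Stage 1's price asymmetry in rough numbers. Everything here (an H100-like spec, 40% sustained utilisation, a 70B fp16 model, batch size 1) is an illustrative assumption, not the course's exact scenario:

```python
# Why output tokens cost more than input tokens, to first order.
# Assumed spec, for illustration only: ~1e15 FLOP/s dense BF16 and
# ~3.35e12 B/s of HBM bandwidth (H100-like).
P = 70e9                   # parameters
flops = 1.0e15 * 0.4       # sustained FLOP/s at an assumed 40% utilisation
bw = 3.35e12               # HBM bytes/s
weight_bytes = P * 2       # fp16 weights

# Prefill: the whole prompt in one pass, compute-bound, ~2P FLOPs per token.
prefill_tps = flops / (2 * P)

# Decode: one token per pass, every weight byte re-read from HBM (batch 1).
decode_tps = bw / weight_bytes

print(f"prefill: ~{prefill_tps:,.0f} tok/s")              # ~2,857
print(f"decode:  ~{decode_tps:,.0f} tok/s")               # ~24
print(f"raw cost ratio: ~{prefill_tps/decode_tps:.0f}x")  # ~119x
# Batching amortises decode's weight reads across users, which is why the
# posted price gap lands nearer 4x than 100x.
```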
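Sketch 2 · Stage 2's per-token decode speed from a spec sheet, assuming an H100-like 3.35 TB/s of HBM bandwidth, fp16 weights, batch size 1, and ignoring KV-cache reads for the floor estimate:

```python
# Bandwidth-bound decode floor: one full weight read per generated token.
BW = 3.35e12  # HBM bytes/s (H100-like, assumed)

def decode_tps(params, bytes_per_param=2):
    return BW / (params * bytes_per_param)

tps_8b, tps_70b = decode_tps(8e9), decode_tps(70e9)
print(f" 8B: ~{tps_8b:.0f} tok/s")         # ~209
print(f"70B: ~{tps_70b:.0f} tok/s")        # ~24
print(f"ratio: {tps_8b/tps_70b:.2f}x")     # 8.75x from weight bytes alone;
# the 70B model's bigger KV cache pushes the real gap wider still.
```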
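Sketch 3 · Stage 3's capacity cliff. The model is a Llama-2-13B-like config in fp16, chosen only because its weights fill roughly a third of 80 GB; the context length and the decision to ignore activation memory are further assumptions:

```python
# How fast the KV cache eats an 80 GB GPU.
vram = 80e9
weights = 13e9 * 2                 # ~26 GB fp16 weights, ~a third of VRAM

n_layers, n_kv_heads, head_dim = 40, 40, 128              # Llama-2-13B-like
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2   # K and V, fp16

context = 2048                     # tokens held per user (assumed)
kv_per_user = kv_per_token * context
free = vram - weights              # activations and overheads ignored

print(f"KV per token: {kv_per_token/1e6:.2f} MB")        # ~0.82 MB
print(f"KV per user:  {kv_per_user/1e9:.2f} GB")         # ~1.68 GB
print(f"concurrent users: {free/kv_per_user:.0f}")       # ~32
```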
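Sketch 4 · Stage 4's lever catalogue as data. The decode floor per token is roughly (weight bytes read + KV bytes read) / effective bandwidth, and each lever moves one term; the grouping below is this overview's rough framing, not the lesson's exact taxonomy:

```python
# Each inference lever, filed under the decode-equation term it moves.
levers = {
    "weight bytes read": [
        "weight quantisation (int8 / int4)",
        "MoE (only the active experts' weights are read)",
    ],
    "KV bytes read": [
        "attention variants (GQA / MQA / MLA)",
        "KV-cache quantisation",
        "sliding-window attention",
    ],
    "effective bandwidth": [
        "tensor parallelism (aggregate HBM across GPUs)",
    ],
    "amortisation of reads": [
        "continuous batching (one weight read serves the whole batch)",
        "speculative decoding (one read verifies several drafted tokens)",
        "serving engines (vLLM, TensorRT-LLM) that keep batches full",
    ],
}

for term, group in levers.items():
    print(f"{term}:")
    for lever in group:
        print(f"  - {lever}")
```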
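Sketch 5 · Stage 5's one-line-of-arithmetic style as a single function. Every number in the example call is hypothetical, and the model is deliberately crude: fp16 weights, batch-1 decode with no batching credit, one weight copy, activations ignored:

```python
import math

def gpus_needed(users, tps_per_user, context_tokens, params, kv_bytes_per_token,
                gpu_bw=3.35e12, gpu_vram=80e9):
    """Lower-bound GPU count from the two binding constraints."""
    weight_bytes = params * 2  # fp16
    # Bandwidth bound: at batch 1, every generated token re-reads the weights.
    bw_gpus = users * tps_per_user * weight_bytes / gpu_bw
    # Capacity bound: one weight copy plus every user's KV cache must fit.
    cap_gpus = (weight_bytes + users * context_tokens * kv_bytes_per_token) / gpu_vram
    return math.ceil(max(bw_gpus, cap_gpus))

# Hypothetical spec: 100 users at 20 tok/s, 4k context, 70B model, 0.8 MB KV/token.
print(gpus_needed(100, 20, 4096, 70e9, 0.8e6))  # 84, bandwidth-bound;
# continuous batching (Stage 4's amortisation lever) is what collapses that number.
```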